erc-dharma / tfc-khmer-epigraphy

This repository assembles data produced by the project Corpus des inscriptions khmères (before and during the DHARMA project).
https://dharma.hypotheses.org/
Creative Commons Attribution 4.0 International

encoding of "Numeral symbols other than decimal digits" (EGD 4.2.2) #39

Open arlogriffiths opened 1 month ago

arlogriffiths commented 1 month ago

@michaelnmmeyer — in tfc-khmer-epigraphy, there is a massive number of <num> elements whose contents are made up of symbols other than decimal digits that have not been wrapped in <g type="numeral"> by the responsible encoder(s) as EGD 4.2.2 prescribes. Examples:

There are also cases like <num value="80">80</num> which look like they contain decimal digits but where the transliteration is probably a representation of a non-decimal notation system, and so ought to be <num value="80"><g type="numeral">80</g></num> (as in the 123 example above). But there is no way for a machine to tell that these are not decimal digits.

@chhomkunthea : do we ever have numbers noted with the decimal system outside of dates in the Khmer corpus? If we do not, then all such cases can automatically be converted to the encoding with <g>. You seem to have ignored EGD 4.2.2 so far. Please re-read it carefully.

Can you process the XML files and apply <g> wherever an algorithm can determine that the contents of <num> are not (exclusively) a series of decimal digits?

@danbalogh : please correct me if I have made any mistake in my representation of our encoding rules.

@chloechollet and @salomepichon: please take note of the above if you weren't aware of the rules yet.

chhomkunthea commented 1 month ago

Dear Arlo,

As far as I know, the numerals in the Khmer corpus are not written with the decimal system, except in dates. Salomé and Chloé may confirm this. Thank you for finalising the encoding of numerals, especially the number I. I will check the EG again before encoding the next inscriptions with numerals.

Best, Kunthea

danbalogh commented 1 month ago

Yes, the above notes conform to our encoding guidelines.

arlogriffiths commented 1 month ago

In that case, @michaelnmmeyer, please wrap in <g type="numeral"> all contents of <num> other than strings of 3 or 4 Arabic numerals (as such strings are liable to be dates in the first or second millennium of the Śaka era and, as Kunthea comments, Śaka dates are normally expressed with decimal digits).
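The rule just stated could be sketched roughly as follows. This is only an illustration, not the script actually used in the repository: the function name `wrap_numerals` is hypothetical, the sample input is invented, and the sketch assumes un-namespaced elements, whereas the real files are TEI and would need namespace handling. It leaves alone any <num> that already contains child markup, since those cases presumably need a human eye.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a string of exactly 3 or 4 Arabic digits is treated as a Śaka date
# and therefore left unwrapped.
DATE_LIKE = re.compile(r"[0-9]{3,4}")


def wrap_numerals(xml_text: str) -> str:
    """Wrap plain-text contents of <num> in <g type="numeral">, except date-like strings."""
    root = ET.fromstring(xml_text)
    # Materialise the list first so the tree is not modified while being traversed.
    for num in list(root.iter("num")):
        text = (num.text or "").strip()
        # Skip empty elements, elements that already contain child markup
        # (e.g. an existing <g> or <unclear>), and date-like strings.
        if not text or len(num) > 0 or DATE_LIKE.fullmatch(text):
            continue
        g = ET.SubElement(num, "g", {"type": "numeral"})
        g.text = text
        num.text = None
    return ET.tostring(root, encoding="unicode")


# Invented sample: one non-date numeral and one date-like numeral.
sample = '<ab><num value="80">80</num> <num value="923">923</num></ab>'
print(wrap_numerals(sample))
```

With this sample, only the first <num> is changed; the three-digit string is left as it is because it falls under the date exception.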

michaelnmmeyer commented 1 month ago

This is addressed in e71eaed30c889b390d14eb796ffe56c442b72fc3. There remain a number of occurrences to check and correct manually, to wit:

arlogriffiths commented 4 weeks ago

Thanks. I have converted the above into a task list and will take care of it.

arlogriffiths commented 4 weeks ago

@chhomkunthea : I don't understand the cases

arlogriffiths commented 4 weeks ago

@michaelnmmeyer:

chhomkunthea commented 4 weeks ago

Dear Arlo,

In the case of K. 915, I would like to propose below:

<num value="14"><g type="numeral">10</g> <g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>

And for K. 1017, it should be:

<num value="17"><g type="numeral">10</g> <unclear>7</unclear></num>

arlogriffiths commented 4 weeks ago

@chhomkunthea : thanks. I have implemented your suggestion in K. 915 (or rather cleaned up the file, which had some conflicts after you had implemented your suggestions). @danbalogh : do you approve of Kunthea's solution to avoid the problem that <unclear> cannot be used inside <g>?

danbalogh commented 4 weeks ago

I think I would prefer <num value="14"><g type="numeral">10</g> <unclear><g type="numeral">IIII</g></unclear></num> because, if it were clear, the encoding of the latter part would be IIII. That tells me that "IIII" is interpreted as a single numeral glyph, and if part of that is unclear, then the whole glyph is unclear. I hope you understand what I'm trying to say here; it's a bit difficult to express. It's analogous to how an Indian numeral 3 might be written as three lines one below the other, ≡: if one or two of those lines were unclear, you would put the 3 in unclear tags without trying to indicate that in fact the top bar is clear and the other two are not.

Viewed the other way round, if we removed only the <unclear> from <num value="14"><g type="numeral">10</g> <g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>, we'd be left with <num value="14"><g type="numeral">10</g> <g type="numeral">I</g><g type="numeral">III</g></num>, which I believe does not really make sense.

That said, I do understand that Kunthea's rationale in choosing the above encoding was to show that the first bar is clear and the other three are not, and I don't have a strong objection to that. So if you are happy with that solution, I think it can stay. I don't suppose we want, at this stage, to revise the encoding of these numeral bars to say that only one I can ever be wrapped in g, and that I must be iterated for every single bar. That would be the only way I see that would allow us to encode that only the latter 3 bars are unclear, but simultaneously also to keep the encoding rigorous (so that the unclear tag can be removed without requiring that the remaining text be rewritten).

Thanks for bearing with me. I've been thinking aloud. Bottom line: out of 3 alternatives [1, use unclear around a g with four bars; 2, keep Kunthea's way; 3, revise the encoding method], my order of preference is 1-2-3.
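The "remove the unclear tag and see what is left" test can be tried mechanically. The following is a minimal sketch under stated assumptions: the helper names `strip_unclear` and `glyphs` are hypothetical, the elements are un-namespaced, and the two strings are simply alternatives 1 and 2 as quoted in this thread.

```python
import re
import xml.etree.ElementTree as ET


def strip_unclear(xml_fragment: str) -> str:
    """Remove <unclear> start and end tags, keeping whatever they enclose."""
    return re.sub(r"</?unclear>", "", xml_fragment)


def glyphs(xml_fragment: str) -> list:
    """Return the text of every <g> element, in document order."""
    return [g.text for g in ET.fromstring(xml_fragment).iter("g")]


# Alternative 1: <unclear> around a single <g> holding all four bars.
option_1 = ('<num value="14"><g type="numeral">10</g> '
            '<unclear><g type="numeral">IIII</g></unclear></num>')
# Alternative 2: one clear bar, then three unclear bars in a separate <g>.
option_2 = ('<num value="14"><g type="numeral">10</g> '
            '<g type="numeral">I</g><unclear><g type="numeral">III</g></unclear></num>')

for option in (option_1, option_2):
    print(glyphs(strip_unclear(option)))
```

For alternative 1 the remaining glyphs are 10 and IIII; for alternative 2 they are 10, I and III, which is the three-glyph reading of 14 that the comment above finds hard to justify.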