<num> with numbers written in letters

ajaniak commented 4 years ago

Dear @chloechollet and @chhomkunthea,

In some of the XML files, you have used the tag <num> to encode numbers written in letters. The EG states that only numbers written with Arabic numerals or symbols should be encoded.

If you need to keep such an encoding, let me know, I will figure something out for you.

Best, Axelle

chhomkunthea commented 4 years ago

Dear Axelle,

I think we need to keep the encoding. But let’s hear from Chloe and Arlo too.

Have a nice evening, Kunthea

chloechollet commented 4 years ago

Dear Axelle, I also think that we have to keep the encoding, as we decided to distinguish between numbers written with figures (which will appear as Arabic numerals in our editions) and those written with "sticks", like "III" for the number 3. Shall we modify something to make this clearer?

arlogriffiths commented 4 years ago

If that is what we are talking about (numbers like I, II, III encoded with <num>), then indeed we need to keep the mark-up.

When Axelle told me about ‘numbers written in letters’ marked up with <num>, I thought she was talking about chronograms.

Axelle: please give some concrete examples of the phenomenon of ‘<num> with numbers written in letters’, so we can decide whether <num> should stay or go.

ajaniak commented 4 years ago

I am not talking about symbols such as III. However, some of the contents of <num> are neither numbers nor symbols. If you want to keep encoding them, you have to ask Daniel to change his encoding guide.

arlogriffiths commented 4 years ago

so please point us to a few such cases, Axelle, in order that we all know what we are talking about

arlogriffiths commented 4 years ago

@ajaniak : shall we try to do what is necessary to close this issue? Please give us a few examples of the cases you have seen.

ajaniak commented 4 years ago

Dear all,

K 215: like <num value="8">praṁpiya</num> and <num value="4">pvān</num> (6 cases identified)
K 216-S: like <num value="3">piy∙</num> (3 cases)
K 607: <num value="1">moy·</num> (1 case)
K 1238: vyara (27 cases, not counting <num value="557">slik· I 100 40 10 7</num>, which combines words and numeral signs)
K 1240: <num value="616">ṣodaśottaraṣaṭśata</num> (1 case)
K 1256: <num value="20">bhaiḥ</num> (1 case)

arlogriffiths commented 4 years ago

Thanks a lot Axelle. You are right that all these examples ignore the rules explicitly formulated in EG §7.1/Numbers expressed in words. However, @danbalogh , I think these examples force us to reconsider and refine our rules.

All of these examples, except the one from K. 1240, come from quantified lists of items owned by or given to a certain person or institution. The case from K. 1238 shows that the calculation of @value needs to take words into account (slik means 400).

Do you, @danbalogh, see any way we could allow cases in quantified lists, especially composite numbers expressed partly in words and partly in number signs? We could still maintain prohibition of applying <num> to chronograms.
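To make the K. 1238 case concrete, here is a minimal Python sketch of how @value could be derived when a numeral expression mixes words and number signs. The word table (only slik = 400 is taken from this thread), the reading of a tally mark after a number word as a count of that word, and the function names are assumptions for illustration; none of this is prescribed by the EG.

```python
# Hypothetical table of number words; only slik = 400 comes from this thread.
NUMBER_WORDS = {"slik": 400}


def tally_value(token):
    """Return the value of a tally-mark token (I, II, III, ...), else None."""
    return len(token) if token and set(token) == {"I"} else None


def compute_value(expression):
    """Sum a mixed numeral expression such as 'slik· I 100 40 10 7'.

    Assumption: a tally mark directly after a number word counts units of
    that word (slik I = 1 x 400); every other token is simply added.
    """
    total = 0
    pending_word = None  # value of a number word still waiting for a count
    for raw in expression.split():
        token = raw.rstrip("·∙")  # drop the trailing dot of the transliteration
        if token in NUMBER_WORDS:
            if pending_word is not None:
                total += pending_word  # previous word had no explicit count
            pending_word = NUMBER_WORDS[token]
        elif tally_value(token) is not None:
            if pending_word is not None:
                total += pending_word * tally_value(token)
                pending_word = None
            else:
                total += tally_value(token)
        elif token.isdigit():
            if pending_word is not None:
                total += pending_word
                pending_word = None
            total += int(token)
        else:
            raise ValueError(f"unknown numeral token: {raw!r}")
    if pending_word is not None:
        total += pending_word
    return total


print(compute_value("slik· I 100 40 10 7"))  # 400 + 100 + 40 + 10 + 7
```

A check of this kind could flag <num> elements whose @value disagrees with their contents, but only for expressions whose words are all listed in the table.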

danbalogh commented 4 years ago

I have no objection to people putting <num> around words and I'm happy to permit it in the EG if you want to do it. My aversion to doing so is based on the following considerations:

1. I'm not sure we'll ever put it to some purpose, and unless you are quite sure we will, I would prefer to avoid this complication.
2. I'm sure there will be a number of complicated cases, including (but probably not limited to) the following:
- words not belonging to the numeral expression mixed up with the numeral words
- containers such as <l> interrupting a numeral expression
- uncertain/ambiguous interpretation of numerical expressions
- lacunae interfering with such expressions

Some of the above concerns are in fact already in the EG text. If you say we need this option, and can accept that we cannot plan for every case that may occur and will probably encounter situations where encoding will not be possible or will have to be done in an arbitrary and ad hoc fashion, then I'll revise the EG text and say this can be done optionally.

danbalogh commented 4 years ago

Here's a good example of a more complex one, from CalE05-Aihole-Pulakesin2 in Badami Calukya epigraphy:

<lg n="34" met="anuṣṭubh">
   <l n="a">pañcāśatsu kalau kāle <space/></l>
   <l n="b">ṣaṭsu pañca-śatāsu ca</l>
   <l n="c">samāsu samatītāsu <space/></l>
   <l n="d">śakānām api bhū-bhujāM<g type="symbol" subtype="dash"/></l>
</lg>

arlogriffiths commented 3 years ago

@danbalogh and @ajaniak : has the situation evolved at all since last year? It still seems to me it makes sense to allow use of <num> for such cases as <num value="557">slik· I 100 40 10 7</num>, without making it mandatory on any numeral expression that wholly or partly consists of words.

danbalogh commented 3 years ago

My stance is the same as it was: I don't think it is a good idea, but if you want it, I don't mind putting it in the EG. But we will not be able to come up with objective rules for handling all sorts of complex cases (see e.g. my verse example above), and if this sort of thing will be optional and to be handled on an ad hoc basis, then I don't really see what advantage it might serve (e.g. research, display?). I have written up a possible alternative to EGD §7.1.4 - Arlo, please have a look there and see if you like it. Also please reply to my comment there. We could limit it to allow this encoding only for combinations of numerals on words (which may be what you have in mind now), and thereby reduce the twilight zone, but that seems like a very arbitrary restriction to me.

arlogriffiths commented 3 years ago

Thanks Dan.

- In terms of research questions that might justify allowing encoding of numeral expressions formulated in words, I can imagine people wanting to query our data for certain types of measurement and the values assigned thereto: "give me all donations of ghee for more than 20 kg". This might be more relevant in some vernacular-language corpora than in Sanskrit inscriptions.
- From that point of view, I am not sure it is a good idea to propose a limitation to combinations of numerals and numeral words (I suppose that is what you intended to write).

I will now look at your stub in EGD 7.1.4.

danbalogh commented 3 years ago

Yes, I did mean combinations of numeral signs and numeral words. For that kind of research question, using <measure> (EGD §7.4.4) is much better suited, as it can work independently of num, specify "ghee" and use "kg". At any rate, I do agree that once we accept <num> for number-words, then we should allow it in general, and not only when combined with numeral signs.

We could even go the whole hog and use <num> within <num> for bhūtasaṁkhyā, to encode the value of each word separately in addition to encoding the value of the whole, e.g.

śākeṣv abdeṣu yāteṣv atha <num value="814"><num value="14">manu</num>-<num value="8">vasu</num></num>-saṁprāpta-saṁkhyeṣu meṣe

My only problem remains that encoding multiple numeral words together is complicated, and it will run into problems like that in my example above, which I see no way of solving apart from using a linking mechanism involving @xml:id, which takes the encoding to a whole new level of complexity.
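As a rough illustration of what nested <num> would make possible, the following Python sketch reads the manu-vasu example and compares the outer @value with a value composed from the inner ones. It assumes the usual bhūtasaṁkhyā convention that the first word names the lowest-order part, so the inner values are joined in reverse order; the function name is hypothetical, and the fragment is parsed without the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def check_nested_num(xml_fragment):
    """Return (declared, composed) for a <num> wrapping inner <num> elements."""
    outer = ET.fromstring(xml_fragment)
    inner_values = [child.get("value") for child in outer.findall("num")]
    # Assumed convention: the first word is the lowest-order part, so the
    # inner values are concatenated in reverse order of occurrence.
    composed = "".join(reversed(inner_values))
    return outer.get("value"), composed


fragment = (
    '<num value="814"><num value="14">manu</num>-'
    '<num value="8">vasu</num></num>'
)
declared, composed = check_nested_num(fragment)
print(declared == composed)
```

Such a check only works when the whole expression sits inside one contiguous outer element, which is exactly the limitation raised above.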

arlogriffiths commented 6 months ago

@michaelnmmeyer @danbalogh : it would probably be good to bring this old discussion to a close. Maybe one way to start would be to ask Michael to generate a list of all cases where we have non-number (and non roman numeral) contents of <num>.
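A list of this kind could be produced along the following lines. This is only a sketch under assumptions: the function name is made up, the input is passed as XML strings rather than read from the repositories, and "number" is taken to mean contents consisting solely of Arabic digits and spaces, so Roman-numeral tallies such as III are still reported.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tally_num_contents(xml_strings):
    """Count the text contents of <num> that are not purely Arabic digits."""
    counts = Counter()
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        for num in root.iter(TEI_NS + "num"):
            # Flatten child markup and normalise whitespace.
            content = " ".join("".join(num.itertext()).split())
            if re.fullmatch(r"[0-9 ]+", content):
                continue  # digits only: not of interest here
            counts[content] += 1
    return counts


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><p>'
    '<num value="8">praṁpiya</num> <num value="3">III</num> '
    '<num value="120">100 20</num> <num value="3">III</num>'
    '</p></TEI>'
)
for content, n in tally_num_contents([sample]).most_common():
    print(n, content)
```

In real use the strings would come from the encoded files, and a further filter on tally marks could be added if those are to be excluded as well.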

danbalogh commented 6 months ago

@arlogriffiths , it is not clear what you are asking us to do or why you think such a list generated by Michaël may help. I've read through the issue from the beginning, and the way I see it is that this discussion was brought to a close back then, only nobody pressed the Close button.

The EGD (§7.1.4) has already been revised (on 17 May 2021 according to my comment there) to permit text within <num>. I don't recall whether you had offered comments on my stub back then (can check if it's really important to you, but would rather not try to find this among the thousands of completed comments), but at any rate, the revision was finalised almost two years ago, and I probably would not have done that unless you had in some way affirmed that this was what you wanted.

My concerns remain what I stated above. I can live with these, but you need to be aware that you and everyone else will need to live with them too, and living with them includes not calling on me to devise new ad-hoc solutions every time a complication turns up. The concerns are:

1. There will always be cases where the contents of the <num> element will include words that have nothing to do with numbers. There is nothing we can do about that, unless we want to start using multiple num tags and linking them with xml:id. I want to avoid that, so we'll have to live with either having non-numeral meanings within the num tag, or not tagging the numeral expression unless it is spatially contiguous.
2. There will also be cases where the num markup intersects with other markup (e.g. unclear or supplied). We can deal with this by fragmenting the other tags where necessary, but doing so will probably complicate all kinds of processing including display.
3. Worse, there will also be cases of overlapping hierarchies where a number expressed in words extends over a metrical boundary (e.g. from one verse line to the next). See my example of 14 April 2020 above. Again, there's nothing we can do about that unless we split tags and link them, which I want to avoid, so in such cases we'll either have to accept tagging only part of a numeral expression (so that some of the actual numeral expression will be outside the tag), or not tagging the numeral expression at all.
4. I foresee no advantage as regards processing or research. The kind of research question you give as an example above should be catered for by encoding commodities and measures. This is admittedly kind of sweeping the matter under the carpet, since that encoding (and any kind of semantic encoding in general) faces the same 3 problems outlined above: the semantic markup will overlap and interfere with the philological markup. Possible solutions for that include fragmenting the semantic markup and linking its parts with xml:id; or using milestone-like empty elements for it; or using standoff markup. All are complicated, and all but the last will make our files much less human readable. I strongly think that this is not something we should try to do at the moment. Adding all sorts of semantic details to the files promises great research for the future, but we cannot do everything in one go. Devising a sustainable way to add semantic annotation to DHARMA-flavour EpiDoc files could be a separate minor project for two or three people over two years or so.

So our options at the moment are:

A. leave things as they are now, accept that there will be fuzzy cases, and close this thread; or
B. give up encoding <num> on words and revert to what we had in the EGD before May 2021 (perhaps suggesting that measure could be used for quantities and commodities); or
C. reopen the discussion and spend dozens to hundreds of hours working out something that does better justice to the contents, but will come at the cost of immensely complicating our markup.

The above order of options is my order of preference.

arlogriffiths commented 6 months ago

Thanks Dan. I am sorry that some such discussions go through such a weird process before being concluded. The fact that we all have a lot of work on our plate has something to do with it. I don't remember what I may have commented on EGD 7.1.4 at an earlier stage and there's certainly no need to dive into the version history of the gdoc to find out.

Having re-read the discussion above, as well as the intro to EGD 7.1.4, I am struck by the absence of a clear definition of the purpose of our use of <num>. I haven't checked what TEI and EpiDoc say about it. I would, before being reconfronted with this discussion, instinctively have responded that we use <num> in connection with the history of scripts, i.e. making it possible to assemble data for the study of the number systems that were in use (decimal place value or not) and on the graphic shapes used to express positions in the respective systems. If that answer is at least partly correct, then our EGD rule "when a glyph that would normally be a numeral sign is used in a function other than to represent a number (such as the glyph normally meaning 1, occasionally used as an auspicious opening mark), then the <num> tag must not be added to it (§4.2.7)" might not make perfect sense. Again, if the above answer is at least partly correct, then my preference expressed a few times in earlier iterations of this discussion (though never as a hard imperative) to allow people putting <num> around words might not have been well considered.

I'd like to know how you view the rationale for our use of <num>. Depending on your answer, I might prefer A or B among the options you give above.

My request for a list from @michaelnmmeyer was intended to allow us to determine how many instances of this use of <num> we actually have in the inscriptions encoded so far. Surely, if we have just a few handfuls, we will more easily opt for B than in case we have thousands.

michaelnmmeyer commented 6 months ago

We have:

   8979 I
    497 II
    231 III
     93 IIII
     31 IIIII
     21 
     14 sa
     13 IIIIII
     10 tluṁ
      9 X
      7 vyara
      7 mvāya
      7 dvaya
      5 IIIIIII
      4 ruA
      4 ½
      3 rla
      3 pataṁ
      3 panneraḍu
      3 pañca
      3 mūṟu
      3 daśa
      2 XII
      2 vyar·
      2 ṣoḍaś
      2 sārddha
      2 ruAṁ puluḥ
      2 praṁpiya
      2 pataṁ puluḥ
      2 mvāy·
      2 mūvattu
      2 mūru
      2 IIIIIIIII
      2 gra
      2 eṁṭu
      2 dvayā
      2 daśamĭ
      2 catuḥ-sahasra
      1 XI
      1 vvalu puluḥ
      1 vuAluṁ puluḥ
      1 tri
      1 trayo
      1 tai rat· III
      1 ṣodaśottaraṣaṭśata
      1 sī rat· III
      1 sāyiradanūraṁ
      1 sāyira
      1 ṣaṣṭi
      1 sā rutuḥ limā pluḥ sā
      1 sārddha-nava
      1 sapta-pañcāśad-anvita-catuś-śata
      1 rvaṁ
      1 raḍu
      1 radiḻnūṟu
      1 pvāna
      1 pvān
      1 pratipāda
      1 praṁvyal·
      1 prāṁm·
      1 ppannircchāsiram
      1 piy·
      1 panneraḍumann
      1 panne
      1 pañcadaśi
      1 pādona-ṣaṭṣśata
      1 pādā
      1 ondu
      1 nūrayvatt’
      1 nava
      1 mvaya
      1 mūvatteraḍum
      1 mūvattaṁ
      1 mūnūṟu
      1 mūnūṟayvattu
      1 kulya
      1 katlu
      1 katiga
      1 Isī
      1 irppatan
      1 irpattu nālku
      1 IIIIIIII
      1 I ½
      1 eṇchāsiraṁ
      1 eṇchāsiram
      1 eṁṭunūṟu
      1 eṁṭunūru
      1 Eḻunūṟayvattu
      1 eḻpattumaṁ
      1 Ekădaśi
      1 dvāviṁśa
      1 droṇa
      1 daśami
      1 caturtha
      1 bpataṁ
      1 bhai mvāya
      1 bhai mvāy·
      1 bhaiḥ
      1 ayvadiṁbaruṁ
      1 aynūṟuvaṁ
      1 aynūṟu
      1 aynūru
      1 Aṣṭami
      1 asṭami
      1 aṟuvattu
      1 āṟu
      1 a hundred
      1 āḍhavāpa

danbalogh commented 6 months ago

In my view, our encoding of <num> is primarily semantic, not palaeographic. This is also why it is in section 7 "Additional information" in the EGD, and not in section 4 "The originally inscribed text". The TEI definition is: "contains a number, written in any form". In that respect, permitting it around words does make sense, and is TEI-approved. For palaeographic studies, <g type="numeral"> is more appropriate, but we don't use that around digits 1-10. However, I think doing a simple search for a particular number (or a wildcard search for any number) should be feasible for palaeographic studies of numeral signs, since I assume that the search will (or can be made to) ignore editorial numerals such as line and stanza numbers. All in all, to my mind, the principal usefulness of <num> is in additive numbers where e.g. 100 + 20 can be tagged with the number 120. But to be honest, I haven't pondered the purpose, and I've always just accepted this encoding as a given, since it's part of the EpiDoc guidelines (which I've just checked, and which also say nothing about why it's good to do this). The potential to use it for studying whether numbers are written additively or in decimal place value is, I think, present whether we allow the tag around text or not; it would have to be used in combination (e.g. "do we have numeral characters [0-9] in a <num> tag without also being in a <g> tag?"; "do we have a <g> tag inside a <num> tag?").

Given this and the large number of cases in Michaël's list, I continue to think we should keep things as they are now.
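The two combined queries mentioned above could be sketched as follows in Python. The function names are hypothetical, and the fragments (taken from examples elsewhere in this thread) are parsed without the TEI namespace for brevity; in the actual files the element names would be namespaced.

```python
import re
import xml.etree.ElementTree as ET


def text_outside_g(elem):
    """Collect text inside elem, skipping the contents of any <g> descendant."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "g":
            parts.append(text_outside_g(child))
        # A child's tail is still inside elem, whatever the child is.
        parts.append(child.tail or "")
    return "".join(parts)


def classify_num(xml_fragment):
    """Answer both questions for a single <num> element."""
    num = ET.fromstring(xml_fragment)
    return {
        "has_g": num.find(".//g") is not None,
        "digits_outside_g": bool(re.search(r"[0-9]", text_outside_g(num))),
    }


examples = [
    '<num value="10"><g type="numeral">10</g></num>',
    '<num value="11"><g type="numeral">10</g> 1</num>',
    '<num value="557">slik· I 100 40 10 7</num>',
]
for fragment in examples:
    print(classify_num(fragment), fragment)
```

Neither query depends on whether words are allowed inside <num>, which is consistent with the point made above.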

arlogriffiths commented 6 months ago

Thanks Dan. Reading your response, I think there might be potential use in considering doing without <num> altogether. This would significantly lighten the burden on encoders. Before taking any such radical step, it would be necessary to consult MARKUP on what benefit the community sees in this element's use.

Comments on the above list:

- Several entries, esp. those of highest frequency, are cases of use of 'roman numerals' in representing Khmer numerals, as prescribed in TG: they are not number words as understood in this discussion.
- There seem to be some Sanskrit terms (āḍhavāpa, droṇa, kulya, pratipāda, pādā) that I would under no circumstance classify as number terms, and I think the cases must be sought out and probably corrected.
- There are also several ordinal numbers in some languages that I know (katiga, caturtha, etc.), which raise the question whether <num> is supposed to be applicable to ordinals. Does our guide say anything about it?
- There is an English term (hundred) which I don't suppose requires any <num> wherever it may occur.
- And there are lots of Tamil (or other Dravidian) terms which only @manufrancis and @AnneSchmiedchen can comment on.
- It would be useful if @michaelnmmeyer could specify the file of occurrence, at least for the cases that occur only once.

danbalogh commented 6 months ago

The TEI guidelines explicitly say that ordinals can be encoded with <num>. Our guide says nothing about this. Thinking about it, they certainly should be included: "in the year fifty-eight" and "in the fifty-eighth year" mean the same after all, so it would be really bizarre to tag only the former. I agree with the rest of your observations.

manufrancis commented 6 months ago

As for my practice with <num> I use it in the following cases:

<num value="10"><g type="numeral">10</g></num> = "10"
<num value="11"><g type="numeral">10</g> 1</num> = "11"
<num value="10"><g type="numeral">10</g></num>-Āvatu = "10th"
<num value="11"><g type="numeral">10</g> 1</num>-Āvatu = "11th"

There is never text/word inside my <num>

As for the options suggested by Daniel above, excluding C, as I fully agree that we do not need to

spend dozens to hundreds of hours working out something that does better justice to the contents, but will come at the cost of immensely complicating our markup

I would opt for A. leave things as they are now, accept that there will be fuzzy cases, and close this thread

I have however no objection to B (give up encoding <num> on words and revert to what we had in the EGD before May 2021, perhaps suggesting that measure could be used for quantities and commodities), as I do not foresee that I will encode words with <num>, provided it is not too much extra work.

erc-dharma / tfc-khmer-epigraphy

<num> with numbers written in letters #6