erc-dharma / project-documentation

DHARMA Project Documentation

Creative Commons Attribution 4.0 International

Encoding of languages and scripts #226

Closed: michaelnmmeyer closed this issue 5 months ago

michaelnmmeyer commented 1 year ago

I have a few observations to make on the encoding of languages and scripts. (See the bottom of this post for references.)

Language codes

The now-obsolete x-something codes can be removed in favor of the new ones in the latest ISO 639-3. The mapping is:

x-oldcham   ocm
x-oldkhmer  okz
x-oldmalay  omy
x-oldsun    osn

We have a custom language code unknown, but there is already und (for "undetermined") in the standard. We could advise people to use this instead.
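The code replacement described above could be applied mechanically. A minimal sketch, assuming only the mapping given in this post plus the proposed unknown to und substitution (which is a suggestion, not an agreed rule); the name normalize_language_code is hypothetical, and any script subtag such as -Latn is passed through unchanged:

```python
# Obsolete project-specific codes and their replacements, as listed in this post.
OBSOLETE_CODES = {
    "x-oldcham": "ocm",
    "x-oldkhmer": "okz",
    "x-oldmalay": "omy",
    "x-oldsun": "osn",
    "unknown": "und",  # proposal only: "und" is the standard code for "undetermined"
}


def normalize_language_code(value):
    """Replace an obsolete code at the start of an @xml:lang value.

    A trailing script subtag (e.g. "-Latn") is kept as it is.
    """
    for old, new in OBSOLETE_CODES.items():
        if value == old or value.startswith(old + "-"):
            return new + value[len(old):]
    return value
```

Values that do not start with one of the obsolete codes are returned untouched, so the function can be run over every @xml:lang in a file without further checks.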

We use the codes btk and pra. They do not belong to ISO 639-3 but to ISO 639-5. This could be made clear in the table. And it could also be made clear that people should use these two codes instead of the various Batak and Prakrit language codes in ISO 639-3.

Our texts currently contain many invalid values for @xml:lang. I will thus close the list of languages and expand it as needed.

I propose to update the current list of language codes with:

ara
ban
btk ISO 639-5
cja
cjm
deu
eng
fra
ind
jav
jpn
kan
kaw
khm
mya
ndl
obr
ocm
okz
omx
omy
ori
osn
pli
pra ISO 639-5
pyx
san
sas
tam
tel
tgl
und
vie
xhm
zlm

And I propose to add a table of script codes. The relevant ones for now are:

Gran
Latn
Taml
Thai
Zyyy
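A closed list makes it straightforward to flag invalid @xml:lang values. The following is only a sketch under the assumptions of this post: values have the shape xxx or xxx-Xxxx, and the two lists above are the only accepted codes. The function name check_xml_lang and the error strings are hypothetical.

```python
import re

# The proposed closed lists, copied from this post.
LANGUAGES = frozenset(
    "ara ban btk cja cjm deu eng fra ind jav jpn kan kaw khm mya ndl obr ocm "
    "okz omx omy ori osn pli pra pyx san sas tam tel tgl und vie xhm zlm".split()
)
SCRIPTS = frozenset("Gran Latn Taml Thai Zyyy".split())

# Three-letter language code, optionally followed by a four-letter script code.
TAG = re.compile(r"([a-z]{3})(?:-([A-Z][a-z]{3}))?")


def check_xml_lang(value):
    """Return None if the value is acceptable, else a short description of the problem."""
    match = TAG.fullmatch(value)
    if match is None:
        return "malformed"
    language, script = match.groups()
    if language not in LANGUAGES:
        return "unknown language"
    if script is not None and script not in SCRIPTS:
        return "unknown script"
    return None
```

Note that a two-letter value such as en-Latn is reported as malformed by this sketch, since the proposed list only contains three-letter codes.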

Tagging languages and scripts

I have a few issues with the current encoding:

It is not possible to tell the script a passage is written in
It is not possible to tell the language (and the script) of passages within <foreign>
We have many ways to indicate languages and scripts, the purpose of which is unclear to me:
1. <someTag xml:lang="xxx-Xxxx">
2. <foreign> (implicit language and script)
3. <foreign xml:lang="xxx-Xxxx">
4. <someTag rend="grantha"> (why not use Gran from ISO 15924?)
5. <hi rend="grantha">
6. <langUsage> (in the EGC, but not in the EGD)

So far I can only formulate two rules:

The @xml:lang of the root element, if not specified, is assumed to be en-Latn
Elements that do not have a @xml:lang inherit the @xml:lang of their parent.

There are supposed to be exceptions to these rules (related to <foreign>, <note> and <langUsage>), but it is unclear to me how they should interact. Furthermore, I cannot tell the distinction between:

The script used in the original inscription, manuscript, etc.
The script the passage is encoded in within the TEI edition
The script that should be used for display on the Website

One further remark about scripts. The EGD state: "language tags without a script code will by default be assumed to be in a native script associated with the language in a given region and time period." Instead of this, I propose to state that the script is assumed to be Latn unless explicitly specified, because:

In practice, people forget to add the -Latn script code. I found only half a dozen cases where the script is not mentioned and is actually non-Latin.
It is impossible to reliably tell which script is used if it is not explicitly indicated in @xml:lang.
It is not possible to indicate a script that does not exist in Unicode.
It is more economical not to add -Latn all the time, since Latin is used almost everywhere.

References

Guides: EGD §10.3 and EGC §8.10. Tables: EGD p. 146 and EGC p. 160.

ISO 639-3 (languages): https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab (also see https://en.wikipedia.org/wiki/ISO_639-2#Special_situations)
ISO 639-5 (language families): https://www.loc.gov/standards/iso639-5/id.php
ISO 15924 (scripts): https://www.unicode.org/iso15924/iso15924.txt

arlogriffiths commented 1 year ago

I will be happy for @danbalogh to respond on most points.

Just one point from me: as far as I recall, we have accepted early on that our database will not allow determining what is the current script in any given part of a file. Our decision was probably guided by the following considerations:

script classification is a major (and imperfectly resolved) challenge
by far the majority of inscriptions are in a single script
or, if not, they have a predictable matching of language to script

@michaelnmmeyer : have you taken note of how we are proposing to distinguish Script class and maturity and of the vocabularies for these developed on OpenTheso? Have you seen the Metadata Memo and Guide?

danbalogh commented 1 year ago

I'll take this on piecemeal, starting with the tags themselves. First, please tell me what table this is all about. Is there or will there be a publicly available list of the language tags that we use in the project? If yes, I'd like to have the URI because this makes Appendix D of the EGD obsolete: we can just refer to that central list instead of maintaining an appendix.

und for undetermined language sounds good to me; in fact, I was not aware that we had a custom language tag unknown. I have no idea if our codes btk and pra belong to ISO 639-2 or ISO 639-5; I don't really understand the difference between these two. (Is it that 639-5 is hierarchical while 639-2 is not?) At any rate, we do indeed need to point out that these tags are not from ISO 639-3.

ISO script tags are deliberately avoided. See EGD 10.3.1. I do not think we need any of the script tags you list but -Latn. I'm not even sure about -Latn, but we have agreed to attach it to non-European languages to make it explicit that they are in Romanised transliteration.

danbalogh commented 1 year ago

On tagging languages and scripts. In general, I think it would have been a good idea to carefully read the bits of the EGD and EGC where the relevant tags and codes are mentioned before asking these questions.

"It is not possible to tell the script a passage is written in" - what do you mean by a passage here, and what do you mean by written in? Almost everything in our XML files is written in the Latin alphabet, but the edition divisions concern text originally inscribed or written in a historic alphabet. We have in fact three potentially independent items that need to be shown.

Language of the contents of an XML element.
Script of the contents of an XML element.
In the edition division (and in text cited elsewhere from the edition division): script of the original inscription or manuscript.

The script that should be used for display on the website is the script of the element contents (unless we get to the stage where we can display editions in an Indic script, which I think we are not planning to do very soon). I'll try to tackle 1 and 2 first.

On the rules you formulate:

"The @xml:lang of the root element, if not specified, is assumed to be en-Latn"
- in our template, the <TEI> element explicitly specifies xml:lang="eng". I'm not aware that we have any content outside TEI, so we are making no assumptions here. Are you saying that this is insufficient and language should be specified on the root element?
Elements that do not have a @xml:lang inherit the @xml:lang of their parent.
- Of course. Cf. TEI Guidelines: "The xml:lang value will be inherited from the immediately enclosing element, or from its parent, and so on up the document hierarchy." I don't think we need to explicitly restate basic TEI rules.

We might, however, want to add a further default rule: whenever the language tag comes without an attached script tag, the script most typically used for that language is to be understood (EGD 10.3.1). I cannot point to ISO or TEI documentation in this connection, but I have the feeling that this is generally presumed elsewhere too.

So, how do we determine the language of a certain part of the XML document (this is probably what you meant by passage)? By default, it is inherited from the TEI element, but it may be explicitly stated with @xml:lang on any element. Elements that always have an @xml:lang of their own are the edition division and the translation division. Next, we come to situations where there is more than one language inside a div element. This typically occurs in one of the following situations:

within the edition div of multilingual inscriptions
- here, the edition has a default language encoded on the edition div, and passages in a different language get @xml:lang on the appropriate container(s), which may be textpart divs, <p>, <ab> or <lg> elements, or, when there is no corresponding structural container, a <foreign> element
- all of these are encoded with @xml:lang with both the language tag and the script subtag (the latter is always -Latn, since the edition is in Romanised transliteration)
foreign-language words, phrases or passages occurring in a modern international language context, typically in the commentary and the translation div
- the tag <foreign> without @xml:lang is used only in such a context (i.e. it can only occur outside the edition div; it is an error if it occurs inside the edition div), and is then understood to be in the language (and script, i.e. -Latn) encoded on the edition division.
- I admit this is not very nice, but when you are writing a commentary on a text citing dozens or hundreds of bits from that text, it is extremely tedious to have to encode language on each one of those. If you say it should absolutely not be done, then we could consider an automated transformation that would pick up any foreigns outside the edition div and add the edition div's language attribute to them. But to be honest, we don't really need that language attribute anyway; I don't think we'll ever want to do language-filtered searches on phrases in the commentary/translation, or Indic script display for Sanskrit (etc.) words cited in such a context. Basically, all we need these foreigns for is italic display.
- words, phrases or larger chunks in a modern language other than the context and the edition's language typically occur when cited from publications, e.g. there may be a bit of French, German, Hindi or Thai amidst an English discussion.
- In this case, language is encoded (with @xml:lang on a pre-existing container such as <p> or <q> if available, and on <foreign> when there is no corresponding structural container)
- script is typically not encoded in these cases, because it is implicitly assumed to be a or the typical script for that language, so e.g. if somebody is citing a Hindi publication, it is assumed to be Devanagari; if citing Italian, it is assumed to be Latin.
- script may be encoded in these cases when citing a non-international language in transliteration, e.g. someone might use Sanskrit words or phrases while discussing a Khmer inscription; in this case, the script -Latn should be added.

I think this covers all language situations outside the apparatus div. Within the apparatus, lemma and reading elements are understood by default to be in the language (and the -Latn transliteration) of the edition div, while apparatus note content inherits English from the TEI element (but <foreign> without @xml:lang is permitted in these notes too, and is then again in the language and script of the edition div).

In your list of "many ways whose purpose is unclear", this covers
1. <someTag xml:lang="xxx-Xxxx"> = pre-existing container in a language other than the language of the enclosing div
2. <foreign> (implicit language and script) = bits cited from the edition in divs other than the edition
3. <foreign xml:lang="xxx-Xxxx"> = bits in a language other than the language of the enclosing div, when there is no pre-existing structural container for that chunk

Does the use of any of the above remain unclear? Do you find any of these redundant?

danbalogh commented 1 year ago

Going on to 3 above. The script(s) used in the original inscriptions/manuscripts are not encoded with ISO subtags for the following reasons:

the ISO subtags are generally designed for modern scripts plus a small number of very specific historic scripts (e.g. Siddhamātṛkā), whereas we want a somewhat more granular distinction between varieties of what would mostly be just "Brāhmī" in an ISO notation system
we disagree with the implication that script is subordinate to language. Across our corpus, a single language (e.g. Sanskrit) can be written in dozens of distinguishable scripts, and a single inscription in a single script can contain two (or more) languages such as Sanskrit and Prakrit or Sanskrit and Telugu. Sometimes, bilingual inscriptions also use a different script for the different language, but this is far from being always the case.

The historic alphabets used in the originals are indicated by the script class and maturity in the @rendition attribute of the edition division or, where applicable, on a smaller unit within that division, such as a pre-existing container (textpart div, p, ab or lg) or, when there is no pre-existing container, on a <seg> element. In addition to this, for Tamil inscriptions there is the extra option of tagging Grantha. I'm not a Tamilist (talk to Manu if you need clarification), but as far as I understand, Grantha characters can occur individually (within a word the rest of which is not in Grantha but Tamil or Vatteluttu), and it is traditional in editions of Tamil texts to highlight them (e.g. with bold face), so we wanted a simple way of tagging them. This is the attribute @rend="grantha", encoded on a pre-existing container when one is available, and on <hi> when one is not available. This covers items 4 and 5 in your list of many ways and, I hope, makes everything clear. The reason we don't use Gran from ISO 15924 is, as above, the fact that we reject the idea that script is subordinate to language (this being a textbook example of when it is not, as Grantha characters can occur within a single Tamil or Sanskrit word the rest of which is written in another script).

danbalogh commented 1 year ago

Now to the rest. I don't know about <langUsage>.

What you cite from the EGD, "assumed to be in a native script associated with the language in a given region and time period", has been obsolete for over two years, like much of the official release 1 of the EGD. The wording has since been changed to "assumed to be in the script most typically used for the language in a modern context". The latest (insider) working version of the EGD is the one reflected in the Google Doc version. There is also the post-v1 working version on GitHub, which I update from time to time on the basis of the latest changes in the Google Doc. See EGD §10.3.1 in one of these up-to-date versions.

You propose to state that the script is assumed to be Latn unless explicitly specified. I'm not strictly against that, but I'm far from sure we would benefit.

You say, "In practice, people forget to add the -Latn script code. I found only half a dozen cases where the script is not mentioned and is actually non-Latin."
- And how many cases where -Latn was not present but should have been? People should not forget the script code, but if they really do it often, then that may be a good reason for us to make that default. The half dozen cases of non-Latin without an explicit script subtag should be occurrences of a modern Asian language in an English or other international context, I guess, like my example with a Hindi citation above.
You say, "It is impossible to reliably tell which script is used if it is not explicitly indicated in @xml:lang"
- I assume you are talking about those half dozen cases here, where a citation is e.g. encoded as being in Thai. Why is it then impossible to deduce that the script is Thai?
You say, "It is not possible to indicate a script that does not exist in Unicode."
- This is irrelevant. We don't want to indicate any scripts that do not exist in Unicode. We are talking only about the script of the element contents here, not about the script of the original.
You say, "It is more economical not to add -Latn all the time, since Latin is used almost everywhere"
- I agree. The reason we now do what we do is that many of the languages we cite in Romanised transliteration are typically written in a non-Latin script. If something were just encoded as e.g. "san", then people and machines would assume it's (modern) Devanagari, since that is the script typically associated with Sanskrit.

danbalogh commented 1 year ago

There are probably a couple of details I have not addressed, so do point them out where needed. I know there is the option of using <term> and <gloss> in an edition (EGD §7.2.2), where the term is to be understood by default to be in an unspecified foreign language. This could be improved, but I have no idea if anybody has ever used this encoding, so it may be wasted work.

And there is also the question of display, but I think the only uncertainty remaining there is when to italicise something and when not. Otherwise, language should not affect display. Anything in <foreign> should be italicised, but ideally, text in a modern Asian language (and script) should not be - we could just ignore that and italicise anyway, or work out more complex rules as needed. (If we adopt your suggestion of understanding -Latn by default, then the rule could be simpler: italicise <foreign> unless there is an explicit script subtag other than -Latn.)
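The simpler italicisation rule mentioned in parentheses can be written down in a few lines. This is a sketch of that conditional proposal only (it presupposes that -Latn is understood by default, which has not been decided); the function name italicise_foreign is hypothetical, and it takes the @xml:lang value of a <foreign> element, or None when the attribute is absent:

```python
def italicise_foreign(xml_lang):
    """Decide whether the contents of a <foreign> element should be italicised.

    Rule sketched in this thread: italicise unless @xml:lang carries an
    explicit script subtag other than "Latn".
    """
    if not xml_lang:
        return True  # no @xml:lang at all: treated as Romanised text
    _language, _sep, script = xml_lang.partition("-")
    return script in ("", "Latn")
```

With this rule, a citation tagged tam-Taml would be left upright, while san, san-Latn and an untagged <foreign> would all be italicised.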

michaelnmmeyer commented 1 year ago

@danbalogh

Thank you very much for your detailed answers. I'll take this on piecemeal as well.

List of languages

Currently, we have two different language tables, in EGD p. 146 and in EGC p. 160. I propose to update the one in the EGD and to refer to it within the EGC (instead of using another table). The list of language codes I gave above merges the contents of both tables and uses standard codes. It is a superset of the list of languages currently used in the project.

As far as I can tell, the main difference between the ISO language lists is that 639-5 enumerates language families, while 639-3 enumerates languages. For instance, we have a single code pra "Prakrit languages" in 639-5, but three codes in 639-3: pka "Ardhamāgadhī Prākrit", pmh "Māhārāṣṭri Prākrit", psu "Sauraseni Prākrit". Similarly, we have a single code btk "Batak languages" in 639-5, but eight codes in ISO 639-3 (among which bya "Batak", this is probably the one we want).

List of scripts

Concerning scripts, our TEI files are currently written in (at least) Latin, Tamil and Thai. I need to know the script a given portion of the file is written in for three functionalities:

Display the text in Latin transliteration or in an Indic script (this is a request from @manufrancis).
Search portions of the text that are encoded in an Indic script as if they were transliterated.
Filter files according to the Unicode scripts they use (this is arguably less useful).

danbalogh commented 1 year ago

@michaelnmmeyer Thanks for the clarification on the language table. I think it would actually be best to create and maintain a separate language code list in our project documentation, to which both guides could refer. This could be just a plain txt file or an xml file.

About ISO systems, my question was about the difference between ISO 639-2 and ISO 639-5. I am aware of the difference between 639-3 and 639-5. But 639-2 also seems to have pra and btk, just like 639-5. I have not checked any further than this.

About the practice of not explicitly encoding script when it is the one normally used for the given language, see e.g. https://www.w3.org/International/questions/qa-choosing-language-tags.en : "Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual to the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn)."

About your general concerns. As far as I know, portions of our XML files encoded in an Indic script should be few, short, and limited to terms, phrases or short citations in a modern Asian language, always occurring in the context of an international language (usually English), primarily in translations and commentaries, but perhaps also in other places (e.g. apparatus note, bibliography division). If this is not the case, then I don't know what the exceptions are and why they exist. If it is the case, then displaying such bits in transliteration and searching them as if they were transliterated is in my opinion very low priority. I could totally live without it, and I don't think anybody else would consider it high priority or essential. Manu's request, I'm pretty sure, was that our editions (encoded in transliteration) should be eventually displayable in Tamil script (and possibly other Indic scripts). I don't think he needs bits of Tamil (or other non-Latin-writing language) scattered in commentaries to be displayable in transliteration.

Be that as it may, if auto-transliterating bits of content written in an Indic script is important, then it should be possible to determine the script of those bits because their language should be encoded and the absence of the -Latn subtag on the language code should indicate that it is the default script for that language. All we need is an authority list matching the default script subtag to each relevant language code, if no such thing already exists (e.g. http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry looks like it could serve this purpose).
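To illustrate how such an authority list might be derived: the IANA registry marks the script that is normally left implicit for a language with a Suppress-Script field. The sketch below parses records in the registry's "%%"-separated format. The excerpt is abridged and typed by hand for illustration, so it should be checked against the live registry; the function name default_scripts is hypothetical. Two caveats: the registry uses two-letter subtags where they exist (hi, ta, th), so a mapping to the three-letter codes used in the project would still be needed, and many languages (Sanskrit among them, as far as I can tell) have no Suppress-Script at all.

```python
# Abridged, hand-typed excerpt in the registry's record format (illustration only).
REGISTRY_EXCERPT = """\
%%
Type: language
Subtag: hi
Description: Hindi
Suppress-Script: Deva
%%
Type: language
Subtag: ta
Description: Tamil
Suppress-Script: Taml
%%
Type: language
Subtag: th
Description: Thai
Suppress-Script: Thai
%%
Type: language
Subtag: sa
Description: Sanskrit
"""


def default_scripts(registry_text):
    """Map language subtags to their Suppress-Script value, where one is given.

    Folded continuation lines, which the real registry may contain, are
    simply ignored by this sketch.
    """
    scripts = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        if fields.get("Type") == "language" and "Suppress-Script" in fields:
            scripts[fields["Subtag"]] = fields["Suppress-Script"]
    return scripts
```

Languages absent from the resulting table would need an explicit decision from the project rather than a guess.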

arlogriffiths commented 1 year ago

@danbalogh and @michaelnmmeyer: FYI, the one file with Thai script that we have (DHARMA_INSCIKthaiTest.xml in tfc-khmer-epigraphy) is a test case. Some colleagues in Thai/Lao studies may want to take up encoding inscriptions in the relevant languages using our Guide, though without trying to transliterate the edited text into Roman. (There is no commonly agreed upon scientific Romanization of Thai/Lao scripts.) I was trying to show them it is possible.

michaelnmmeyer commented 1 year ago

I was indeed working with an older version of the EGD, sorry for that.

Determining languages

About the first rule I formulated:

The @xml:lang of the root element, if not specified, is assumed to be en-Latn.

This is to imply that people do not need to explicitly specify en-Latn and that they can even use a different @xml:lang for the root element if it is more convenient. When processing files, I will modify the XML in such a way that it adheres to the guidelines.

The difficulties I have concern the use of <foreign> and <note> without an explicit @xml:lang. There are corner cases (a <foreign> within a <foreign>, a <note> within a <foreign>, etc.), and the elements from which to pull the appropriate @xml:lang will vary between texts (diplomatic editions and critical editions do not have a <div type="edition" xml:lang="xxx">, for instance). I will check further what people are doing and try to deduce a common rule.

Determining scripts

I found about thirty elements where @xml:lang names a language normally written in a non-Latin script and does not specify a script. Six of them actually contain text in a non-Latin script (the others should have xxx-Latn according to the guidelines). Besides these, twenty-five elements explicitly specify a non-Latin script, but, looking further, most of them actually contain Latin-only text...

For simplicity's sake, I propose to forget about the tagging of scripts within @xml:lang. I will try to infer the actual script with Unicode tables and process the text accordingly. This is sometimes impossible, but it should work well enough in our case.
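As a rough illustration of what inferring the script from Unicode data can look like: Python's standard unicodedata module does not expose the Unicode Script property, so the sketch below falls back on a heuristic, namely the first word of each letter's Unicode character name. This is an assumption made for the example, not a description of the actual implementation; the name guess_script and the prefix table are hypothetical, and only letters are counted so that punctuation, digits and combining marks do not skew the result.

```python
import unicodedata
from collections import Counter

# First word of the Unicode character name -> ISO 15924 code (heuristic table).
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "TAMIL": "Taml",
    "THAI": "Thai",
    "GRANTHA": "Gran",
    "DEVANAGARI": "Deva",
}


def guess_script(text):
    """Return the most frequent script among the letters of `text`.

    "Zyyy" (undetermined) is returned when no letter of a known script is found.
    """
    counts = Counter()
    for char in text:
        if not unicodedata.category(char).startswith("L"):
            continue  # skip digits, punctuation, spaces and combining marks
        prefix = unicodedata.name(char, "").split(" ", 1)[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script is not None:
            counts[script] += 1
    if not counts:
        return "Zyyy"
    return counts.most_common(1)[0][0]
```

Comparing the result with the script subtag declared in @xml:lang would also surface the mismatches counted above, for instance an element tagged with a non-Latin script whose text is entirely Latin.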

danbalogh commented 1 year ago

One thing that I should perhaps have made clearer earlier: I am not really familiar with the guide for critical editions. What I say is based on the guide for diplomatic editions.

> The @xml:lang of the root element, if not specified, is assumed to be en-Latn.
>
> This is to imply that people do not need to explicitly specify en-Latn and that they can even use a different @xml:lang for the root element if it is more convenient. When processing files, I will modify the XML in such a way that it adheres to the guidelines.

Please clarify. As I said above, @xml:lang="en-Latn" is already specified on the <TEI> element in our files. Or, looking at the EGC, I find that critical editions only have @xml:lang="eng". When you say you'll modify the XML so that it adheres to the guidelines, do you mean that you'll use "en-Latn" in the critical editions? If yes, fine. If you mean something else, then what guidelines and what modification do you have in mind? (Are there files without @xml:lang on <TEI>? Or are there files with a root element other than <TEI>?)

> The difficulties I have concern the use of <foreign> and <note> without an explicit @xml:lang. There are corner cases (a <foreign> within a <foreign>, a <note> within a <foreign>, etc.), and the elements from which to pull the appropriate @xml:lang will vary between texts (diplomatic editions and critical editions do not have a <div type="edition" xml:lang="xxx">, for instance). I will check further what people are doing and try to deduce a common rule.

I was not aware that critical editions did not have <div type="edition" xml:lang="xxx">. But looking at the EGC now, I find that they do have a <text xml:space="preserve" xml:lang="san-Latn"> which appears to be exactly equivalent. Foreign within foreign should, in my opinion, not occur without @xml:lang on at least one of them, and the cases where they do occur should be rechecked by their encoders. I also think that if someone adds a note within a foreign, then that note should have @xml:lang (again, to be rechecked and rectified by the encoders). I too may be guilty of such things, but these are in my opinion not use cases for which you need to cater, but rather errors of encoding that you have spotted and we as encoders should correct.

> For simplicity's sake, I propose to forget about the tagging of scripts within @xml:lang. I will try to infer the actual script with Unicode tables and process the text accordingly. This is sometimes impossible, but it should work well enough in our case.

As far as I am concerned, this suggestion is welcome, except that I would prefer to retain the explicit -Latn on transliterated text because that seems to be the recommended good practice. I'd be very happy if we did not have to make explicit provisions for various non-Latin scripts. I could also accept dispensing with script subtags altogether, but would like to hear some sound reasons why we should go against the recommendation. And of course I think @arlogriffiths should have the final word on this.

michaelnmmeyer commented 1 year ago

> When you say you'll modify the XML so that it adheres to the guidelines, do you mean that you'll use "en-Latn" in the critical editions?

Yes. Generally speaking, I try to automate the encoding as much as possible. There are many situations in which I can mechanically supply the needed information. Doing so makes the encoding more homogeneous and easier for me to process. This also simplifies the encoding task: people do not need to fill in <idno> and <license>, for instance.

Concerning the encoding of @xml:lang, I can add or remove the -Latn suffix as appropriate, depending on the language and the element's contents, so I am fine with both solutions.

michaelnmmeyer commented 6 months ago

I have made progress in the language assignment system. Currently, we have this:

A. We follow the basic inheritance rule: an element that does not bear an @xml:lang is assigned the language of its parent element. The language of the XML document, i.e. the parent of the root <TEI> element, is set to English.

B. We break the inheritance rule when the element is one of lem, rdg or foreign, whatever the context these elements occur in. When these elements do not bear an explicit @xml:lang, they are assigned one that depends on their location in the document. There are two cases:

B1. If they occur within the edition, the apparatus, the translation or the commentary, they are assigned the language they would inherit if they appeared in the edition, as children of the same textpart divisions.

B2. Otherwise, they are assigned the language of the edition division.

Example of rule B1: if you have an inscription with two textparts A and B in Sanskrit and in Tamil, respectively, foreign elements in the commentary of textpart A are assumed to be in Sanskrit, and those from textpart B are assumed to be in Tamil. If the commentary does not use textpart divisions, or uses textpart divisions whose @n values do not appear in the edition, the foreign elements it contains are assumed to be in the language assigned to the edition division itself.

Rule B implies that, for multilingual inscriptions that have a predominant language, assigning an @xml:lang to the edition division, in addition to its textpart divisions, is useful; this reduces the number of foreign elements that need to be assigned an explicit @xml:lang.

Besides lem, rdg and foreign, note might also warrant special treatment, because in practice it is always in a modern Western language. Currently, a note nested in a foreign element will inherit the foreign language, which does not seem desirable.
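For readers who prefer code to prose, rules A and B can be approximated as follows. This is a simplified sketch, not the project's actual implementation: it matches textparts only by their @n attribute, it takes into account only edition textparts that bear an explicit @xml:lang, and it ignores element namespaces. The function names and the default document language "eng" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BREAKERS = {"lem", "rdg", "foreign"}  # elements that break inheritance (rule B)


def local(elem):
    """Tag name without any namespace."""
    return elem.tag.rsplit("}", 1)[-1]


def is_div(elem, div_type):
    return local(elem) == "div" and elem.get("type") == div_type


def assign_languages(root, document_lang="eng"):
    """Return a dict mapping every element to the language assigned to it."""
    edition = next((e for e in root.iter() if is_div(e, "edition")), None)
    edition_lang = document_lang
    part_langs = {}  # @n of an edition textpart -> its explicit @xml:lang
    if edition is not None:
        edition_lang = edition.get(XML_LANG, document_lang)
        for e in edition.iter():
            if is_div(e, "textpart") and e.get("n") and e.get(XML_LANG):
                part_langs[e.get("n")] = e.get(XML_LANG)

    assigned = {}

    def walk(elem, inherited, special):
        # `special` is the language rule B gives to lem/rdg/foreign at this point.
        if is_div(elem, "textpart"):
            special = part_langs.get(elem.get("n"), special)  # rule B1
        explicit = elem.get(XML_LANG)
        if explicit:
            lang = explicit
        elif local(elem) in BREAKERS:
            lang = special  # rule B
        else:
            lang = inherited  # rule A
        assigned[elem] = lang
        for child in elem:
            walk(child, lang, special)

    # Rule B2: outside matching textparts, the edition's language applies.
    walk(root, document_lang, edition_lang)
    return assigned
```

As in the behaviour described above, a note nested in a foreign element inherits the foreign language in this sketch.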

danbalogh commented 6 months ago

Thanks, again, for formulating this. What I'm not sure I understand is the following:

You have "B1. If they occur within the edition, the apparatus, the translation or the commentary" and then "B2. Otherwise". What are the cases of otherwise? The TEI header (e.g. hand note) and the Bibliography? Anywhere else?
How essential is it from a technical point of view always to be able to assign an @xml:lang to the content of <foreign> elements? When "foreign" words in a modern-language context (i.e. in notes including apparatus notes, translation, commentary, bibliography and possibly parts of the TEI header) are not explicitly tagged with @xml:lang, we have been deliberately vague, saying simply (EGD §10.3.3) that this means the text is "in a language of study". This is actually more vague than what I said above (where I stated that they are understood to be in the edition div's language). Anyway, these would always be Romanised transliterations, and all should be displayed in the same way (italics). But I suppose that e.g. the commentary of a Javanese edition may use a couple of Sanskrit or Balinese words tagged in this way, as well as Javanese. To me, this is not a problem, and I think it isn't to Arlo either, since back then we had decided together to permit foreign without @xml:lang in such cases. So if you now make it a rule that these words automatically get the language of the edition division, we'll have false information for the Sanskrit and Balinese words, which can only be eliminated if the encoder goes back to review all instances of foreign in their files and check that they are not in a language other than that of the edition. Can we not simply just live with the fact that the language of such words is not specified? To my mind, in these particular cases, where no language-based machine action is foreseen, and where the informed reader is expected to know what language those words are, lack of specificity is more acceptable than specific encoding (or formal inheriting) of potentially wrong information.

If we can live with this, then your B rules remain in force for <lem> and <rdg>, but all instances of <foreign> without @xml:lang just remain "no language specified".

As for <note> we could either simply forbid its use inside <foreign> (moving the note outside the foreign element, as I suggested above); I am sure there aren't many instances of note within foreign in the corpus, so you could create a list and the encoders could make the corrections. Alternatively, if we are willing to accept and accommodate sloppy encoding (or if there are sound reasons unknown to me for putting a note inside a foreign), then you could just formulate the rule that a <note> element that is the descendant of a <foreign> element and does not have explicit @xml:lang inherits the language of the parent of the <foreign> element.
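The list of problem cases suggested here could be produced along the following lines. This is a sketch only: it reports every <note> or <foreign> that lacks an explicit @xml:lang and sits inside a <foreign>, identified by a simple path of element names; the function name find_untagged_in_foreign is hypothetical, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(elem):
    """Tag name without any namespace."""
    return elem.tag.rsplit("}", 1)[-1]


def find_untagged_in_foreign(root):
    """List paths of <note>/<foreign> elements without @xml:lang nested in a <foreign>."""
    problems = []

    def walk(elem, path, in_foreign):
        name = local(elem)
        here = path + "/" + name
        if in_foreign and name in ("note", "foreign") and elem.get(XML_LANG) is None:
            problems.append(here)
        for child in elem:
            walk(child, here, in_foreign or name == "foreign")

    walk(root, "", False)
    return problems
```

The output is meant for encoders to review by hand; whether such cases are then corrected or accommodated by a rule is the open question discussed above.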

michaelnmmeyer commented 6 months ago

For your first point: yes, exactly.

For the second: in practice, this only matters for inscriptions where @xml:lang needs to be reliably machine-actionable. Right now, this concerns the Tamil ones: Manu wants a transliteration button for switching between a Tamil display and a Latin one, and he has a few bilingual inscriptions, so accuracy is required in this case.

OK for notes, I will check what people are doing.

arlogriffiths commented 6 months ago

You mentioned this desire of Manu's in this thread last year but at that time I didn't react, sorry. What about the basic incompatibility of several of our subaksara-level encoding rules with offering such a transliteration button? Has that challenge been resolved? E.g., how is this going to be dealt with: kaṇv<choice><orig>i</orig><reg>ī</reg></choice>ḻccil?

I am not sure why you say that @xml:lang needs to be machine actionable only in Tamil inscriptions. When in Old Javanese inscriptions, for instance, we encode

<div type="edition" xml:lang="kaw-Latn">
<p>Irikā divaśanyājñā pāduka śrī tiktavilva nagareśvara, śrī rājasanagara nāma rājābhiṣe<lb n="1r4" break="no"/>ka, <foreign xml:lang="san-Latn">raṇa-prathita-mantri-nirjjita-nr̥pāntaropāyana-surāṅgaṇopamānāneka-vara-kāminī-sevyamāna</foreign>, garbhotpattināma, ...

this is displayed in italics. This is thanks to a machine action, right? More significantly, I imagine that one of the potentially desirable machine actions would be for the vocabulary within <foreign> to be indexed as Sanskrit.
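As an illustration of the kind of indexing meant here, a rough Python sketch follows. It is only an assumption about how such an index could be built, not a description of any existing DHARMA tool: `index_foreign` is a hypothetical helper, the sample is an abbreviated and rearranged version of the passage above, and the TEI namespace is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def index_foreign(root):
    """Group whitespace-separated tokens of explicitly tagged <foreign> by language."""
    index = defaultdict(list)
    for elem in root.iter("foreign"):
        tag = elem.get(XML_LANG)
        if not tag:
            continue  # untagged <foreign>: language unknown, so not indexed
        language = tag.split("-")[0]  # "san-Latn" -> "san"
        text = "".join(elem.itertext())
        index[language].extend(text.split())
    return dict(index)


# Abbreviated, hypothetical sample modelled on the Old Javanese passage.
SAMPLE = (
    '<div type="edition" xml:lang="kaw-Latn"><p>rājābhiṣeka, '
    '<foreign xml:lang="san-Latn">raṇa-prathita vara-kāminī</foreign>, '
    'garbhotpattināma <foreign>nāma</foreign></p></div>'
)
print(index_foreign(ET.fromstring(SAMPLE)))
```

Only the first `<foreign>` ends up in the index, under `san`; the untagged one is skipped because nothing tells the machine what language it is in.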

manufrancis commented 6 months ago

@michaelnmmeyer It might be relevant to your current thoughts that the display of Tamil in Tamil fonts concerns only edition and apparatus.

michaelnmmeyer commented 6 months ago

@arlogriffiths I have not reflected on transliteration issues for now, and have not discussed the details with Manu. For cases like the one you mentioned, we will have to color the whole akṣara in the display. It might thus be necessary to keep tooltips in Latin, to indicate more accurately which part of the akṣara is unclear, damaged, etc.

As concerns foreign elements, there are no issues as long as they bear an explicit @xml:lang (we can index them, transliterate them, etc.) The question is: what should we do with foreign elements that do not bear one? We can either:

- Assign them (by induction) a single, specific language. This is what I proposed above. Doing that is necessary for transliteration. The downside is that this requires people to annotate their foreign elements more often.
- Assign them a group of languages (maybe just the ones in the edition division). We would be able to say, for instance: "this piece of text is in Sanskrit or in Tamil", but we would not know the actual language. This is Dániel's solution.

For indexing and searching, a degree of fuzziness is acceptable, so the second solution might work fine. This really depends on what kind of query you intend to formulate.
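The difference between the two options can be sketched as follows in Python. This is a simplified reading, not actual project code: the first option is taken here to mean "the language declared on the edition division", the function names are invented, the sample words are placeholders, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def primary(tag):
    """Keep only the language part of a tag such as 'tam-Latn'."""
    return tag.split("-")[0]


def single_language(foreign, root):
    """Option 1: one specific language (explicit, else the edition's own)."""
    explicit = foreign.get(XML_LANG)
    if explicit:
        return primary(explicit)
    edition = root.find(".//div[@type='edition']")
    if edition is None:
        return None
    declared = edition.get(XML_LANG)
    return primary(declared) if declared else None


def candidate_languages(foreign, root):
    """Option 2: every language that occurs anywhere in the edition division."""
    explicit = foreign.get(XML_LANG)
    if explicit:
        return {primary(explicit)}
    edition = root.find(".//div[@type='edition']")
    if edition is None:
        return set()
    return {primary(e.get(XML_LANG)) for e in edition.iter() if e.get(XML_LANG)}


# Hypothetical bilingual file with placeholder words.
SAMPLE = (
    '<TEI><div type="edition" xml:lang="tam-Latn"><p>word '
    '<foreign xml:lang="san-Latn">word</foreign></p></div>'
    '<div type="commentary" xml:lang="eng"><p>The term '
    '<foreign>word</foreign> is unclear.</p></div></TEI>'
)
root = ET.fromstring(SAMPLE)
bare = root.find(".//div[@type='commentary']").find(".//foreign")
print(single_language(bare, root))
print(sorted(candidate_languages(bare, root)))
```

For the untagged `<foreign>` in the commentary, option 1 commits to `tam`, which may be wrong, while option 2 only says "`san` or `tam`".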

@manufrancis OK, noted!

arlogriffiths commented 6 months ago

Since Manu has explained that display of Tamil in Tamil font is only required in Edition and Apparatus, while Michaël has explained that this issue essentially concerns auto-transliteration of Tamil and <foreign> elements that do not bear @xml:lang, my question becomes: are there any such <foreign> elements in Edition or Apparatus anywhere in our XML files, a fortiori cases where Tamil is relevant as a language? I presume not, and if I am right in this assumption, then has this issue become obsolete?

michaelnmmeyer commented 6 months ago

There are about 11000 occurrences of <foreign> without an @xml:lang in apparatus notes. I propose we assume that, in such cases, the language meant is the one of the corresponding textpart division in the edition. In other cases, viz. when <foreign> appears in the translation, the commentary, etc., I propose to assume that the language meant is one of those that appear in the edition division.

danbalogh commented 6 months ago

@arlogriffiths, to elaborate on Michaël's answer above, I would like to add that displaying the contents of <foreign> in italics is not language-dependent, since all <foreign> elements are displayed as italics. Indexing the Sanskrit parts of a Javanese inscription is of course language-dependent, but @xml:lang is mandatory in the edition division, so it is not a problem with our current setup.

I thus think everyone agrees that no language-dependent machine action is foreseen for any <foreign> elements without an explicit @xml:lang.

There is a little bit of ambiguity left here, since Manu has stated above that automated script toggling for Tamil text is only desirable in the edition and the apparatus. I believe that what Manu meant was "in the edition and in the lemmas and readings of the apparatus", i.e. that apparatus notes do not require this toggle to work. Which, to spell it out, means that all the contents of <lem> and <rdg> elements are by default in Tamil for a Tamil inscription (i.e. in the language of the [corresponding textpart of the] edition, as a general rule), whereas <foreign> elements in a lemma or reading mean that the language is not Tamil (or not the edition language), and must be explicitly encoded with @xml:lang.

@manufrancis, please confirm this or, if I'm mistaken, please state expressly that you need this toggling to work in apparatus notes too.

If my assumption is incorrect, and bits of Tamil in apparatus notes have to be optionally displayable in Tamil script, then I see no way to achieve that other than to explicitly encode @xml:lang either on all foreign items in languages other than Tamil, or on all foreign items in Tamil, or on both. I am pretty sure apparatus notes to a Tamil text may sometimes contain Sanskrit (or Malayalam, Telugu, Pali, Portuguese, etc.) words. It is therefore incorrect to assume that any <foreign> item in an apparatus note is in the language of the (corresponding textpart of the) edition, unless we make it a rule that any <foreign> item in an apparatus note which is not in that language has to be encoded with an explicit @xml:lang (which will be very difficult for encoders to follow, and very troublesome to recheck the 11 thousand instances in already existing files).

If my assumption is correct and no Tamil-script display is needed in apparatus notes, then we have an easier job and we are back to the original question: is there any problem with simply stating that <foreign> without @xml:lang means "one of the languages of study"? Again, to spell it out, I note here that this is not necessarily equivalent to "one of the languages of the edition division", since I am pretty sure that Sanskrit (and perhaps some other languages) do appear in commentaries/translations of texts that are not themselves in Sanskrit.

manufrancis commented 6 months ago

I confirm Daniel's understanding of the matter: toggle is desired "in the edition and in the lemmas and readings of the apparatus", i.e. that apparatus notes do not require this toggle to work. Let us have an easier job.

danbalogh commented 6 months ago

Thanks, @manufrancis. This means we are back to my point 2 in this post above: is there any reason why we need to change our current practice of

1. permitting <foreign> without @xml:lang only in parts of the file which are in a modern international language, viz.
   - the commentary div, the translation div, and the bibliography div, including notes that are descendants of these divs,
   - notes in the apparatus,
   - and possibly the free-text parts of the TEI header;
2. agreeing that the specific language of these items will not be defined beyond "one of the languages of study";
3. accepting that these words will then not be processed by machine in a language-dependent way, e.g. they will not become part of any language-specific index.
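For what it is worth, the first point could be checked mechanically along the following lines. This Python sketch rests on several assumptions: the div type names are the usual EpiDoc ones, the TEI header case (which the list above leaves open) is not modelled, the TEI namespace is omitted, and `unpermitted_bare_foreign` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed top-level div types; textpart divs are deliberately not listed.
KNOWN_DIVS = {"edition", "apparatus", "translation", "commentary", "bibliography"}
PERMISSIVE_DIVS = {"commentary", "translation", "bibliography"}


def unpermitted_bare_foreign(root):
    """Return <foreign> elements lacking xml:lang outside the permitted places."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    offenders = []
    for foreign in root.iter("foreign"):
        if foreign.get(XML_LANG):
            continue  # explicitly tagged: always fine
        inside_note = False
        div_type = None
        node = parent_of.get(foreign)
        while node is not None:
            if node.tag == "note":
                inside_note = True
            elif node.tag == "div" and node.get("type") in KNOWN_DIVS:
                div_type = node.get("type")
                break
            node = parent_of.get(node)
        permitted = div_type in PERMISSIVE_DIVS or (
            div_type == "apparatus" and inside_note
        )
        if not permitted:
            offenders.append(foreign)
    return offenders


# Hypothetical file with placeholder content.
SAMPLE = (
    '<TEI>'
    '<div type="edition" xml:lang="tam-Latn"><div type="textpart"><p>'
    '<foreign>w</foreign> <foreign xml:lang="san-Latn">v</foreign></p></div></div>'
    '<div type="apparatus"><app><lem>a <foreign>x</foreign></lem>'
    '<note>see <foreign>y</foreign></note></app></div>'
    '<div type="commentary"><p><foreign>z</foreign></p></div>'
    '</TEI>'
)
print([f.text for f in unpermitted_bare_foreign(ET.fromstring(SAMPLE))])
```

Here the untagged `<foreign>` in the edition and the one inside `<lem>` are reported, while those in the apparatus note and in the commentary pass.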

michaelnmmeyer commented 6 months ago

OK for this.

To be noted that the distinction between source languages and the others is made here: DHARMA_languages_readme.md. Languages that do not appear in our languages table are assumed to be source languages.

manufrancis commented 6 months ago

@michaelnmmeyer One further thought. If it makes things easier, this "script toggle" might in fact be implemented only for "Physical".

Note: In both current "Logical" and "Physical" displays, editorial hyphens (added by editors, e.g. to split compounds and make these more easily grasped by the reader, and at the same time making explicit how the editor understands the compound) are displayed. I guess we would like such editorial hyphens only in "Logical", where they might cause complexities for script toggle, e.g. when the hyphen splits a syllable.

michaelnmmeyer commented 6 months ago

@manufrancis

I think handling hyphens will be OK. Hiding them in the "physical" display is on my to-do list.

danbalogh commented 6 months ago

Apropos of hyphens, we've always had vague plans, never concretised, of converting the compound segmentation hyphens to markup. This should, in theory at least, apply to all hardtext hyphens present in any part of the files that is in one of the source languages: chiefly the edition, lemmas, readings. It might be extended to items tagged as foreign without xml:lang, but I'm not entirely sure that is desirable, nor is it essential. So if you've been thinking about hyphens, it may also be a good idea to instead come up with some sort of markup (e.g. <milestone type="cpd"/> or whatever) to replace those hyphens with, and display that markup as hyphens in some views and as nothing in others. This would need the opinion of @arlogriffiths and possibly others before it is done, but I think it is feasible. The only concern I have with it is that once the replacement is done in the XML files themselves, they will then become even less human-readable than now.
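To give an idea of what this could look like, here is a small Python sketch. It works on plain character data only (running it over serialized XML would also hit hyphens inside attribute values such as `san-Latn`), the `<milestone type="cpd"/>` form is just the example floated above, and the function names and the "logical"/"physical" view labels are assumptions.

```python
import re

# Example markup only; the actual tag, if any, is still to be decided.
MILESTONE = '<milestone type="cpd"/>'


def hyphens_to_markup(text):
    """Replace hyphens that sit between two non-space characters."""
    return re.sub(r"(?<=\S)-(?=\S)", MILESTONE, text)


def render(marked_up, view):
    """Show the markup as a hyphen in the logical view, as nothing otherwise."""
    replacement = "-" if view == "logical" else ""
    return marked_up.replace(MILESTONE, replacement)


marked = hyphens_to_markup("raṇa-prathita-mantri")
print(marked)
print(render(marked, "logical"))   # raṇa-prathita-mantri
print(render(marked, "physical"))  # raṇaprathitamantri
```

The printed intermediate form also shows the readability cost: three short words turn into a line mostly made of tags.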

michaelnmmeyer commented 6 months ago

@danbalogh Thank you, noted.

arlogriffiths commented 6 months ago

I don't think we want to do anything with segmentation hyphens other than in "source languages" (sorry, I have not been able to keep up with all the discussions on this concept), and I am not even sure we need to do anything with such hyphens outside <div type="edition">.

If there is no significant computational gain from replacing such hyphens by something like <milestone type="cpd"/> in our XML files, then indeed it may be nice to avoid further reducing the human-readability of our files. But if there is significant computational gain, then let us try to come up with a tag that is as slender as possible, ideally slenderer than <milestone type="cpd"/>.

arlogriffiths commented 5 months ago

Could you tell us how this has been completed?