TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Misuse of @xml:lang in an example for msPart #2252

Closed MarjorieBurghart closed 1 year ago

MarjorieBurghart commented 2 years ago

Hi! In an example featured on the msPart element description:

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-msPart.html

the @xml:lang attribute is misused on this element:

 <msContents>
  <summary xml:lang="lat">Miscellany of various texts; Prudentius, Psychomachia; Physiologus de natura animantium</summary>
  <textLang mainLang="lat">Latin</textLang>
 </msContents>

With this encoding, we would expect the summary to be in Latin. It's probably a little confusion with textLang/@mainLang.

sydb commented 2 years ago

Worse, we would expect the summary to be in whatever "lat" means. Latin is "la".

MarjorieBurghart commented 2 years ago

> Worse, we would expect the summary to be in whatever "lat" means. Latin is "la".

Actually... Why isn't @xml:lang limited to the values in the appropriate list of languages? I mean, using 3-letter language codes is a very common mistake, and I am sure I have also been guilty of it.
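To illustrate the 3-letter mistake, here is a minimal Python sketch that suggests the shorter subtag when a value starts with a well-known ISO 639-2 code. The `THREE_TO_TWO` table and the `suggest_fix` function are hypothetical names made up for this illustration, and the table is deliberately tiny. Many languages only have a 3-letter code, so a 3-letter value is not wrong in itself.

```python
# Hypothetical, deliberately incomplete lookup table: a few ISO 639-2
# three-letter codes and the two-letter subtags that should be used instead.
THREE_TO_TWO = {
    "lat": "la",
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
}


def suggest_fix(xml_lang):
    """Return a corrected xml:lang value, or None if no fix is known."""
    # Only the primary subtag (before the first hyphen) is inspected.
    primary, sep, rest = xml_lang.partition("-")
    shorter = THREE_TO_TWO.get(primary.lower())
    if shorter is None:
        return None
    return shorter + sep + rest


print(suggest_fix("lat"))  # la
print(suggest_fix("grc"))  # None (not in the table, so no suggestion)
```

A check like this only catches the codes listed in the table; it does not replace validation against the registry.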

martindholmes commented 2 years ago

@MarjorieBurghart We can't really constrain @xml:lang because private use extensions (-x-*) are acceptable, and because the IANA language subtag registry does change steadily over time.

MarjorieBurghart commented 2 years ago

Fair enough.

sydb commented 2 years ago

Good question. The list cannot be enumerated, of course, because there are far too many combinations of language-script-region-variant-extension. Several of us have written regular expressions that can certainly help, but to catch this sort of error the regular expression has to include the actual list of registered language tags, which means updating the regular expression every time that list gets updated. (Last updated exactly 1 month ago today.) Now that I think about it, though, it should be possible to write a routine that downloads that list (and perhaps the lists of scripts and regions too) and generates the regular expression at TEI build time. Hmmm …

P.S. I have already done part of this work for the WWP. These days I would probably do this using invisible XML instead, particularly now that the NineML tools are available. Hopefully Aparecium will be available soon, too.
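A rough Python sketch of such a routine, under stated assumptions: it parses text in the record layout of the IANA language subtag registry (records separated by `%%` lines, with `Type:` and `Subtag:` fields) and builds a regular expression from the language subtags. `SAMPLE_REGISTRY` is a hand-written, abridged excerpt rather than the real file, the function names are invented for this illustration, and only the primary subtag is checked against the list. A real build step would read the downloaded registry and probably handle scripts and regions as well.

```python
import re

# Hand-written, abridged excerpt in the registry's record layout.
# A real build step would read the full downloaded registry instead.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: la
Description: Latin
%%
Type: language
Subtag: qaa..qtz
Description: Private use
%%
Type: script
Subtag: Latn
Description: Latin
"""


def language_subtags(registry_text):
    """Return the set of subtags from records whose Type is 'language'."""
    subtags = set()
    for record in re.split(r"^%%$", registry_text, flags=re.MULTILINE):
        fields = {}
        for line in record.splitlines():
            # Skip blank lines and folded continuation lines.
            if not line.strip() or line[0].isspace():
                continue
            name, _, value = line.partition(":")
            # Keep the first value only; some fields may repeat.
            fields.setdefault(name.strip(), value.strip())
        subtag = fields.get("Subtag", "")
        # Range entries such as "qaa..qtz" are skipped in this sketch.
        if fields.get("Type") == "language" and subtag and ".." not in subtag:
            subtags.add(subtag)
    return subtags


def build_language_regex(subtags):
    """Compile a regex: a known primary subtag, then any further subtags."""
    alternatives = "|".join(sorted(re.escape(s) for s in subtags))
    return re.compile(
        rf"(?:{alternatives})(?:-[A-Za-z0-9]{{1,8}})*", re.IGNORECASE
    )


rx = build_language_regex(language_subtags(SAMPLE_REGISTRY))
print(bool(rx.fullmatch("la")))   # True
print(bool(rx.fullmatch("lat")))  # False
```

Because the pattern is regenerated from the registry text each time, it stays current without hand edits; the cost is a very long alternation once the full list is used.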

lb42 commented 2 years ago

I was just looking at invisible XML myself. I wonder how it would cope with bare text or PDF... Seems like all the examples are about other formal languages with nice simple grammars

martindholmes commented 2 years ago

There is a tool for turning the subtag registry into a regex for checking things:

https://github.com/projectEndings/diagnostics/blob/dev/utilities/subtag_reg_to_xml.xsl

sydb commented 2 years ago

@lb42 (not that this is the place to discuss iXML, but since you bring it up …) Yes, it is about transforming text that can be expressed with a nice grammar. Whether it is simple or not is up to you. (The examples are all simple for pedagogical reasons, I suppose.)

@martindholmes: Well, that sorta takes the fun out of it, doesn’t it? Any reason not to make a regex like this one the definition of teidata.language? (I say “like” because at the very least the leading caret has to be removed.) Probably would have to special-case the output of the constraint in the tagdoc, but still might help users quite a bit.

@MarjorieBurghart — Note that the regex Martin’s code generates is 53,651 characters long.

sydb commented 2 years ago

Oh my, @MarjorieBurghart, you have stumbled into a bit of a hornet’s nest. Besides the language problem you point out (which I have to admit, I do not know how to encode properly), there is also a <remarks> element (which shows up on that page as a “Note”) that refers to “that last example”, and says the weirdest thing about it: that it demonstrates you can use “<altIdentifier> rather than <msIdentifier>”, although this usage is “deprecated”. Besides the fact that we now use the word “deprecated” for a much more formal purpose, each of the 3 <altIdentifier>s on the page is a child of an <msIdentifier> — so none of them could be replaced by an <msIdentifier>. Furthermore, a <remarks> element occurs after the examples in the XML, but shows up before the examples in the HTML output. So I am really confused as to which example that annotation is referring to.

So first, how should the summary “Miscellany of various texts; Prudentius, Psychomachia; Physiologus de natura animantium” be encoded? The initial part is in English, the final part in Latin, yes?

Second, to which example does that Note belong? (If any; did there used to be another example?)

MarjorieBurghart commented 2 years ago

@sydb gasp

sydb commented 2 years ago

Assigned to @bleekere to fix the actual problem @MarjorieBurghart identified. Since I do not do manuscript description and do not speak Latin and do not know the mss being described, I am not sure, but two real possibilities are

    <list>
      <head xml:lang="en">Miscellany of various texts;</head>
      <item xml:lang="la">Prudentius, Psychomachia;</item>
      <item xml:lang="la">Physiologus de natura animantium</item>
    </list>

or

      <seg xml:lang="en">Miscellany of various texts;</seg>
      <seg xml:lang="la">Prudentius, Psychomachia; Physiologus de natura animantium</seg>

Assigned to me to create separate issues for an @xml:lang checking process at build time and the <altIdentifier> issues I noticed.

bleekere commented 1 year ago

@MarjorieBurghart following up on @sydb's comment on June 9: do you have a strong preference for either encoding? If not, I propose we opt for:

<msContents>
  <summary>
    <list>
      <item xml:lang="en">Miscellany of various texts;</item>
      <item xml:lang="la">Prudentius, Psychomachia;</item>
      <item xml:lang="la">Physiologus de natura animantium</item>
    </list>
  </summary>
  <textLang mainLang="la">Latin</textLang>
</msContents>

If I'm not mistaken, the example is taken from this manuscript which seems to contain a number of texts, all in Latin. But this issue really only concerns a correction of the fact that in the current encoding, the @xml:lang="la" on the <summary> element suggests that the <summary> element only contains Latin text.
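Since @xml:lang is inherited by descendants, one way to see what each element ends up claiming is to walk the tree and report the effective language. This is only a sketch using Python's standard `xml.etree.ElementTree`; the `effective_langs` helper is a made-up name, and the TEI namespace is left out to keep the snippet short.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The proposed encoding, without the TEI namespace (a simplification).
PROPOSED = """\
<msContents>
  <summary>
    <list>
      <item xml:lang="en">Miscellany of various texts;</item>
      <item xml:lang="la">Prudentius, Psychomachia;</item>
      <item xml:lang="la">Physiologus de natura animantium</item>
    </list>
  </summary>
  <textLang mainLang="la">Latin</textLang>
</msContents>
"""


def effective_langs(element, inherited=None):
    """Yield (tag, effective xml:lang) for an element and its descendants."""
    # An element's own xml:lang wins; otherwise it inherits from its parent.
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_langs(child, lang)


for tag, lang in effective_langs(ET.fromstring(PROPOSED)):
    print(tag, lang)
```

With the proposed encoding, only the three <item> elements carry a language; <summary> itself no longer asserts one.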

martinascholger commented 1 year ago

Council F2F suggests implementing the example proposed by @bleekere unless @MarjorieBurghart strongly objects.