New attribute `@defaultLang` for `<tagUsage>`

martindholmes commented 4 years ago

This FR arises out of these discussions on TEI-L:

https://listserv.brown.edu/cgi-bin/wa?A1=ind2001&L=TEI-L#1 https://listserv.brown.edu/cgi-bin/wa?A1=ind2001&L=TEI-L#11

The basic idea is that it's very common, particularly in the case of dictionaries, to have a scenario in which all instances of a particular element are in one language, and all instances of another in another language, in a particular file. Rather than specify @xml:lang repeatedly throughout the document, it would make more sense to declare it once in the header, using <tagUsage>. Detractors have pointed out that XML has a mechanism for default attribute values, but this is done using a schema, not at the document level, so a collection which includes documents in various languages which all need to use the same schema cannot use this approach. Also, the notion that document content somehow resides in the schema rather than in the document itself makes many of us uneasy; default attribute values seem to violate separation-of-concerns.

lb42 commented 4 years ago

You could also consider adding an attribute or some substructure to <langUsage>, which seems like an equally appropriate place to look for this information. An XPath-valued attribute might be useful for the case where, say, <div type="foo"> has a different default language from <div type="bar">.

duncdrum commented 4 years ago

I'm not a fan of either solution and would prefer not to have this feature, but if a defaultLang element is coming it should go to langUsage imv.

hcayless commented 4 years ago

My feeling about this is that it's fine as documentation, but that if I were implementing a solution that did anything, I would want the @xml:lang specified on every element. In stuff I do it's quite common to want to (e.g.) switch the font, or font size, based on what language you're dealing with, and I don't want to have to reckon with two ways the language I'm currently dealing with might have changed. For this reason I'd actually prefer setting the default in the schema, and serializing that default when I publish.

Again, though, as a bit of documentation, this is probably a good idea.

dariok commented 4 years ago

@duncdrum Could you elaborate on why you'd prefer not to have such a feature? That might point out cases @martindholmes and I have not thought about in the discussion on TEI-L.

@hcayless The problem with setting the default in the schema comes when you have projects with numerous languages and it does have a few problems from a tech point of view.

A quick summary of the points I elaborated on in the discussion:

project POV

With defaults in the schema, you either cannot use that mechanism at all or have to maintain multiple schema files that differ only in the default language. While, e.g., your comments will always be in English, <text> and <orig> will share a language that may change from file to file: Latin for some texts, German for others, English for yet another set.

Tech POV

Additionally, RNG does not implement attribute defaults unless you explicitly activate DTD compatibility mode.
Furthermore, a non-schema-aware XSLT processor does not validate input documents (and, e.g., Saxon HE is not schema-aware).

These two things mean that the use of default values makes the content and meaning of the XML implementation-dependent: the same set of XML, Schema and XSLT yields different results depending on which tool is used.

As @martindholmes said initially, this violates the fundamental principle of separation of concerns which is why we feel rather strongly about this.

@lb42 has a good point that an XPath would be even more powerful.

I'm not completely sure whether, as he and @duncdrum suggest, <langUsage> would be a better place. On one hand you wouldn't have to duplicate the language info, but then again this element applies to the source text and not our additions.

martindholmes commented 4 years ago

@lb42 @duncdrum @hcayless As far as possible, I would like documentation to be programmatically actionable, and both the simple proposal (@defaultLang on <tagUsage>) and the more sophisticated one Lou proposed (@xpath or something similar on <langUsage>) are processable, although the second is harder for an encoder to write and slightly harder to process. I would myself use this in a pre-processing step to decorate relevant elements with @xml:lang before a rendering process (probably), and certainly if I were creating a version for interchange I would do that. But for convenience and consistency when encoding, I think one of these approaches would be really helpful.

duncdrum commented 4 years ago

@martindholmes @dariok idgi. You want a pre-processed file to be valid TEI, with the explicit goal that an interchange version would also be proper XML on top.

This is backwards in my mind. A pre-processed file can be whatever, and you can add @xml:lang programmatically to every element x without the proposed element.

A valid TEI file should always be an interchangeable file, if we stick to the goals formulated in the guidelines. That means Lang attributes according to xml rules, and there a defaultLang doesn’t exist.

martindholmes commented 4 years ago

@duncdrum In virtually all of our projects, we have encoder-level versions of files, where we make full use of all the efficiencies and conveniences available to us (centralized -ographies, prefix definitions, etc.), and versions generated from these which are intended for interchange, in which all the weirder features that require custom processing like these are normalized -- external resources are imported, non-Julian dates are normalized, prefixDefs are resolved to straightforward links, and many other things are done which create a version of the file which any project or encoder could pick up and use without needing to learn the details of our project. This seems like common sense and good practice to me. Encoders need to work in environments with as few distractions as possible, where pointers are short and easy to understand, and the XML is as sparse as can be achieved through existing TEI mechanisms; but processors and interchange users need explicitly-realized, exhaustive versions of the files. So the build process generates them as a matter of course.

duncdrum commented 4 years ago

I have no doubt about your practices. But the proposed attribute will allow for bad practices to be valid TEI, which I consider bad design. Why not lobby for xml:defaultLang as opposed to tei:defaultLang?

If lang were a TEI-namespaced attribute I would fully support your idea. But an interchange format is not the place to address the shortcomings of XML-namespaced attributes.

However, the problem I really care about is Unihan. I don't care deeply about this one, however it goes, since I'll remain free to ignore the addition.

martindholmes commented 4 years ago

@duncdrum The (presumably not serious) suggestion that we lobby for xml:defaultLang suggests that you might not be understanding exactly what we're proposing; there would be no way to do this with only an attribute, because there's no way of specifying which elements it would apply to. What we suggest (in its simple form) is something like this:

<tagUsage gi="form" defaultLang="wya"><gi>form</gi> elements in this document are Wendat unless otherwise specified.</tagUsage>
<tagUsage gi="def" defaultLang="la"><gi>def</gi> elements in this document are Latin unless otherwise specified.</tagUsage>

This is the sort of thing <tagUsage> is for, isn't it? Two of its current attributes are now pretty much obsolete, since their values can be calculated automatically by any processor. It used to have @render to specify a default <rendition> element for the element (pretty much the same sort of thing as we propose), but that was obsoleted by the <rendition>/@selector attribute (IIRC). Now it languishes with not much to do. This would be a useful thing it could do.

We already have <langUsage>, in which you can make vague statements about the languages occurring in the document; all we're suggesting is a more precisely targeted method of doing this, which would be especially useful for dictionary collections.

Incidentally, there are no TEI namespaced attributes. All attributes defined in the TEI Guidelines are in the empty namespace.
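As a rough illustration of how the simple <tagUsage> form could be acted on in a pre-processing step, here is a minimal Python sketch. It is not part of the proposal: the function name `apply_default_langs` is made up, the sample omits the TEI namespace and the usual <encodingDesc>/<tagsDecl> wrappers for brevity, the <form> contents are placeholders, and it assumes the rule that an explicit @xml:lang on an element always wins over the declared default.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample: no TEI namespace, header wrappers omitted,
# <form> contents are placeholders rather than real Wendat.
SAMPLE = """<TEI xml:lang="en">
  <teiHeader>
    <tagUsage gi="form" defaultLang="wya"/>
    <tagUsage gi="def" defaultLang="la"/>
  </teiHeader>
  <body>
    <entryFree><form>a</form><def>veni</def></entryFree>
    <entryFree><form xml:lang="fr">b</form><def>vidi</def></entryFree>
  </body>
</TEI>"""


def apply_default_langs(root):
    """Copy each tagUsage/@defaultLang onto matching elements that have
    no xml:lang of their own. Returns the number of elements changed."""
    defaults = {
        tu.get("gi"): tu.get("defaultLang")
        for tu in root.iter("tagUsage")
        if tu.get("gi") and tu.get("defaultLang")
    }
    changed = 0
    for elem in root.iter():
        lang = defaults.get(elem.tag)
        # Assumed rule: an explicit xml:lang is never overwritten.
        if lang is not None and XML_LANG not in elem.attrib:
            elem.set(XML_LANG, lang)
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = apply_default_langs(root)
langs = [(e.tag, e.get(XML_LANG)) for e in root.iter() if e.tag in ("form", "def")]
print(count, langs)
```

With real TEI files the element names would carry the TEI namespace, so the lookup key would need to be built from the namespace URI plus the @gi value.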

sydb commented 4 years ago

I think of this as little more than a syntactic sugar idea to make encoded files smaller and more manageable. Seems perfectly reasonable. (Although @martindholmes’ assertion that he would convert this new mechanism to @xml:lang before interchange seems to me to be an argument against, rather than in favor — TEI is about interchange. If you just need this for the convenience of your encoders, not for interchange, why does it have to be in the Guidelines? That said, if everyone is going to do this, may as well do it the same way.)

I can see arguments both for doing this in <langUsage> and with <tagUsage>; I am undecided on that issue. But certainly an XPath (or CSS) selection mechanism, rather than just @gi, is far more powerful and thus more useful.

I don’t understand at all, @duncdrum, what practice this allows that you think is bad. Certainly a file that uses the proposed @defaultLang is perfectly interchangeable, just as a file that uses <prefixDef> is (whether @martindholmes chooses to convert to full URLs and propagated @xml:lang or not.)

duncdrum commented 4 years ago

@martindholmes I have followed the discussion on the list, and your op gives a good summary:

a scenario in which all instances of a particular element are in one language, and all instances of another in another language, in a particular file. Rather than specify @xml:lang repeatedly throughout the document, it would make more sense to declare it once

I'm against this for basically three reasons, which I was a bit too terse to express, so here it goes in full:

Wrong Place

I'm not kidding: the point of mentioning the XML working group is that I do believe that this would be the place to make the change you are seeking, not here. I don't see why a single @xml:defaultElementLang attribute could not fulfil what you are seeking. Such an attribute would define the language of all elements with the same QName within the same document.

<TEI xml:lang="en">
…
  <body>
    <entryFree>
      <orth>I came</orth>
      <def xml:defaultElementLang="la">veni</def>
    </entryFree>
    <entryFree>
      <orth>I saw</orth>
      <def>vidi</def>
    </entryFree>
    <entryFree>
      <orth>I conquered</orth>
      <def>vici</def>
    </entryFree>
  </body>
</TEI>

The question of the empty namespace is a red herring. The difference with say @type is that the rules for @xml:lang's contents and its scope are not defined by the TEI, @type is. So let's call it tei-scoped to be more clear.

@sydb This is why I think this leads to bad markup. Unless the proposed @defaultLang attribute is expanded to actually insert @xml:lang attributes where they should be, we have ambiguous encoding.

<TEI xml:lang="en">
  <teiHeader>
    <langUsage>
      <language ident="la" defaultLangpattern="//def">Latin</language>
    </langUsage>
  </teiHeader>
  <body>
    <entryFree>
      <orth>I came</orth>
      <def>veni</def>
    </entryFree>
    <entryFree>
      <orth>I saw</orth>
      <def>vidi</def>
    </entryFree>
    <entryFree>
      <orth>I conquered</orth>
      <def>vici</def>
    </entryFree>
  </body>
</TEI>

According to XML rules there is no Latin in the second example. The suggestion would allow all kinds of patterns to mean that there is Latin text in this dictionary, and processors would have to know about the TEI version and its rules, because according to TEI there would then be Latin. While I have no doubt that @martindholmes wouldn't produce such output, once it is in the Guidelines someone somewhere will. This is not a good design choice for an interchange format.

XPath queries can rely on the rules for @xml:lang being uniform in any XML document; with the addition of @defaultLang they no longer can.

Wrong Scope

In virtually all of our projects, we have encoder-level versions of files, where we make full use of all the efficiencies and conveniences available to us (centralized -ographies, prefix definitions, etc.), and versions generated from these which are intended for interchange

Yet there are numerous ways to deal with it at the project level:

create all defs in a separate document with xml:lang="la" on a common ancestor and import them via one of the reference or pointer mechanisms available.
transform encoder-level files via XSLT/XQuery: adding an attribute to every def is trivial, and how and whether to do so should stay a project-level decision. Without xml:lang in the right places these are shady XML files to begin with, so I see no reason to demand that they be shining TEI in the first place.
use a project ODD to implement the OP; again, there is no reason why other projects need to do this the same way.

In all cases an interchange format is necessary, presumably in TEI, that has the lang attributes where they belong. How to get there is beyond the scope of the Guidelines imv and should be left to individual projects.

Too little benefit

We all agree that in the real world the computational overhead of repeating @xml:lang is not even measurable. So this is an aesthetic preference, and I even share your dissatisfaction.

With this convenience method introduced, I can imagine numerous ways to create bad markup where the mechanism is used in an unforeseen way. (How about using positional predicates for the XPath pattern, so that all even-numbered <div type='page'> are en and all others are la?) The aesthetic benefits just don't justify introducing a means to create markup we all consider bad imv.

martindholmes commented 4 years ago

@duncdrum I think you're mostly responding to the XPath variant of this proposal rather than the <tagUsage> variant. I would be happy with the simpler version myself. You're right that it's easy to do this at a project level with customization, but whenever a customization looks like it might be generally useful across many projects, it's surely worth discussing it as a possible TEI feature.

As far as this is concerned: "in the real world the computational overhead of repeating @xml:lang is not even measurable": it's not the computational overhead I'm worried about, it's the issue of human encoders having to deal with three lines of text which include 20 @xml:lang attributes, all of which are perfectly predictable, but will have to clutter up their interface while they're encoding. That clutter in itself makes transcription and encoding slower and more error-prone, and that of course will lead to "bad markup".

duncdrum commented 4 years ago

@duncdrum I think you're mostly responding to the XPath variant of this proposal rather than the <tagUsage> variant.

Indeed, my general objection is based on TEI as a data interchange format, aka XPath. <tagUsage> seems primarily concerned with rendering. I can imagine wanting to apply localization features in a browser to all elements as if they had a certain language set (use vertical page layout, say), but the two are different concerns. Rendering all defs with Latin menus is different from saying the contents of the def are in Latin. I'd prefer all language-related info to be in one place, <langUsage>, but ultimately I don't care where in the header this would happen. All of them go against my understanding of the data model.

I would be happy with the simpler version myself. You're right that it's easy to do this at a project level with customization, but whenever a customization looks like it might be generally useful across many projects, it's surely worth discussing it as a possible TEI feature.

Always happy to discuss these things with you, as I tend to learn a lot from it.

As far as this is concerned: "in the real world the computational overhead of repeating @xml:lang is not even measurable": it's not the computational overhead I'm worried about, it's the issue of human encoders having to deal with three lines of text which include 20 @xml:lang attributes, all of which are perfectly predictable, but will have to clutter up their interface while they're encoding. That clutter in itself makes transcription and encoding slower and more error-prone, and that of course will lead to "bad markup".

All the more reason to push for this change where it belongs: at the W3C, not here. These are also all issues of representation; we shouldn't break the data model to fix them, but rather fix the UI. Lots of room for improving i18n and l10n within TEI, while we're at it.

sydb commented 4 years ago

Although I am far from sure this is a great idea, I find your arguments against, @duncdrum, unconvincing.

wrong place: Well, you are right that it would be lovely if W3C created a mechanism for default languages. But a) we do not have any control over W3C, we have a lot of control over TEI; b) it is not uncommon for good ideas to appear out in the world and then eventually make their way into W3C (most egregiously XPath, which derives from the TEI extended pointer mechanism); and c) given that W3C no longer has an XML activity lead, I am not going to hold my breath, even if it is a good idea and someone does propose it directly to W3C.
[bad encoding]: I’m sorry, but what you are saying is essentially “if TEI says DUCK implies WADDLE, then when we have a DUCK we don’t know if there is a WADDLE or not.” But as long as we pay attention to the rules of TEI, there is no ambiguity. That is, the (clever) veni/vidi/vici example has no Latin only if we choose to ignore the very rules we are trying to establish. Yes, if someone processed such a TEI document as an XML document without knowing anything about the TEI, they would lose some of the information (in this case, some language information). But that is true any time you process a TEI document without knowing it is TEI: you would not know how to use the information in <prefixDef> or <tagUsage>; would be confused as to why the value of @age is not a number; might well have no way to figure out that <l> is a metrical line, <line> a physical line, and <lb> the beginning of a physical line; and would probably think that <caption> would be used to indicate the caption of an image or figure. Heck, the same is true of processing a DocBook, SVG, KML, or MathML document without access to the definitions and meanings of those languages.
wrong scope: Actually, here I think you may be right. Still chewing on this one.
Too little benefit: Not clear to me either way. Yes, the benefit is small. But honestly, the cost is not that high.

Note that I am presuming that the very real risk of actual ambiguity is handled by explicit rules in TEI that say what something like the following means.

<TEI xml:lang="en">
  <teiHeader>
    <langUsage>
      <language ident="la" defaultLanguagePattern="//def">All <gi>def</gi> elements are in Latin</language>
    </langUsage>
  </teiHeader>
  <body>
    <entryFree>
      <orth>I came</orth>
      <def>veni</def>
    </entryFree>
    <entryFree>
      <orth>I saw</orth>
      <def xml:lang="la">vidi</def>
    </entryFree>
    <entryFree>
      <orth>I conquered</orth>
      <def xml:lang="es">conquisté</def>
    </entryFree>
  </body>
</TEI>

I think we can all agree that a) “veni” and “vidi” are in Latin, and b) we had better have rules that tell us whether “conquisté” is in Spanish or Latin.
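To make the question concrete, here is a small Python sketch of one candidate precedence rule; it is an assumption for illustration, not anything the Guidelines define. The rule assumed here: an element's own @xml:lang wins, then a header default whose pattern matches the element, then whatever is inherited from the nearest ancestor. The @defaultLanguagePattern attribute is the hypothetical one from the example above, written as `.//def` because Python's ElementTree supports only a limited XPath subset and rejects absolute paths on elements; the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<TEI xml:lang="en">
  <teiHeader>
    <langUsage>
      <language ident="la" defaultLanguagePattern=".//def">Latin</language>
    </langUsage>
  </teiHeader>
  <body>
    <entryFree><orth>I came</orth><def>veni</def></entryFree>
    <entryFree><orth>I saw</orth><def xml:lang="la">vidi</def></entryFree>
    <entryFree><orth>I conquered</orth><def xml:lang="es">conquisté</def></entryFree>
  </body>
</TEI>"""


def effective_langs(root):
    """Map every element to its language under the assumed rule:
    own xml:lang > matching header default > inherited value."""
    header_default = {}
    for language in root.iter("language"):
        pattern = language.get("defaultLanguagePattern")  # hypothetical attribute
        if pattern:
            for target in root.findall(pattern):
                header_default[target] = language.get("ident")

    result = {}

    def walk(elem, inherited):
        lang = elem.get(XML_LANG) or header_default.get(elem) or inherited
        result[elem] = lang
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result


root = ET.fromstring(SAMPLE)
by_text = {
    elem.text: lang
    for elem, lang in effective_langs(root).items()
    if elem.tag in ("orth", "def")
}
print(by_text)
```

Under this particular rule “conquisté” comes out as Spanish, because the explicit attribute beats the header default. Swapping the first two terms in `walk` would make it Latin instead, which is exactly the decision the rules would have to make.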

duncdrum commented 4 years ago

@sydb //*[@xml:lang='la'] will get me all Latin in a mix of DocBook, DITA, SVG, and TEI documents. My point is exactly that after the proposed change it won't. <prefixDef>, <line> etc. are tei-scoped (and namespaced for real this time); @xml:lang is not.

Also, vidi is Latin, but veni inherits the xml:lang attribute from TEI, where it is explicitly defined as en. There is no @version, so in 3.5 it's in English and in 3.6 it could be Latin; hence the ambiguity.

sydb commented 4 years ago

On first para: While true, doesn’t help convince me at all. Remember that @xml:lang is neither required for, nor required to be the only mechanism for, language identification.

On second para: Frikin’ good point. (And good reason why people should specify a version.)

duncdrum commented 4 years ago

On first para: While true, doesn’t help convince me at all. Remember that @xml:lang is neither required for, nor required to be the only mechanism for, language identification.

Wanna bet a pint/large cup of tea at our next f2f that all the widely used schemas you mentioned do specify @xml:lang to designate languages, although no one required them to do so? My bet is they do.

martindholmes commented 4 years ago

@duncdrum

//*[@xml:lang='la'] will get me all latin in a mix of docbook, DITA, SVG, and TEI documents.

No it won't. It will get you all the elements which have @xml:lang='la' declared on them, but those elements may have descendants with other @xml:lang values. In a lot of documents, it would get you a mixed mess of languages. You want:

//*[@xml:lang='la' or (not(@xml:lang) and ancestor::*[@xml:lang][1]/@xml:lang='la')]

or something like that.
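For anyone processing outside XPath, the same inheritance logic can be sketched in Python. This is only an illustration with a made-up helper name (`elements_in_lang`) and a toy document; it compares language codes by exact string match, which is a simplification.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def elements_in_lang(root, wanted):
    """Return, in document order, the elements whose own or inherited
    xml:lang equals `wanted` (exact string comparison only)."""
    found = []

    def walk(elem, inherited):
        # An element's own xml:lang overrides whatever it inherited.
        lang = elem.get(XML_LANG, inherited)
        if lang == wanted:
            found.append(elem)
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return found


doc = ET.fromstring(
    '<div xml:lang="la"><p>arma <foreign xml:lang="en">arms</foreign></p>'
    '<p xml:lang="de">Waffen</p></div>'
)
tags = [e.tag for e in elements_in_lang(doc, "la")]
print(tags)
```

Within XPath itself, the built-in lang() function, as in //*[lang('la')], also takes inheritance into account and additionally matches subtags, so it may be the shorter route where the processor supports it.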

sydb commented 4 years ago

(@martindholmes is correct, a more complicated XPath like his is needed, but that does not change the validity of @duncdrum’s point.)

@duncdrum: Of course I won’t take the bet, and of course you are correct. It is not that I missed that point, it’s that I don’t care. :-) I am positing that it is not reasonable to put the needs of those who want to use one simple XPath to extract content of a particular (natural) language from documents that are encoded in multiple (XML) languages over those of TEI users. TEI is not intended to be interoperable with DITA systems, it is intended to be interchangeable with other TEI systems.

That said, please remember I am not actually arguing in favor (or against, for that matter). I am just thinking these issues through aloud (as it were). I am still undecided on this one.

martindholmes commented 4 years ago

I wonder what @laurentromary would think of this. It's really most useful for people encoding multilingual dictionaries. Laurent?

laurentromary commented 4 years ago

Thanks for the nudge @martindholmes. I was just following from a distance. In TEI Lex-0 we are working with a scenario of a network of dictionaries which we want to be able to process (query) uniformly. We thus try to avoid any kind of magic in the representation. For instance, we enforce that each entry has an @xml:lang attribute so that this information is not inherited by accident. In that case, the @xml:lang information represents the object language (the language about which the entry is) and must be superseded by actual working languages when it differs (the example). We do prefer redundancy to uncertainty in the representation, I must say, and are used to having @xml:lang reproduced all over the place. I would definitely be reluctant to use @defaultLang in such situations. Does this help?

martindholmes commented 4 years ago

Thanks @laurentromary. I guess you're one more vote against the idea, then! Perhaps it's better as a project-level customization.

laurentromary commented 4 years ago

At project level, when editing an initial lexical resource, that could make sense if you want to reduce the workload of editors checking that @xml:lang is correctly set where it should be. But then an edition-specific customization could also do that for you.

martindholmes commented 4 years ago

OK, I'll add it to my local customizations. Thanks everyone!
