multi-lingual tei and xml

duncdrum commented 7 years ago

Follow up on #1707. The xml specs define the scope of @xml:lang to include attribute values. The P5 guidelines however ignore this and only cover the element's content: (language) indicates the language of the element content using a ‘tag’ generated according to BCP 47.

This creates further problems for multi-lingual documents, and for the recommendations concerning ODD customisation. Since most default attribute values are 'en' it is possible for users to create valid tei documents like this:

...
<body xml:lang="en">
  <div type="節版">man man lai</div>
</body>

To be clear, the following are fine:

<body xml:lang="zh">
  <div type="節版">慢慢来</div>
</body>

<body xml:lang="zh-Latn">
  <div type="jieban">manmanlai</div>
  <div type="節版" xml:lang="zh">慢慢来</div>
</body>

The xml specs clearly encourage multi-lingual markup by allowing a mix of element and attribute names from different languages (if only from an empty language space combined with one natural space). I take the empty language space to pertain to strings that are attribute values and that are meaningless in a given natural language. The tei guidelines do not address when they consider this to be the case, though. I think that attribute values that are equivalent to natural language vocabulary have to be considered as such, i.e.

'example' = 'en'
'#refMyBook' = ''
'T001' = ''

Otherwise we can have:

<body xml:lang="en">
  <div type="節版">man man lai</div>
</body>

combined with this custom odd

<elementSpec ident="div" mode="change">
  <attList>
    <attDef ident="type" usage="rec" mode="add">
      <valList>
        <valItem ident="節版" mode="change" xml:lang="de">
          <gloss>Das xylographische Register</gloss>
          <desc>Das Register eines schnurgebundenen Xylographischen Kodex</desc>
          ...
        </valItem>
      </valList>
    </attDef>
  </attList>
</elementSpec>

So my tei document is defined as English, with the attribute value 節版 explained in German. Yes, I hate my readers :) Not only do I think that this is bad xml in violation of the specs, it also moves tei away from the common academic practice of providing translation and/or transliteration when working across linguistic boundaries.

Alternatively, a) the guidelines should reflect the xml specs, in that xml:lang extends to attribute values; b) the guidelines should be updated regarding i18n and l10n strategies to make users aware that by defining @xml:lang on any element, its attribute values (default, suggested, or otherwise mandated) need to be adapted accordingly.

duncdrum commented 7 years ago

@lb42

> Well, as the passage I quoted above shows, it is up to an application to decide what it thinks. In the case of the TEI (and this is what we need to make more explicit) "weight" is NOT english. It is "TEIspeak". I think you're attaching too much weight to the @desc="leise" example. I might well (if I am doing a multilingual project) have an ODD in which "leise" "lit" "reads" etc are all available as possible values for the @desc attribute on . That still doesn't make any of them belong to the xml:lang-defined language space.

I don't see how to interpret the example in the xml specs differently, or how the guidelines benefit as an interchange format by defining all natural language terms to be @xml:lang="TEI". This helps neither human nor machine parsers. desc is not derived from any German word; who is clearly derived from English; leise is clearly German and not derived from English or the empty language space.

lb42 commented 7 years ago

Sorry if I am still not making the point clear. I am not for a moment suggesting that the Guidelines say anything about "all natural language terms". I am suggesting that the Guidelines should be more explicit in saying that all names of elements, all names of attributes, and all tokens predefined as possible attribute values belong to a single language space, which is independent of the value of xml:lang. In much the same way that all attribute values with a numeric or temporal datatype do.

duncdrum commented 7 years ago

@lb42 I'm ok with making explicit that names of elements and attributes are in a single language space independent of xml:lang. Which is exactly what the examples in the xml specs show.

I don't see how adding:

> all tokens predefined as possible attribute values

has any benefits. It is stretching the guidelines' conformance with the xml specs, by enabling some really funky markup, for what? What use case could possibly benefit here? Is it really too much to ask not to say:

<measure commodity="銅" type="weight">

but either

<measure commodity="copper" type="weight"> or
<measure commodity="銅" type="重量">

just to clarify

> Your argument would have us consider that e.g. <encodingDesc xml:lang="de"> was also incoherent.

No, xml:lang covers the element's content and attribute values. Your example declares a language space but doesn't show anything that falls into this language space (since it is only the opening tag of an element with no further attributes). If you add @type='dictionary' you violate the specs; if you add @type="Wörterbuch" you don't. Since 'type' comes from the Guidelines, you are no longer free to make it @Typ="Wörterbuch" without further odd customisations.

lb42 commented 7 years ago

Attribute values are constrained in the Guidelines in various different ways. Firstly, and obviously, they have datatypes, so they have to be numbers or dates or whatever. But the "whatever" includes a range of "text-like" options. You can have "any old string of unicode characters" or you can have "a single token" or you can have "a single token taken from a predefined list" (teidata.string, teidata.word, teidata.enumerated). The last two are the most frequent. In the second case, you supply in your ODD a list of the types you want to accept, or recommend, or which you consider exemplary. Mostly, the choices you make in your ODD will be enforced by the schema generated from it. Where that isn't the case, you could add Schematron constraints, or fall back on hand-waving recommendations. I regard your desire for a recommendation that attribute values should come from a single language-space as falling into the latter category. And why should they? Is it unreasonable to say that there is a fixed number of meaningful different values for @type on <measure>? And to propose (English-like) names for those types? I don't think so. Clearly it is much less reasonable to come up with a fixed number of possible types of commodity, so why not permit me to use any value I like, including some for which a non-English name might be more appropriate? (I might want to say "eau-de-vie" rather than "brandy", for example). So I don't buy the notion that having these tokens appear to come from different language spaces matters a hoot.

lb42 commented 7 years ago

p.s. The XML spec says "The language specified by xml:lang applies to the element where it is specified (including the values of its attributes)". I am just guessing but this presumably cannot apply to an attribute with a declared datatype of (say) float or integer or date or IDREF. My argument in a nutshell is that the same applies to teidata.word and teidata.enumerated, and for the same reasons.

duncdrum commented 7 years ago

> so why not permit me to use any value I like

Because, just as with float or IDREF, the xml specs limit what you can put in there. @xml:id="111" is not valid, and neither is @factor="一。三" or @some:float="៧" or @when="28年-11月-7日".
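As a rough illustration of the lexical limits mentioned here, the sketch below approximates the xs:float lexical space with a Python regular expression. The pattern is an assumption based on the XSD 1.1 datatypes spec as I read it, and `XSD_FLOAT` and `is_xsd_float` are names made up for this example; they are not part of any TEI or XML tool.

```python
import re

# Assumed approximation of the xs:float lexical space (after XSD 1.1).
# Only the ASCII digits 0-9 are admitted by this pattern.
XSD_FLOAT = re.compile(
    r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?|(\+|-)?INF|NaN"
)


def is_xsd_float(value: str) -> bool:
    """Return True if the whole string fits the assumed xs:float pattern."""
    return XSD_FLOAT.fullmatch(value) is not None


# The non-ASCII samples are the attribute values quoted in the comment above.
results = {value: is_xsd_float(value) for value in ["1.3", "7", "一。三", "៧"]}
```

Under this reading the two ASCII forms are accepted and the CJK and Khmer forms are rejected, whatever the value of xml:lang on the element.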

> My argument in a nutshell is that the same applies to teidata.word, and for the same reasons.

You are free to encode a German shopping list in TEI using "eau-de-vie" and/or "brandy". But not without declaring valid @xml:lang attributes on elements, where the values of other attributes come under the same scope.

bansp commented 7 years ago

Apart from what Lou says, the XML 'rule' invoked here is seriously messed up by claiming all the attribute space at one go, irrespective of the role and semantics of those attributes (why not have a @translation_equivalent attribute defined as providing a gloss of the content of the element?), or of their provenance/namespace. Please don't try to push the TEI into the mess that this thoughtless rule creates. (We won't let you ;-))

duncdrum commented 7 years ago

@bansp I don't consider the rule messed up at all. I find the notion that Latin alphabet with English vocabulary, Arabic numerals for all xs:ints and Gregorian calendar notation for all xs:date is somewhat universal and neutral rather thoughtless.

By mandating that users express, e.g., numerals in a symbolic system that is foreign to their native language, you already put the burden on them. This rule goes some way towards acknowledging that. And since the TEI claims "any texts, any language, any form" I don't see how us ignoring that part of the spec is helping anybody.

The rule is what it is. Why not kill all xs:datatypes, or move TEI away from xml altogether?

bansp commented 7 years ago

> I find the notion that Latin alphabet with English vocabulary, Arabic numerals for all xs:ints and Gregorian calendar notation for all xs:date is somewhat universal and neutral rather thoughtless.

It is more or less as 'neutral' as having the north pole placed at the top. I don't think anyone has claimed that if you create a system for a non-European culture, you are bound to use these English-language values. But please don't say that you would use non-English/localized values because of that XML rule, which was ignorant of notions like internationalization and localization; it was just a quick way to move on with the spec.

> The rule is what it is.

Yeah, like several other places in XML, it's messed up. If the TEI had been more mature back when it was in a cradle and no one cared about i18n, it would have influenced XML and XML would be less messed up. But it didn't and we have to live with the mess, until we move on.

cmsmcq commented 7 years ago

I will try not to deal with every disagreement or misunderstanding exhibited here and limit myself to observing:

1 The XML spec does not require that the value of @xml:lang be interpreted as applying to all attributes -- I take the intent to be that it applies to those with free natural-language content.

The spec says that xml:lang is used to "specify the language used in the contents and attribute values of [the] element". It seems to me that this can apply only to content and attribute values which do in fact use a language.

IDs, IDREFs, numeric values, ISO date values, enumerated values from a closed list, and the like seem to me out of scope for xml:lang, even though enumerated values (like element and attribute names) are often drawn from a natural language. The same reasoning seems to me to apply equally to values from a semi-closed list ('suggested values').

2 On the question of which content and attributes are to be taken as using a language (and thus to be taken as being described by xml:lang), the spec says

Applications determine which of an element's attribute values and
which parts of its character content, if any, are treated as
language-dependent values described by xml:lang.

It seems to me to follow that TEI is acting consistently with the XML spec if it says that xml:lang applies to content but not to attribute values; that is, if it specifies that the set of attribute values to be "treated as language-dependent values described by xml:lang" is the empty set.

3 The phrase 'language-dependent values described by xml:lang' might bear either of two interpretations:

3.1 If we take 'described by xml:lang' as a redundant gloss on 'language-dependent values', so that applications are expected to say only what is language-dependent and what is not, then it would be dubious practice to say that an attribute X is not described by xml:lang even though its values are documented as being a natural-language description of something. (I have not found any such attributes in P5 in the few minutes I've just spent looking.)

3.2 A more aggressive interpretation of the spec might take 'described by xml:lang' as non-redundant, so that applications are empowered to classify particular attributes and bits of content as

(a) language-dependent values described by xml:lang, (b) language-dependent values not described by xml:lang, (c) non-language-dependent values

I doubt very much that the records of the discussion within the XML WG would show clearly that the WG made a clear choice between 3.1 and 3.2. Rereading this part of the spec now, 3.1 seems to fit better with the wording of the rest of section 2.12, but I don't see a way to prove beyond doubt that an application that relies on interpretation 3.2 is violating the letter or the spirit of the XML spec.

4 In general, natural-language content does much better as content than as attribute values. Internal structure like bidi and ruby markup is possible for content but not for attribute values, not to mention the other kinds of fine-grained markup defined by TEI. And many pointer mechanisms do better with content, so it's more easily used as the target of hyperlinks.

For those reasons, I am not deeply troubled by the proposition that xml:lang is problematic because it makes it harder to define multiple attributes with the same material in different languages. On reading 3.1, the proposition is true; on reading 3.2 it's not. But in either case, it is quite likely to be a bad idea (and the interpretation of xml:lang is the least of its problems).

5 The proposition that enumerated values (and possibly others) are not described by xml:lang because they are not English, or German, but some other language one could call 'TEI', is plausible enough. I wouldn't want to go there because it means that essentially every element will be potentially or actually multilingual, which makes this view a bad match for the view implicit in the specification of xml:lang, which is that the material in scope for any particular value of xml:lang will often be monolingual.

To those who wish the XML spec presented a more nuanced view of language use in XML documents, I can only say I wish so, too. In 1997, when the XML WG borrowed the global lang attribute of TEI P3 and wrote it into the XML spec, it seemed like useful progress to get some recognition even of the simple cases into the spec. A really nuanced account would have been complicated enough to scare off many WG members (even if anyone in the WG had been in a position to specify one, which I doubt).

6 The proposition that all attribute values (not just unconstrained natural-language phrases) within the scope of an xml:lang value should be translated into that natural language would make polyglot documents essentially impossible to validate using current validation technologies. For the technologies of 1997, delete 'essentially'.

I think that amounts to a reductio ad absurdum of the proposition. The alternative would be to assume that the XML WG intended xml:lang to be incompatible with the use of DTDs for validation. That assumption is difficult to reconcile with the text of section 2.12.

I apologize for the length of this posting. But this is the short version.

bansp commented 7 years ago

Hello Michael. Thanks for your gentle and informative note. I've found the P3 statement on `@lang`:

lang indicates the language of the element content, usually using a two- or three-letter code from ISO 639

so it seems right to assume that the XML WG made it 'just a little bit wider' in application. :-) And I fully appreciate that it was better then to gain at least a little foothold (the more so that it occupied as much of the terrain as was imaginable).

In order to address Duncan's legitimate multilingual (and multi-cultural) concerns, maybe the ITS (Internationalization Tag Set) is worth mentioning as the way to handle exactly such issues in a kind of overlay on the existing specification: https://www.w3.org/TR/its20/

I could imagine bringing this up for consideration in the LingSIG: we could see whether and how the current Guidelines could implement levels 1 and 2 of the ITS. And when I say "in the LingSIG", I don't mean to be in any way exclusive but rather to indicate the technical forum to discuss this and potentially create a structured feature request.

duncdrum commented 7 years ago

@cmsmcq thank you for the calm response to this heated debate. I guess @jamescummings and I will have to exchange catties reciprocally.

re 1) I agree. The problem with semi-closed list for attribute values in the context of tei lies in the fact that the guidelines encourage defining and glossing possible values via odd which is TEI, which in turn is xml.

re 2) I still don't see what TEI could possibly gain by doing this. The xml specs don't claim to pertain to "all documents, in any language, or form without restrictions"; tei does. In all other instances, we follow the xml specs more closely. Why this exception for @type or @commodity, but not @unit and a cluster of special elements, and attributes for @when? But, I accept that it is consistent with the specs. In fact if all markup is per definition in an empty language space, why bother at all with xml:lang? @tei:dialect-with-army could allow Lou to use whatever he wants to as its value including iso-codes.

re 3) I take it that so far everybody is in agreement that the guidelines should be more specific in which interpretation they adopt.

re 4) agreed. TEI, however, has natural language content as attribute values, so the question is how far our current practices help machine or human readers to parse this. I still think that precluding all attribute values from being under the scope of @xml:lang does more harm than good. I would love to see a concrete example including element definition, contents, and attributes where mixing actually solves an encoding problem. I only deal with multi-lingual documents and encodings, and have never encountered the need for this.

re 5) agreed, both as plausible, but not where I would go given the whole "all documents, any language, and form" claim s.a.

re 6) this seems to be a slippery slope. I don't think that even in mono-lingual documents, current validation technologies can perform precise tests as to whether the language of an element's content actually matches its description in @xml:lang. Regardless, for the example that started this whole debate, <measure unit='斤' commodity="銅" quantity='2516' type='weight' xml:lang="ja">銅二千五百十六斤十両二分四銖</measure>, one could theoretically test the string contents of all textual attribute values to check for unicode ranges; weight would not fall under ja. See 1): what's the benefit to TEI of encouraging a natural language mismatch between the gloss explaining attribute values and those very values? In any other academic context, explaining incommensurable or unique concepts such as 道 is accompanied by translations or transliterations; why should TEI break with that convention?
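A minimal sketch of the kind of Unicode-range test hinted at in re 6), assuming a crude mapping from language tags to Unicode character-name prefixes. The `SCRIPTS` table and the function name are hypothetical and only meant to illustrate the idea; a real check would need proper script data and would have to decide which attributes count as textual.

```python
import unicodedata
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, very rough mapping of language tags to the prefixes of the
# Unicode character names one would expect for letters in that language.
SCRIPTS = {"ja": ("CJK", "HIRAGANA", "KATAKANA"), "zh": ("CJK",)}


def mismatched_attributes(element):
    """List attributes whose letters fall outside the scripts assumed for xml:lang."""
    prefixes = SCRIPTS.get(element.get(XML_LANG))
    if prefixes is None:
        return []  # no expectation recorded for this language tag
    flagged = []
    for name, value in element.attrib.items():
        if name == XML_LANG:
            continue
        letters = [ch for ch in value if ch.isalpha()]
        # Values without letters (e.g. plain numbers) are left alone.
        if letters and not all(
            unicodedata.name(ch, "").startswith(prefixes) for ch in letters
        ):
            flagged.append(name)
    return sorted(flagged)


measure = ET.fromstring(
    "<measure unit='斤' commodity='銅' quantity='2516' type='weight' "
    "xml:lang='ja'>銅二千五百十六斤十両二分四銖</measure>"
)
flagged = mismatched_attributes(measure)
```

For the element from this thread, only @type is flagged: @unit and @commodity consist of CJK ideographs, and @quantity contains no letters at all.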

I see a few possible solutions to this issue:

a) @bansp yes, I would greatly welcome the use of ITS as part of the guidelines. There are problems with ITS as well, but it would certainly be a big step forward.
b) Instead of maintaining that the xml specs don't really mean what they say they mean, and that TEI doesn't really do what it looks like it does, we could simply update the xml:lang recommendations to match the wording of the current xml specs, instead of the P3 specs. This would entail informing users that if their intended attribute values appear in a dictionary or can be used in a sentence, they are in fact under the language scope of at least that natural language, and that tags need to be applied accordingly.
c) Following @bansp's earlier comments, we remove ourselves more aggressively from the shackles of xml conformance, and allow users to tag numbers and dates however their documents do, and use whatever they want for attribute values.

P.S.: @bansp not sure where you were going with the maps example. None of the maps in the documents that I work on have north pointing up, neither do any others from that continent for a good thousand years or so.

bansp commented 7 years ago

> not sure where you were going with the maps example

It was not an example. Rather a simile to show that I share your concern concerning a potential a-reflexive approach to multicultural issues with vision narrowed by somewhat ingrained Euro-centric traditions. However, I expressly do not share what I perceive as a literalist approach attempting to mould a living, evolving, well-researched framework to forms set by a bible that was at the time of its creation largely ignorant of many of the concerns that the current TEI addresses.

duncdrum commented 7 years ago

@bansp, I don't know who that "we" is you speak for, nor why you continue to insinuate about my motivations, instead of addressing the issue at hand. If you or the LingSIG want to host a discussion about ITS, the bible, or messed-up agendas, please open a separate issue, to have that discussion there.

The issue, as mentioned in the OP, is that the P5 guidelines diverge in their definition of the scope of xml:lang from the one found in the xml specs. One says element contents, the other says element contents including attribute values.

This has led to discussions about how to apply xml:lang in tei documents. There are two diverging interpretations, i.e. Michael's 3.1 and 3.2. Both seem in line with the xml specs, but neither one is self-evident. Whatever Piotr's or my view on the best option is, can we at least agree that the guidelines should be more specific, to avoid similar confusion among editors in the future?

I'm willing to prepare a PR in line with either 3.2. or 3.1, if council can agree on which one it should be. The options seem pretty straight forward:

a) Either the note and example are updated to reflect the opinion that all tei attribute values are per definition TEIspeak and therefore don't fall under the scope of the accompanying xml:lang.
b) Or we account for the fact that by allowing "any convenient method" for constructing certain attribute values, TEI can include natural language scoped attributes, in which case the description and example should be updated to show and elaborate on this.

bansp commented 7 years ago

@duncdrum I thought it would be impolite not to reply to "the maps example", so I mistakenly did. I admit that I did that together with me.

> continue to insinuate about my motivations

nice one!

hcayless commented 7 years ago

If I can speak with my Council hat on, I think it's quite clear from the XML Spec that the scope of @xml:lang is all content of the element, including its attributes. BUT @cmsmcq's point 1 must be correct. If in XHTML one had

<form xml:lang="de">
...
</form>

you wouldn't therefore expect to see inside it

<input type="Knopf">

because the types of input in an HTML form are drawn from a constrained list, essentially a datatype. To do otherwise would be madness. The argument I'd make is that any attribute with a TEI-defined datatype is out of scope for @xml:lang in the same way that @type="button" is for XHTML. Only attributes directly defined as xs:string, xs:token, or xs:normalizedString (pretty sure there are none of the latter) would be in scope. There are only a small handful of these, and some of those are further constrained. As part of the development of P5, there was a concerted effort to remove "free-text-bearing" attributes, and so there should be few, if any, that are in scope for @xml:lang.

The upshot of this is that the Guidelines text is correct for practical purposes, even if the technical picture is more complex. Only the textual content of elements is affected by @xml:lang because none of the attributes contain free text. That said, I'm all for being more precise in the GLs language, as long as it won't confuse people.

duncdrum commented 7 years ago

@hcayless I was pondering a very similar example before raising the issue. Looking at the list of potentially in-scope teidata.text attributes it includes valItem/@ident which is necessary to construct constrained lists in odds.

So in many practical cases teidata.enumerated attributes such as @type will be constrained in TEI within the language scope of the relevant section in the ODD, which further constrains the number of possible cases for language clashes between attribute values or element contents. To be clear I think that this is a good thing.

As for the question if any attribute value in tei can be under the scope of xml:lang, if I understand you correctly you are saying yes in the case of teidata.text?

Interestingly enough, teidata.word attributes are not defined as string or token, so wouldn't necessarily be in-scope candidates according to your definition. (I personally think that maybe some attributes should be switched from that list to enumerated, or text, but that is a separate issue.)

Addendum: so far this discussion seems to ping-pong between everything or nothing being under the scope of xml:lang when it comes to TEI attribute values. I'm not convinced by either position, hence my desire to be more specific about when attribute values are language scoped in tei and when they are not.

hcayless commented 7 years ago

I could possibly be persuaded that teidata.text attributes should be in @xml:lang's scope, but first I'd like to circle back to an earlier point that zipped by me on my initial read. The XML spec states that

Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang.

I'm pretty sure "Applications" here means not software agents consuming XML documents, but XML-based languages, like TEI, so surely as an XML application, TEI is within its rights to say what the scope of @xml:lang is?

Anyway, I agree with your addendum, and I think the solution will have to do with whether TEI datatypes are in scope or not. The content model for teidata.word is:

<content>
 <dataRef name="token"
  restriction="(\p{L}|\p{N}|\p{P}|\p{S})+"/>
</content>

So they are xs:tokens, with the additional restriction that whitespace is banned. My feeling is that even attributes that are defined as teidata.text could be constrained, and any customization that used them might be well-advised to implement constraints on them. The Guidelines just don't necessarily have anything to say about what those constraints should be.
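To make the effect of that restriction concrete, here is a small sketch that emulates the pattern with Unicode general categories. Python's `re` module has no `\p{...}` support, so the check uses `unicodedata` instead; the function name is made up for this example, and xs:token whitespace normalisation is deliberately not modelled.

```python
import unicodedata

# Emulates the restriction (\p{L}|\p{N}|\p{P}|\p{S})+ by looking at the
# major class (first letter) of each character's general category.
ALLOWED_MAJOR_CLASSES = {"L", "N", "P", "S"}


def is_teidata_word(value: str) -> bool:
    """True if value is non-empty and every character is a letter,
    number, punctuation mark, or symbol."""
    return bool(value) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES for ch in value
    )


checks = {v: is_teidata_word(v) for v in ["節版", "eau-de-vie", "man man lai", ""]}
```

On this reading "節版" and "eau-de-vie" pass, while "man man lai" fails because of the spaces. A side effect worth noticing is that combining marks (category M) are not in the pattern, so a value written with a decomposed accent, such as "e" followed by U+0301, would be rejected as well.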

duncdrum commented 6 years ago

Coming at this from another part of the guidelines.

The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the xml:lang attribute, including all elements and those attributes, if any, where a language might apply. (This excludes all attributes where a non-textual datatype has been specified, for example tokens, boolean values, dates, and predefined value lists.)
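The subtree scoping described in this passage can be sketched as a simple inheritance walk. The example below uses Python's standard `xml.etree.ElementTree`; `effective_langs` is a hypothetical helper name, and the sample document is the zh-Latn example from the opening post. It only computes which language tag is in force at each element; it says nothing about which attributes that tag applies to, which is the open question in this thread.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(element, inherited=""):
    """Yield (element, language) pairs, inheriting xml:lang down the subtree."""
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_langs(child, lang)


doc = ET.fromstring(
    '<body xml:lang="zh-Latn">'
    '<div type="jieban">manmanlai</div>'
    '<div type="節版" xml:lang="zh">慢慢来</div>'
    "</body>"
)
langs = [(el.tag, el.get("type"), lang) for el, lang in effective_langs(doc)]
```

Here the first div inherits zh-Latn from body, while the second overrides it with zh.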

Looks like the definitions in CH and in ref-att.global are not in sync. We now have a third definition next to the xml specs: elements (ref-att.global), elements + attribute values (xml specs), elements + attributes (CH).

"where language might apply" seems very broad and more in line with the view that not everything in the TEI is a priori tei:speak. So are xs:strings and teidata.text textual datatypes or not, apparently teidata.word and xs:tokens aren't? How are encoders to define value lists via odd without having the value of @ident being language scoped? Especially when these values are taken directly from a list of convenient natural language terms.

If I am really the only one seeing a problem here, then go ahead and close this issue.

dariok commented 6 years ago

I think that it is very necessary to make sure that our own definition in the GL is consistent. But I think that this will entail quite some work.

The reason why we need a detailed discussion, and in the end possibly a reassignment of certain attributes to different data types, struck me when thinking about @type. It is teidata.enumerated, which is defined as teidata.enumerated = teidata.word, and about the latter we already said that they are based on xs:tokens and hence likely fall under the scope of @xml:lang.

And I am deeply troubled by saying that the value of @type is (or has to be) in the same language as the text's content. Assume you are encoding a deed in Middle High German. You would put @xml:lang to state the language. One structural part of a deed often is an 'arenga' (sometimes, esp. in the former West Roman Empire, called a 'prooemium') which is a rather modern term based on an Old Italian word not present in MHG. Actually, there is no word in MHG for this very concept. Thus, if we assume @type to fall under the regime of @xml:lang, we will run into problems as it is quite likely that the language of the content does not have an adequate word for the concept to be described.

So, if we are going to go down this path and explicitly declare which data types are to be within the scope of @xml:lang (which may well be the right thing to do), we will have to go through the attributes affected and discuss them in detail to see the effects.

lb42 commented 6 years ago

It should certainly be made explicit that attributes like @type with a datatype of teidata.enumerated have non-linguistic values, and are therefore not affected by xml:lang. I don't think there's any need for much more than that.

duncdrum commented 6 years ago

@lb42 two questions:

a) So what are the attributes "where a language might apply" mentioned in the GL, if not teidata.word?
b) Why are teidata.numeric attributes not accepting numerics in "in any natural language, of any date, in any literary genre or text type", via their token definition, which simply repeats the restriction to American Standard via \d?

duncdrum commented 6 years ago

@dariok I see the problem that you describe, and come across similar problems often myself (dealing with Chinese sources). Thankfully we are not the first ones to face this kind of problem.

How do you normally deal with foreign concepts, or the source's technical vocabulary, when submitting an English language article to a publisher? My own rule of thumb: if it's English enough for a publisher, it's ok as an attribute value, providing multi-lingual gloss and desc elsewhere in the tei. I.e. use a different attribute value and define arenga elsewhere in the odd or tei file, where you can put another @xml:lang attribute on it.

hcayless commented 6 years ago

I do think we should be very explicit about which attributes (and text content of elements for that matter) fall under the scope of @xml:lang, and further, I think that the right way to do that is to be explicit about which TEI-defined datatypes (if any) might be in its scope. To be clear, I don't think there's anything wrong with the prose you quote, @duncdrum, because there's no way of knowing what attributes a random TEI document is using. It might have locally-defined attributes we know nothing of, for example.

It seems to me that we're generally agreed that constraining attribute values takes them out of the realm of free text and therefore presumably out of the scope of a language tag, because even if all the values are taken from a particular single language, the act of constraining the set of values has (presumably) shrunk the set of available terms down to the point where it doesn't work as a proper human language anymore. The terms available as values for @break on <lb/>, ( 'yes' | 'no' ), might be taken from English, but do not themselves constitute a human language, and therefore cannot be said to be en-US or anything else.

For various good reasons, TEI has moved away from what you might call "free text" attributes. I'd like to think (though I'm not sure I could promise it), that P5 doesn't have any of them. I'd define a "free text" attribute here as one whose values are not subject to being constrained, and so could be drawn from any human language, and therefore might be in scope for @xml:lang. We do have many attributes that the Guidelines do not constrain, because those constraints are up to the project or encoder to specify. But that does not make them "free text" attributes.

What I'd like to suggest is that TEI should:

1. Specify in the Guidelines that all attributes with values currently set to teidata.* are out of scope for @xml:lang.
2. Consider creating a new datatype, maybe teidata.freetext, that is explicitly in scope for @xml:lang. If you wanted an attribute to be governed by @xml:lang, you could redefine it in your ODD to be teidata.freetext. I don't think the Guidelines themselves would use it, except in examples.
3. Check that we're not using xs:string or xs:token directly as datatypes where we shouldn't be.
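To make the second suggestion concrete, a minimal sketch of how it might look in a project ODD. Note that teidata.freetext does not exist in P5; the definition below is only one guess at its shape, and w/@lemma is picked arbitrarily as the attribute to opt in.

```xml
<!-- Hypothetical datatype: plain strings, declared to be in scope for xml:lang. -->
<dataSpec ident="teidata.freetext" mode="add">
  <desc>Natural-language text whose language is governed by xml:lang.</desc>
  <content>
    <dataRef name="string"/>
  </content>
</dataSpec>

<!-- Opt a single attribute in to the new datatype (arbitrary example). -->
<elementSpec ident="w" mode="change">
  <attList>
    <attDef ident="lemma" mode="change">
      <datatype>
        <dataRef key="teidata.freetext"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

The datatype would carry no extra validation by itself; its purpose would be to document, in a machine-readable way, which attributes a project treats as language-bearing.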

duncdrum commented 6 years ago

@hcayless the only thing "wrong" with the section I quoted, in my view, is that it clashes with the section on @xml:lang quoted in the OP. They should be more consistent with each other: instead of "all attributes where language might apply" in one and "element contents" in the other, both should address "attribute values" (following the wording of the XML specs). If the CH section talks about attributes that currently don't exist, that should be clarified as you suggested. If, for the purposes of validating against tei-all, attributes that are intended to be constrained by projects count as constrained even when projects don't put any constraints on them, that seems worth pointing out as well.

So while not my preferred choice, 1 to 3 do provide a solution to this issue. I'd still like to suggest adding:

Check the token constructors of other teidata.* attributes, and determine whether they are needlessly restrictive in assuming ASCII for all. Unicode covers many more ways of, e.g., counting than [0-9] or \d, so why not use them in tei:speak?
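One possible shape for such a relaxed datatype, as a sketch only: a project-defined dataSpec whose pattern accepts any Unicode decimal digit rather than only [0-9]. The name data.unicodeCount is made up for illustration, and the fragment assumes that @restriction on dataRef ends up as a pattern facet in the generated schema.

```xml
<!-- Hypothetical datatype: a count written with any Unicode decimal digits. -->
<dataSpec ident="data.unicodeCount" mode="add">
  <desc>A non-negative whole number in any script's decimal digits.</desc>
  <content>
    <!-- \p{Nd} is the Unicode "decimal digit" category, so this accepts
         42 as well as, for instance, the Devanagari form ४२. -->
    <dataRef name="token" restriction="\p{Nd}+"/>
  </content>
</dataSpec>
```

The trade-off is that values validated this way are no longer guaranteed to be valid xs:nonNegativeInteger lexical forms, so downstream software would need to normalise them before doing arithmetic.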

dariok commented 6 years ago

@duncdrum: I don't think your earlier analogy with submitting a paper is completely accurate here. If I encode a text in a modern, living language, it might work. You can introduce the words and explain them, which is usually the way to go when using foreign concepts.

When encoding a text in a 'dead' language*, there may be nothing you can do: I cannot introduce a valid Middle High German word to say 'arenga'. Hence, I cannot give a useful @type to my, say, tei:div that contains the arenga.

Thinking a bit more about this, I've come to the conclusion that a rule that xml:lang covers the values of all attributes indiscriminately is rather flawed. Two of the reasons I think so:

Consider an English text that contains a stretch of text in Greek which you want to tag as a person's name. You cannot use <rs type="person" xml:lang='grc-Grek'> as 'person' clearly is neither Greek nor written in Greek. You would have to introduce a new value, possibly only used in this location, and declare - in a way that is easily processed - that it means exactly the same as the one used elsewhere. When it comes to extracting the data, this will quickly evolve into a nightmare.

Furthermore, how would one determine which words actually 'belong' to a language? If you use a dictionary, which one? Only a few languages, such as French and German, have a form of 'prescribed' language (and German has significant differences between Germany, Austria, and Switzerland, as does English between the UK and US).

As far as I can see, there is no real way out of this if you stick to such a very strict rule. Hence, I support the idea to clearly define which attributes the TEI considers to fall into @xml:lang's scope.

*By 'dead' I mean a language in which new words cannot be formed except by 'primitive' means, usually creating compounds. Thus, Latin counts as dead as well as Middle High German: you cannot create new words that can go through a process to ensure they are understood by all those using that language. I am well aware that this is not a valid linguistic definition, but it serves to illustrate my point.

duncdrum commented 6 years ago

@dariok yes, working on dead languages can be tricky (the example that started all this is from a 9th century Japanese source … It's dead, Jim). My point is that this wasn't news to people in the 70s, nor are these really unprecedented circumstances nowadays.

Despite all the difficulties you mentioned, editors manage in practice to settle questions of US vs GB spelling. Academic disciplines have established practices for calling something an arenga although the source itself could not possibly have had access to this concept. They tend to have similar conventions concerning the use of glosses or translations when talking about arenga in modern English prose. The intention behind my example was to show that producing a TEI edition, just like writing an article, requires us to navigate these conceptual gaps between dead-language source and contemporary concepts.

I'm not suggesting that TEI attempt to formalise any such disciplinary conventions, and I'm also not proposing indiscriminate dicta for ALL attributes. But in order to live up to the "any text, any language, any time" motto, the Guidelines should, in my view, be more explicit about how to tackle such issues.

The nightmare that you describe cuts both ways in this. Declaring all attribute values to be tei:speak allows for combinations that no human or machine parser can handle, that break with established academic practices, and that effectively expect all encoders to be able to operate in English, as it's the closest natural language to tei:speak.

I'm not sure what is more madness, <form type="Knopf"/> or

<div type="T1" xml:lang="gmh">Uns ist in alten mæren wunders vil geseit...</div>
...
<taxonomy xml:id="deed">
  <category xml:id="T1">
    <catDesc xml:lang="en">arenga</catDesc>
    <catDesc xml:lang="la">prooemium</catDesc>
  </category>
</taxonomy>

<div xml:lang="en"> when in Greece I stumbled upon 
this line <foreign xml:lang='grc-Grek'>
εἰσιν καὶ <rs type="ἰδιώτης" >ἰδιῶται</rs> 
ἐθαύμαζον ἐπεγίνωσκόν</foreign> 
I wonder what it means
</div>

But either way, my point about the GLs prose needing some love seems to have come across.

dariok commented 6 years ago

@duncdrum I think we're not too far apart. As I said earlier, I completely agree that there is a need to make things clear and unambiguous in the GL. And I agree that it is not a good idea to consider everything to be tei:speak. I really think the best approach might well involve going through all attributes and seeing what the ramifications are of considering each as one or the other. While this will certainly be a lengthy process and will involve cases that are far from clear-cut, I think it's necessary to tackle the problems we have at hand.

Regarding your first example: T1 is not a valid word in Middle High German ;)

While reworking the wording of the GL, the note in http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-teidata.enumerated.html should be updated as well, to show that it is actually mandatory to provide a list of definitions and that its values are not in the scope of @xml:lang. In doing so, it might be necessary to remove the definition teidata.enumerated = teidata.word, which to me implies identity, including properties, and we seem to have agreed that teidata.word is likely to be under the language regime.

hcayless commented 6 years ago

@duncdrum I'm afraid I see no ambiguity in the passages you're discussing. The first states that @xml:lang affects the content of the element bearing it; the second that attributes of elements which are descendants of an element with @xml:lang might be in its scope. There's no contradiction there. Attributes of a child element are part of the content of the parent.

Furthermore, apart from the language tags in your examples above (which I'm afraid are mostly wrong), I don't see any problem. You're absolutely free to have a type="ἰδιώτης" if you want, and if that makes sense for your project. You can even pick words from multiple languages and scripts if you want. It's a good idea for the sake of everyone's sanity if you're consistent, of course. I do take your point about numeric content of attributes, but remember that attributes are generally there to provide things like normalized values, so that, e.g. software agents can do math with them, and those agents, for historical reasons, tend to prefer numbers formatted with the character range [0-9].

I have thought of one attribute that might sensibly be governed by @xml:lang, and that's @lemma on <w>, which seems like it could be in scope for @xml:lang, though even there we might not want it to be.

sydb commented 6 years ago

But if that’s the case, @hcayless, then it is more an argument for changing w/@lemma to a child element than for allowing it to be governed by @xml:lang, per the War on Attributes.

(The War on Attributes was the deliberate effort by the TEI to avoid any and all attributes whose value would be “free text”, precisely because one might have to specify the language of such text, and because one might need to represent a character outside of Unicode within such text. There were many @casualties, but from the ashes sprang some interesting constructs, like <choice>.)

martindholmes commented 6 years ago

I agree with @sydb here, but @lemma isn't the only one; there's also @reason:

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-supplied.html#tei_att.reason

and I'm sure there are a few more.

bansp commented 6 years ago

Hi Hugh, @lemma is actually not free text, it's a conventional identifier (not necessarily unique, or potentially unique in connection with other features inside an entry, or, in this very case, in connection with grammatical attributes that are on the plate, or at least getting closer to the plate with each passing month... ;-) ).

hcayless commented 6 years ago

@sydb, @martindholmes, @bansp: I didn't actually say I thought @lemma was free text, I said @xml:lang might apply to it. @lemma is clearly constrained, it's just that it's constrained to the (rather large) set of word forms in a language that might serve as headwords in a dictionary of that language. It's essentially a key. BUT, I still think it might (though maybe not always) make sense to have it in the scope of @xml:lang.

@martindholmes, EpiDoc restricts @reason. I don't see why you wouldn't...

martindholmes commented 6 years ago

@hcayless I agree that @reason should be restricted, but the base definition of "one or more words" encourages people to use it for free text, I think. Perhaps we could constrain that a bit.

ebeshero commented 4 years ago

Council recommends discussing this with the internationalization group to work on drafting a recommendation.

JanelleJenstad commented 3 years ago

Council SVF2F subgroup recommends discussing this ticket at an upcoming Thursday Council meeting, for which everyone has read the ticket in advance and comes prepared to discuss.

hcayless commented 3 years ago

Council thinks the answer to this is that for TEI, the vast majority of attribute values are by definition out of scope for @xml:lang. TEI almost universally avoids natural language attribute values. But we should make sure the definition of @xml:lang is in sync with the prose, so we will do that.

TEIC / TEI

multi-lingual tei and xml #1721