@xml:lang - Githubissues

SJagodzinski commented 3 years ago

Language of Element

Replace xml:lang with optional attribute @languageOfElement with data type NMTOKEN. Use @languageOfElement in all non-empty elements.

Creator of issue

Silke Jagodzinski
TS-EAS: EAC-CPF subgroup
silkejagodzinski@gmail.com

Related issues / documents

- Remove xml ns to align with EAD 3 #27
- @xml:lang: adopt EAD 3 solution #28
- Language codes: adopt EAD 3 solution #29
- @scriptCode: remove and adjust tag library for @xml:lang/@lang Attribute #30

EAD3 Reconciliation

Summary: Indicates the language of the content of an element. Content of the attribute should be a code taken from ISO 639-1, ISO 639-2b, ISO 639-3, or another controlled list, as specified in the langencoding attribute in <control>. May be used consistently in a multi-lingual finding aid to specify which elements are written in which language. Available on all non-empty elements.

Data Type: NMTOKEN

Context

@xml:lang XML Language

Summary: Two-letter language code from the IANA registry as dictated by the W3C specification.

Description and Usage: The xml:lang may occur on any element intended to contain natural language content whenever information about the language of the content of this element and its children are needed. xml:lang should be used when the language of the element differs from the Language Code declared in the languageCode attribute on the element within the element. The values in the list are taken from the IANA Registry (http://www.iana.org/assignments/language-subtag-registry). The use of the IANA Registry code for languages in this context is outlined in the W3C specification. The syntax is specified at: http://www.w3.org/International/articles/language-tags/.

Data Type: IANA Registry for language codes.

Solution documentation: agreed solution for TL and guidelines

Summary: Indicates the language of the content of an element. Content of the attribute should be a code taken from ISO 639-1, ISO 639-2b, ISO 639-3, or another controlled list, as specified in the langencoding attribute in <control>. May be used consistently in a multi-lingual entity description to specify which elements are written in which language. Available on all non-empty elements.

Data Type: NMTOKEN

May occur within: <abstract>, <address>, <addressLine>, <agencyCode>, <agencyName>, <agent>, <alternativeSet>, <biogHist>, <chronItem>, <chronItemSet>, <chronList>, <citedRange>, <componentEntry>, <contact>, <contactLine>, <conventionDeclaration>, <date>, <dateRange>, <dateSet>, <description>, <descriptiveNote>, <event>, <eventDateTime>, <eventDescription>, <existDates>, <fromDate>, <function>, <functions>, <generalContext>, <geographicCoordinates>, <head>, <identityId>, <item>, <language>, <languageDeclaration>, <languageUsed>, <languagesUsed>, <legalStatus>, <legalStatuses>, <list>, <localControl>, <localDescription>, <localDescriptions>, <localTypeDeclaration>, <maintenanceAgency>, <maintenanceEvent>, <maintenanceHistory>, <mandate>, <mandates>, <nameEntry>, <nameEntrySet>, <occupation>, <occupations>, <otherAgencyCode>, <otherEntityType>, <otherEntityTypes>, <otherRecordId>, <p>, <part>, <place>, <placeName>, <placeRole>, <places>, <recordId>, <reference>, <relation>, <relationType>, <representation>, <rightsDeclaration>, <setComponent>, <shortCode>, <source>, <sources>, <span>, <structureOrGenealogy>, <targetEntity>, <targetRole>, <term>, <toDate>, <useDates>, <writingSystem>

Example encoding

fordmadox commented 3 years ago

The more I think about it, the more I think it's a mistake to follow EAD3 on this one. I don't think that we should ignore https://www.w3.org/TR/xml-i18n-bp/, specifically this recommendation:

It is not recommended to use your own attribute or element to specify the language of the content. The xml:lang attribute is supported by various XML technologies such as XPath and XSLT (e.g. the lang() function). Using something different would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.

I've got the alpha schema set up to use the new attribute names, but I would also like to eventually create a branch of the schema that removes all of those attributes (aside from languagecode and scriptcode) and instead uses the "xml" namespace as intended.

Although we could have EAD/S continue to do its own thing and ignore best practices, it seems like a bad idea not to make the standard more interoperable with other XML standards like TEI, DITA, DocBook, MODS, etc., all of which use xml:lang, as well as RDF and other data serializations that also seem to have settled on doing the same. Why make it more difficult to move between all of those and require a local mapping to do so (and lose out on built-in features in XPath, etc.)? Just my two cents 😄

kerstarno commented 3 years ago

I have to admit that I am still not convinced about the argument's strength to merit newly introducing @xml:lang in a future version of EAD.

Assuming that we did, a few additional thoughts:

In my opinion, if we (re)introduce @xml:lang, we should include ALL attributes from the XML namespace, i.e. also (re)introduce @xml:id, @xml:base, and @xml:space, not only @xml:lang.
We still would have to think about the EAS' recommendation with regard to which language and script codes to use as we'll also have to think about @languageCode and @scriptCode as a standardised representation of the <language> and <writingSystem> elements used with <languageUsed>, <langmaterial>, and <languageSet>.
We would need to think about this anyway, as the only specification that the XML namespace itself gives is that @xml:lang is character data. The recommended values themselves come from IANA.
Looking at the definition of xsd:language as mentioned in #97 for its use by RDF, this is more specific than what's given with the XML namespace. xsd:language uses a pattern - [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})* - against which it validates (see https://www.w3.org/TR/xmlschema11-2/#language) and the following formats seem to be the most common:
- For ISO-recognised languages, the two- or three-letter, (usually lowercase) language code that conforms to ISO 639, optionally followed by a hyphen and a two-letter, (usually uppercase) country code that conforms to ISO 3166. For example, en or en-US;
- For languages registered by IANA, i-langname, where langname is the registered name. For example, i-navajo;
- For unofficial languages, x-langname, where langname is a name of up to eight characters agreed upon by the two parties sharing the document. For example, x-Newspeak.
- (See: http://www.datypic.com/sc/xsd/t-xsd_language.html)
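
As a rough illustration of the pattern check described above, here is a minimal Python sketch. The function name `matches_xsd_language` is hypothetical; it tests only the lexical pattern quoted from the XSD specification and does not check whether a subtag is actually registered with IANA or defined by ISO.

```python
import re

# Lexical pattern for xsd:language, as quoted in the comment above.
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def matches_xsd_language(value: str) -> bool:
    """Return True if the whole string matches the xsd:language pattern."""
    return XSD_LANGUAGE.fullmatch(value) is not None


if __name__ == "__main__":
    # Candidates taken from the formats listed above, plus two malformed ones.
    for candidate in ["en", "en-US", "i-navajo", "x-Newspeak", "en_US", "abcdefghi"]:
        print(candidate, matches_xsd_language(candidate))
```

Note that the three-letter ISO 639-2b codes such as `fre` also pass this check, so the pattern alone does not tell the code lists apart.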

kerstarno commented 3 years ago

Btw - just found this in the MODS user guide (https://www.loc.gov/standards/mods/userguide/attributes.html#lang):

citation starts

lang @lang indicates the language of the content of an element, using a code from ISO 639-2/b.

Example

<name type="personal">
  <namePart type="given">Jack</namePart>
  <namePart type="family">May</namePart>
  <namePart type="termsOfAddress">I</namePart>
  <description lang="eng">District Commissioner</description>
  <description lang="fre">Préfet de région</description>
</name>

xml:lang @xml:lang serves the same purpose as @lang, but follows the W3C documentation that indicates using the IANA language subtag registry, which includes codes from the ISO language and script standards.

Example

<titleInfo xml:lang="fr" type="translated">
  <nonSort>L'</nonSort>
  <title>homme qui voulut être roi</title>
</titleInfo>

citation ends

Assuming that we do not want to use both attributes next to each other and given that we've decided to open up the options of how languages could be encoded (i.e. not only IANA, but also the three variations of ISO 639 plus other language encodings), I'd be back at using an attribute of our own rather than going back to @xml:lang.

fordmadox commented 3 years ago

The TEI guidelines provide a great overview here about how they encode languages: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CH.html#CHSH (which stresses the following: "For maximal compatibility with existing processes, the identifier for the language must be constructed as in Best Current Practice 47")

As time goes on, I grow more convinced that it's better to keep the "xml" namespace in EAC for id, base, lang, and adding space, since I don't really see the need for EAC/D to ignore that convention (and to make it more difficult to share data). In the two examples from MODS, the first won't work, for instance, if I want to use something like the built-in "lang" function from XPath (https://www.w3.org/TR/xpath-functions-31/#func-lang) to determine the language, whereas the second one does.
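
To make the difference between the two MODS examples concrete, here is a minimal Python sketch that approximates what XPath's lang() does, using only the standard library. It is an emulation under stated assumptions, not the XPath implementation itself: `effective_lang` and `lang_matches` are hypothetical helper names, and the sample documents are shortened versions of the MODS snippets quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, target):
    """Return the nearest xml:lang value on target or its ancestors, else None."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parents.get(node)
    return None


def lang_matches(root, target, wanted):
    """Approximate XPath lang(): case-insensitive match on the tag or its prefix."""
    found = effective_lang(root, target)
    if found is None:
        return False
    found, wanted = found.lower(), wanted.lower()
    return found == wanted or found.startswith(wanted + "-")


# Shortened versions of the two MODS examples quoted above.
with_plain_lang = ET.fromstring(
    '<name><description lang="fre">Préfet de région</description></name>'
)
with_xml_lang = ET.fromstring(
    '<titleInfo xml:lang="fr"><title>homme qui voulut être roi</title></titleInfo>'
)

if __name__ == "__main__":
    # A plain @lang attribute is invisible to an xml:lang-based lookup.
    print(lang_matches(with_plain_lang, with_plain_lang.find("description"), "fre"))
    # xml:lang is inherited, so the child <title> is recognised as French.
    print(lang_matches(with_xml_lang, with_xml_lang.find("title"), "fr"))
```

The same limitation would apply to any custom attribute name: generic XML tooling only knows about the attribute in the XML namespace, so a local lookup has to be written for anything else.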

All that said, we've got languageOfElement and scriptOfElement in the development branch of EAC, which aligns it with the path taken by EAD.

kerstarno commented 3 years ago

Just as a note: "the path taken by EAD" only means not having introduced the XML namespace when defining EAD3. :-)

As for potentially going back on the decision with regard to XML namespace, this would mean:

To keep the XML namespace in EAC-CPF; and to add @xml:space;
To introduce the XML namespace in EAD (next version) with all four attributes;
To decide whether we want:
- To keep the newly introduced name @scriptOfElement;
- To revert back to using @script as in current EAD3 (I wouldn't use @scriptCode for providing the ISO code for the script named in the element <writingSystem> and for providing an ISO code relating to the content of any other element as EAC-CPF currently does);
- To skip the attribute for providing an ISO code relating to the writing system of the content of any element completely and to point users to the IANA registry option for @xml:lang which would enable them to encode language and script in one value if script information is seen as essential.

kerstarno commented 3 years ago

Tested as part of Schema Team's schema testing:

@xml:lang does not exist anymore in the draft schema
@languageOfElement is used with 86 out of 89 elements in the draft for EAC-CPF 2.0
- @SJagodzinski this adds <control> (#81), <cpfDescription> (#79), <eac> (#78), <identity> (#114), and <relations> (#210) to the 81 elements listed above in the solution documentation. Could you please confirm if these should or should not have the language attributes?
- In case these should have the language attributes, could you please confirm whether <multipleIdentities> (#80) should have them, too (see also below)? @xml:lang currently is available for <multipleIdentities> (same as the other elements mentioned) in EAC-CPF 1.0.
- In case these should not have the language attributes, could you please confirm whether <description> (#138) should then still retain the language attributes as the only high-level wrapper element?
The three elements that do not have @languageOfElement are:
- <multipleIdentities> - to be clarified (see above)
- <entityType> - as it does not have text
- <objectXMLWrap> - as its sub-elements are not from the EAS namespace
The attribute's data type is NMTOKEN
@languageOfElement is always available alongside @scriptOfElement (#152)

The above applies to both schemas, RNG and XSD.
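
Because NMTOKEN only constrains which characters may appear, it accepts ISO 639 codes and IETF language tags alike. A minimal sketch of that point, assuming an ASCII-only approximation of the XML NameChar production (the real production also allows many non-ASCII characters); `is_ascii_nmtoken` is a hypothetical helper and not part of any schema tooling:

```python
import re

# ASCII-only approximation of NMTOKEN: one or more name characters.
# The full XML definition additionally permits many non-ASCII characters.
ASCII_NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")


def is_ascii_nmtoken(value: str) -> bool:
    """Return True if value is a non-empty run of ASCII name characters."""
    return ASCII_NMTOKEN.fullmatch(value) is not None


if __name__ == "__main__":
    # ISO 639-1, ISO 639-2b and an IETF-style tag, plus a value with a space.
    for candidate in ["de", "ger", "de-Latn-DE", "en US"]:
        print(candidate, is_ascii_nmtoken(candidate))
```

In other words, the data type itself does not enforce any particular code list; that choice is left to the recommendation and to the declaration in <control>.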

SJagodzinski commented 3 years ago

@fordmadox, @kerstarno: Please keep the lang attributes as they are: not available in <multipleIdentities>, <entityType> and <objectXMLWrap>

List will be completed

kerstarno commented 3 years ago

@SJagodzinski thanks for the confirmation.

With this, the attribute is ready.

@fordmadox please take note of <multipleIdentities> not having language attribution in EAC-CPF 2.0 anymore, i.e. we will need to think about a transformation strategy in this case.
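
One possible shape for such a transformation, sketched in Python purely for illustration: it renames @xml:lang to @languageOfElement and reports, rather than silently discards, values found on the three elements that no longer carry the attribute. This is not an agreed strategy. The function names are hypothetical, element namespaces are ignored for brevity, and what should happen to the reported values is left open.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Elements named in this thread as not having @languageOfElement.
NO_LANGUAGE_ATTRIBUTE = {"multipleIdentities", "entityType", "objectXMLWrap"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def migrate_language_attributes(element, dropped=None):
    """Rename xml:lang to languageOfElement in place; return dropped values."""
    if dropped is None:
        dropped = []
    name = local_name(element.tag)
    value = element.attrib.pop(XML_LANG, None)
    if value is not None:
        if name in NO_LANGUAGE_ATTRIBUTE:
            # Assumption: collect for manual review instead of guessing a target.
            dropped.append((name, value))
        else:
            element.set("languageOfElement", value)
    if name == "objectXMLWrap":
        # Content of objectXMLWrap comes from other namespaces; leave it alone.
        return dropped
    for child in element:
        migrate_language_attributes(child, dropped)
    return dropped


if __name__ == "__main__":
    sample = ET.fromstring(
        '<eac-cpf><multipleIdentities xml:lang="en">'
        '<cpfDescription xml:lang="de"/></multipleIdentities></eac-cpf>'
    )
    print(migrate_language_attributes(sample))
    print(ET.tostring(sample, encoding="unicode"))
```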

SJagodzinski commented 3 years ago

Recommendation of IETF language tags needs to be discussed, also with respect to feedback from the CfC.

SJagodzinski commented 2 years ago

Asked community about use of IETF language tags in @languageOfElement (which replaces @xml:lang) in call for comments and did not receive any feedback.

EAC-CPF team meeting, 8 Aug 2021:

Agreed to recommend the use of IETF language tags in @languageOfElement, create entry in Best Practice Guide for this. EAD team will follow EAC-CPF decision.

SAA-SDT / eac-cpf-schema

@xml:lang #151
