eric-muller / udhr

Universal Declaration of Human Rights

Language tags for Kurdish #4

Closed brawer closed 6 years ago

brawer commented 6 years ago

Currently, the Unicode UDHR project uses different BCP47 tags for Kurdish than what would be the result of running a language tag normalization algorithm such as the one implemented by ICU. To fix this, I'd like to propose the following three changes:

  1. in the UDHR in Central Kurdish, change xml:lang="ku" to xml:lang="ckb-Latn";

  2. in the UDHR in Northern Kurdish, change xml:lang="kmr" to xml:lang="ku";

  3. on the translations overview page, change the BCP47 column for kmr_arab (which has no XML file yet in the UDHR project) from kmr-Arab to ku-Arab.

For Kurdish, CLDR contains the following entry in supplementalMetadata.xml, which means that the ku language tag stands for Northern Kurdish:

<languageAlias type="kmr" replacement="ku" reason="macrolanguage"/>
<!--  Northern Kurdish ⇒ Kurdish  -->

And in likelySubtags.xml, CLDR contains the following entries, which means that Latin is the default writing system for ku. This matches what we already have in the UDHR project, so just quoting this for reference.

<likelySubtag from="ckb" to="ckb_Arab_IQ"/>
<!-- { Central Kurdish; ?; ? } => { Central Kurdish; Arabic; Iraq } -->

<likelySubtag from="ku" to="ku_Latn_TR"/>
<!-- { Kurdish; ?; ? } => { Kurdish; Latin; Turkey } -->

<likelySubtag from="ku_LB" to="ku_Arab_LB"/>
<!-- { Kurdish; ?; Lebanon } => { Kurdish; Arabic; Lebanon } -->

<likelySubtag from="ku_Arab" to="ku_Arab_IQ"/>
<!-- { Kurdish; Arabic; ? } => { Kurdish; Arabic; Iraq } -->
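
As a rough illustration of how the two CLDR tables quoted above interact, here is a minimal Python sketch. The dictionaries contain only the entries quoted in this comment, the helper names `canonicalize` and `add_likely_subtags` are hypothetical, and the lookup is an exact match only; the real CLDR algorithms have more rules and fallback steps.

```python
# Only the CLDR entries quoted above; the real data files are much larger.
LANGUAGE_ALIASES = {"kmr": "ku"}  # from supplementalMetadata.xml
LIKELY_SUBTAGS = {                # from likelySubtags.xml
    "ckb": "ckb_Arab_IQ",
    "ku": "ku_Latn_TR",
    "ku_LB": "ku_Arab_LB",
    "ku_Arab": "ku_Arab_IQ",
}


def canonicalize(tag: str) -> str:
    """Replace an aliased primary language subtag (hypothetical helper)."""
    parts = tag.replace("-", "_").split("_")
    parts[0] = LANGUAGE_ALIASES.get(parts[0], parts[0])
    return "_".join(parts)


def add_likely_subtags(tag: str) -> str:
    """Canonicalize, then look the result up in the quoted table.

    Simplified: exact match only; unknown tags are returned unchanged.
    """
    key = canonicalize(tag)
    return LIKELY_SUBTAGS.get(key, key)
```

Under these assumptions, `kmr` canonicalizes to `ku` and then expands to Latin script, while `kmr-Arab` becomes `ku_Arab` and expands to Arabic script in Iraq.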

eric-muller commented 6 years ago

Codes in [] are the UDHR in Unicode keys.

Changing bcp-47 for [ckb] from "ku" to "ckb-Latn": ok, will do.

Changing bcp-47 of [kmr] from "kmr" to "ku" (and similarly for [kmr_arab]): It seems to me that bcp-47 prefers a specific language over a macrolanguage. That Unicode Language Identifiers prefers something else is fine, but it's not bcp-47 any longer. Is it correct that for the purposes of this project, the difference between the bcp47 tags and the uli tags is only the use of the macrolanguage code in uli for the "dominant" language of the group? If so, it seems that the conversion between the two is lossless and mechanical, and that we only need to record one of the two.

macchiati commented 6 years ago

The Unicode Language Identifiers are still valid bcp47 language tags; they just prefer choice of tag according to industry practice. For example, very rarely is "arb" used for Arabic.

The downside of using "non-shortest form" is that implementations that don't canonicalize the UDHR language tags will get mismatches.

eric-muller commented 6 years ago

The Unicode Language Identifiers are still valid bcp47 language tags

Yes, but the same string can mean something different in the two systems. I don't think it's wise to put uli semantics under the label bcp47 and conversely.

My question stands: for this project (where we choose the identifiers and don't have to deal with legacy), is it possible to convert mechanically and without change of meaning between the two?

macchiati commented 6 years ago

The canonical form of the Unicode language identifiers will use zh, ar, etc. instead of cmn, arb, etc. There is mapping data to go from the non-canonical form to the canonical form. As long as you don't distinguish zh and cmn, etc. then implementations should be ok. If you do have a data file for cmn that is different than one for zh, then breakage would ensue.

The raw bcp47 semantics for codes like zh are just underspecified compared to CLDR's (and frankly, better handled by codes like zhx).

Suggest two options:

  1. Switch to using the Unicode language identifiers. OR
  2. Add documentation that: UDHR never distinguishes two languages by codes that would be canonically equivalent Unicode language identifiers (such as cmn and zh). Implementations that use ULI can canonicalize the UDHR codes to prevent problems interpreting the data.
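
The invariant in option 2 can be checked mechanically. Below is a minimal Python sketch, assuming a hypothetical alias table that contains only pairs mentioned in this thread (cmn/zh, arb/ar, kmr/ku); a real check would load the full alias data from CLDR's supplementalMetadata.xml. The function names are illustrative, not part of any library.

```python
from collections import defaultdict

# Assumed subset of alias data; only pairs mentioned in this thread.
ALIASES = {"cmn": "zh", "arb": "ar", "kmr": "ku"}


def canonical(tag: str) -> str:
    """Canonicalize only the primary language subtag (simplified)."""
    lang, sep, rest = tag.partition("-")
    return ALIASES.get(lang, lang) + sep + rest


def find_collisions(tags):
    """Return groups of distinct tags that share one canonical form."""
    groups = defaultdict(set)
    for tag in tags:
        groups[canonical(tag)].add(tag)
    return {
        canon: sorted(members)
        for canon, members in groups.items()
        if len(members) > 1
    }
```

An empty result from `find_collisions` over all tags in the index would mean the documented guarantee holds for the alias data used.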

eric-muller commented 6 years ago

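Here is what I propose:

  • fix a few iso639-3, to avoid macro languages. However, there is still at least one (que for [qud])
  • fix bcp47, to account for recently introduced tags, and avoid macro languages
  • introduce a uli attribute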

The details of the changes are below, and the resulting index.xml is attached. index.txt

What do you think should be the xml:lang/lang attribute in HTML files? On the one hand, HTML5 specifically says those are BCP47 tags; on the other hand, it looks like in practice they are ULI tags, and it comes down to a choice between being formally correct vs. being useful.

Fix iso639-3 of a few translations

Fix bcp47 of a few translations:

  • from 'sq' to 'als'
  • from 'ar' to 'arb'
  • from 'ay' to 'ayr'
  • from 'az-Cyrl' to 'azj-Cyrl'
  • from 'az-Latn' to 'azj-Latn'
  • from 'bik' to 'bcl'
  • from 'za' to 'zyb'
  • from 'ku' to 'ckb-Latn'
  • from 'zh-Hans' to 'cmn-Hans'
  • from 'zh-Hant' to 'cmn-Hant'
  • from 'eml' to 'rgn'
  • from 'et' to 'ekk'
  • from 'om' to 'gaz'
  • from 'kpe' to 'gkp'
  • from 'gn' to 'gug'
  • from 'iu' to 'ike'
  • from 'mn-Cyrl' to 'khk-Cyrl'
  • from 'mn-Mong' to 'khk-Mong'
  • from 'kmr' to 'kmr-Latn'
  • from 'kg' to 'kng'
  • from 'kg' to 'kng-AO'
  • from 'lv' to 'lvs'
  • from 'nah' to 'nhn'
  • from 'oj' to 'ojb'
  • from 'ps' to 'pbu'
  • from 'fa' to 'pes'
  • from 'fa-AF' to 'prs'
  • from 'mg' to 'plt'
  • from 'rom' to 'rmn'
  • from 'rom' to 'rmn'
  • from 'sc' to 'src'
  • from 'sw' to 'swh'
  • from 'yi' to 'ydd'

Finally introduce a new attribute uli, which is for the most part equal to bcp47, but differs on those:

  • bcp47=tw-akuapem uli=ak-akuapem
  • bcp47=tw-asante uli=ak-asante
  • bcp47=fat uli=ak
  • bcp47=arb uli=ar
  • bcp47=ayr uli=ay
  • bcp47=bcl uli=bik
  • bcp47=zyb uli=za
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hans uli=zh-Hans
  • bcp47=cmn-Hant uli=zh-Hant
  • bcp47=emk uli=man
  • bcp47=ekk uli=et
  • bcp47=gaz uli=om
  • bcp47=gug uli=gn
  • bcp47=hea uli=hmn
  • bcp47=ike uli=iu
  • bcp47=khk-Cyrl uli=mn-Cyrl
  • bcp47=khk-Mong uli=mn-Mong
  • bcp47=kmr-Latn uli=ku
  • bcp47=kmr-Arab uli=ku-Arab
  • bcp47=knc uli=kr
  • bcp47=kng uli=kg
  • bcp47=kng-AO uli=kg-AO
  • bcp47=lvs uli=lv
  • bcp47=npi uli=ne
  • bcp47=ory uli=or
  • bcp47=pbu uli=ps
  • bcp47=pes uli=fa
  • bcp47=prs uli=fa-AF
  • bcp47=plt uli=mg
  • bcp47=pnb uli=lah
  • bcp47=quz uli=qu
  • bcp47=mup uli=raj
  • bcp47=src uli=sc
  • bcp47=swh uli=sw
  • bcp47=uzn-Cyrl uli=uz-Cyrl
  • bcp47=uzn-Latn uli=uz-Latn
  • bcp47=ydd uli=yi
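
Whether the conversion between the two attributes is lossless can be tested directly: it is lossless exactly when the pairs form a one-to-one mapping. The following Python sketch uses a handful of the pairs listed above (not the full list); `build_round_trip` is a hypothetical helper, not code from the UDHR project.

```python
# A few of the bcp47/uli pairs listed above (not the full list).
PAIRS = [
    ("arb", "ar"),
    ("cmn-Hans", "zh-Hans"),
    ("cmn-Hant", "zh-Hant"),
    ("kmr-Latn", "ku"),
    ("kmr-Arab", "ku-Arab"),
    ("fat", "ak"),
    ("tw-akuapem", "ak-akuapem"),
]


def build_round_trip(pairs):
    """Build both directions; raise if either direction is ambiguous."""
    to_uli, to_bcp47 = {}, {}
    for bcp47, uli in pairs:
        # setdefault keeps the first value, so a repeated identical pair is fine.
        if to_uli.setdefault(bcp47, uli) != uli:
            raise ValueError(f"{bcp47} maps to two uli tags")
        if to_bcp47.setdefault(uli, bcp47) != bcp47:
            raise ValueError(f"{uli} maps back to two bcp47 tags")
    return to_uli, to_bcp47


to_uli, to_bcp47 = build_round_trip(PAIRS)
```

If two different bcp47 tags were ever assigned the same uli tag, `build_round_trip` would raise, signalling that one attribute can no longer be derived from the other.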

markusicu commented 6 years ago

On Sat, Oct 14, 2017 at 6:41 AM, Eric Muller notifications@github.com wrote:

Here is what I propose:

  • fix a few iso639-3, to avoid macro languages. However, there is still at least one (que for [qud])
  • fix bcp47, to account for recently introduced tags, and avoid macro languages
  • introduce a uli attribute

The details of the changes are below, and the resulting index.xml is attached.

What do you think should be the xml:lang/lang attribute in HTML files? On the one hand, HTML5 specifically says those are BCP47 tags; on the other hand, it looks like in practice they are ULI tags, and it comes down to a choice between being formally correct vs. being useful.

When you write ULI, you mean Unicode Language Identifiers defined by CLDR, right? Not ULI = http://uli.unicode.org/ ...

Unicode language IDs are BCP 47, with very minor extensions that should not matter here, and I think we should use the guidance established by CLDR.

markus

eric-muller commented 6 years ago

Yes, Unicode Locale Identifiers.

No, they are not the same. zh in BCP 47 means "one of a number of Chinese languages, including Mandarin, Cantonese, etc.", while in ULI it means just Mandarin. At least, that's what I conclude when ULI tells me not to use cmn and to use zh instead.

markusicu commented 6 years ago

On Sat, Oct 14, 2017 at 5:44 PM, Eric Muller notifications@github.com wrote:

Yes, Unicode Locale Identifiers.

Please avoid using "ULI" for them, to avoid confusion with the ULI project :-)

No they are not the same. zh in BCP 47 means "one of a number of Chinese languages, including mandarin, cantonese, etc.", while in ULI it means just mandarin. At least, that what I conclude when ULI tells me not to use cmn and to use zh instead.

I believe that BCP 47 says no such thing. I am confident we can claim that common industry practice in tagging with BCP 47 language tags is to use "zh" to mean "Chinese" and imply Mandarin in simplified characters, if nothing else is specified.

markus

macchiati commented 6 years ago

It would be extremely misleading to contrast 'bcp47' with 'uli', especially for this case.

In ISO 639-3, in theory a file named 'ar.xml' using bcp47 could contain "auz" data (Uzbeki Arabic, population 700, acc. to SIL), or any other of the two dozen languages encompassed in 'ar' = 'ara' (see http://www-01.sil.org/iso639-3/macrolanguages.asp). It could also contain "arb" data.

But of course, that would massively break compatibility. There is provision for this in BCP47:

o Each encompassed language's subtag SHOULD be used as the primary language subtag. For example, a document in Mandarin Chinese would be tagged "cmn" (the subtag for Mandarin Chinese) in preference to "zh" (Chinese).

o If compatibility is desired or needed, the encompassed subtag MAY be used as an extended language subtag. For example, a document in Mandarin Chinese could be tagged "zh-cmn" instead of either "cmn" or "zh".

It was and continues to be predominant IT industry practice to follow #2 above, using 'ar' for Standard Arabic, 'zh' for Standard Chinese, ... Because, well, we care deeply about compatibility.

That is the policy we follow with Unicode language identifiers. There are systems that use 'cmn' over 'zh', and 'arb' over 'ar' but they are a minority. I'm guessing mostly bibliographic systems.

I have no objection to having two different labels, but they are both bcp47, so you can't use the term 'bcp47' contrastively. Maybe something like:

bcp47 (uli) = Unicode language identifiers, compatibility with IT systems a concern.
bcp47 (broad) = Non-ULI, compatibility with IT systems not a concern

====

Side note, and personal opinion: I suspect that the 'ar' ≠ Standard Arabic interpretation arose in the bibliographic area, with people feeling 'forced' to pick the closest available tag. And the macrolanguage construct was an attempt to accommodate both that and IT usage. In hindsight, I think a better solution would have been to define separate collection codes like the following, instead of (from an IT perspective) redefining some core language codes to effectively be collection codes. (A bit like our attempt in Unicode to have unambiguous Line and Paragraph separators; well-intentioned, but too problematic for compatibility to be successful.)

Type: language
Subtag: zhx
Description: Chinese (family)

Mark https://twitter.com/mark_e_davis

eric-muller commented 6 years ago

Fixed as requested, as well as a few other tags along the same lines.

The iso639-3 attribute avoids macrolanguages as much as possible, and can be used to resolve macrolanguages in BCP 47 tags.

kargaranamir commented 1 year ago

Reference

As far as I know, ckb has not been written in Latin script, even historically.

These 2 files are identical: