adlnet / xAPI-Spec

The xAPI Specification describes communication about learner activity and experiences between technologies.
https://adlnet.gov/projects/xapi/

2.4.6 - Language String Clarification #921

Closed jhaag75 closed 8 years ago

jhaag75 commented 8 years ago

In Section 2.4.6:

Property: language
Type: String (as defined in RFC 5646)
Description: Code representing the language in which the experience being recorded in this Statement (mainly) occurred in, if applicable and known.
Required: Optional

It refers to RFC 5646 for language codes, but there are MANY options. On a recent call with the Video CoP, they were trying to find out how the language code should be stored in the context property.

http://tools.ietf.org/html/rfc5646

For example should it be

en?

en_US

en-us?

Which types of subtags should be used (if any)? Rather than expecting readers of the xAPI spec to open RFC 5646 and read it before getting an answer, can't we just specify which of the optional language and region subtags should be used? The Video CoP requested that the spec be updated to clarify this.

garemoko commented 8 years ago

As a general principle when populating properties (not just for language maps) you should be as specific as possible but no more. So if you know it is specifically US English being used then use en-US; if you know it's English then use en; if you really don't know the language at all then use und.

As for en_US vs en-us vs en-US, all the examples in the spec (and beyond) use en-US.

No objections at all to making things clearer.
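
The "as specific as possible but no more" rule can be sketched in a few lines of Python. This is only an illustration, not something from the spec: the helper name choose_language_tag and its parameters are hypothetical.

```python
from typing import Optional


def choose_language_tag(language: Optional[str] = None,
                        region: Optional[str] = None) -> str:
    """Return the most specific tag the known facts support (hypothetical helper)."""
    if not language:
        # Language unknown: "und" is the code for "undetermined".
        return "und"
    if region:
        # Language and region both known, e.g. "en" + "US" -> "en-US".
        return f"{language.lower()}-{region.upper()}"
    # Only the language is known: do not guess a region.
    return language.lower()
```

Called with "en" and "US" this yields en-US, with only "en" it yields en, and with nothing known it falls back to und.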

fugu13 commented 8 years ago

Starting with the easy parts:

Underscores are not legal in RFC 5646 values. en-US and en-us are both fine (in terms of format).
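
A rough sketch of how a tool might tidy up incoming values, assuming you want to repair underscores rather than reject them (that repair step is my assumption, not something RFC 5646 asks for). The helper name normalize_language_tag is hypothetical. The casing follows the convention recommended in RFC 5646 (lowercase language, title-case four-letter script, uppercase two-letter region), and the sketch ignores the special handling of subtags after a singleton such as "x".

```python
def normalize_language_tag(raw: str) -> str:
    """Repair underscores and apply conventional casing (hypothetical helper)."""
    subtags = raw.strip().replace("_", "-").split("-")
    result = []
    for index, subtag in enumerate(subtags):
        if index == 0:
            result.append(subtag.lower())       # primary language: "en"
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())       # two-letter region: "US"
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.capitalize())  # four-letter script: "Hans"
        else:
            result.append(subtag.lower())       # anything else: "cmn"
    return "-".join(result)
```

Since tags are compared case-insensitively, the casing step is cosmetic; only the underscore replacement changes whether the value is well formed.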

Use any subtags that make sense for your situation. Generally speaking, for the context value, I'd recommend the language subtag plus a country-level subtag (the two-letter country code), if you know the language+country for the system. und is the language tag for "undetermined" if you don't know either, but in context I'd just leave it out instead.

For the keys in other areas, I recommend, if the term is widely used across most regions using a language, just using the top level language tag, such as en. If the term varies a bunch (course, degree program, etc), use top level language tag plus country/region tag.
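
To illustrate why a bare en key is useful for widely shared terms, here is a sketch of a reader that falls back from a regional tag to the bare language. It is loosely modeled on the truncation-style lookup described in RFC 4647; the xAPI spec does not mandate this behavior, and the helper name lookup_display and the sample maps are hypothetical.

```python
from typing import Dict, Optional


def lookup_display(language_map: Dict[str, str], wanted: str) -> Optional[str]:
    """Find the entry for `wanted`, dropping trailing subtags until one matches."""
    # Compare case-insensitively, since "en-US" and "en-us" are the same tag.
    lowered = {key.lower(): value for key, value in language_map.items()}
    subtags = wanted.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in lowered:
            return lowered[candidate]
        subtags.pop()  # e.g. "en-gb" -> "en"
    return None


# Hypothetical sample data: one term shared across regions, one that varies.
widely_used = {"en": "video"}
regional = {"en-US": "semester", "en-GB": "term"}
```

With these maps, a reader asking for en-GB finds "video" through the en key in the first map and "term" directly in the second.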

People who do a lot of l10n/i18n work will know when to think about doing other things. Some of the advice above falls down in, for example, China, where you need two subtags just to get to the dialect (read: language), so you'll almost always want something like zh-cmn (Mandarin) or zh-yue (Cantonese) or similar to start with, and go from there.

This advice isn't in the RFC 5646 spec, which isn't for giving advice, so I definitely don't expect people to read the RFC 5646 spec, I expect them to google it ;). There will not be a universal answer, but a few pointers in descriptive language make sense.

There's some good advice in http://eidr.org/documents/EIDR_Language_Code_Best_Practice_v1.6.pdf we might steal parts of, such as

Where possible, use the shortest valid language code for a given situation. If only the primary language family (or macro-language) is known for certain, do not guess at the possible extended language sub-tags or region codes.

They also have a "common languages" list for easy lookup in many common situations.

jhaag75 commented 8 years ago

Thanks Andrew



jhaag75 commented 8 years ago

Thanks Russell. I like the notion of stealing/reusing, "Where possible, use the shortest valid language code for a given situation. If only the primary language family (or macro-language) is known for certain, do not guess at the possible extended language sub-tags or region codes."



garemoko commented 8 years ago

Seems like this is ready for a PR. Anybody want to have a stab before 1.0.3 closes?

jhaag75 commented 8 years ago

Since I opened it, I can try to take a stab later today based on your suggestions...will send something by CoB. Stuck in a meeting until 4PM. Would like to send a draft to this issue ticket first before submitting the PR though.



jhaag75 commented 8 years ago

How about this?

Using RFC 5646, it is possible to construct the same language code in different ways, though choosing the shortest valid language code for a given situation is generally preferred. The language tag plus a country level sub tag (the two letter country code) allows for the designation of basic languages (e.g., "es" for Spanish) and regional dialects (e.g., "es-MX", the dialect of Spanish spoken in Mexico). If only the primary language family (or macro-language) is known for certain, do not guess at the possible extended language sub-tags or region codes. In other words, if only the primary language is known (e.g., English), then use the top level language tag ("en"). If the term varies depending upon the region, use the top level language tag plus the country/region tag.



fugu13 commented 8 years ago

I think we're starting to get there. A few comments:

"it is possible to construct the same language code in different ways"

should be struck; different language tags are pretty much always different in meaning.

With that gone, probably just rephrase to: "The shortest valid language code..."

"The language tag"

The entire thing is a language tag; the first bit is an ISO 639 language code (the shortest one for a given language). A list is here: https://www.loc.gov/standards/iso639-2/php/code_list.php

We should probably avoid the term dialect; many countries have many dialects within them, even if some usages are typical for the country. We can also link to a complete list of country codes ( https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 seems to be the easiest to browse).

The source material uses them, but we should probably not use "primary language family" or "macro-language" here; they introduce new terminology people won't be familiar with. Maybe just say "ISO 639 language code" since we'll have introduced that?

The rest of the advice is good. We should add a short segment on how, for Chinese languages, the ISO 639 language code alone is generally insufficient.

I think it's useful enough that we should link to the doc I linked before, and say that there's a list of common language tags in there that can be referred to.

garemoko commented 8 years ago

RE "We should add a short segment on [..] Chinese languages"

Do we want to start on the road of advice for specific languages? I'd rather leave that out.

fugu13 commented 8 years ago

@garemoko the reason to add a Chinese language section is they don't follow the guidelines. On the table of common language tags in http://eidr.org/documents/EIDR_Language_Code_Best_Practice_v1.6.pdf , every language tag outside Chinese is the ISO 639 language code, but the Chinese languages are zh-gan, zh-cmn, etc. That is, you should basically never use bare zh as a language tag, because the linguistic diversity it contains is a lot larger than the linguistic diversity inside, say, Kazakh (language tag kk).
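
If a tool wanted to flag this case, a check could be as small as the sketch below. The names TOO_BROAD_ALONE and is_too_broad are hypothetical, and the set contains only zh because that is the only code singled out in this thread.

```python
# Codes that this sketch treats as too broad when used with no further subtag.
# Only "zh" comes from the discussion; the set itself is an illustrative assumption.
TOO_BROAD_ALONE = {"zh"}


def is_too_broad(tag: str) -> bool:
    """True when the tag is a bare code that should carry a further subtag."""
    return tag.lower() in TOO_BROAD_ALONE
```

Here zh would be flagged, while zh-cmn and kk would pass.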

liveaspankaj commented 8 years ago

The following note in the above document doesn't suggest that Chinese is the only definitive exception: NOTE: If it is important to further clarify the particular language spoken (as is often the case with the macro-language Chinese, “zh”), then add a suitable sub-tag to create a compound language tag (“zh-cmn” for Mandarin Chinese).

garemoko commented 8 years ago

I'd be ok with that, but I think @fugu13's point is that it literally just applies to Chinese.

I fixed the typo.

fugu13 commented 8 years ago

Yeah, there aren't any other ISO-639 language codes that are really a bunch of very different languages in that way.

garemoko commented 8 years ago

This can be closed now; #930 has been merged.