alpheios-project / alpheios-core

Alpheios Core Javascript Packages and Libraries

Support a full range of ISO 639 codes in the Language class #597

Open kirlat opened 3 years ago

kirlat commented 3 years ago

The Language class should support codes in both ISO 639-2 and ISO 639-3 formats.

The initial requirements, from @balmas:

Since we are now using the Language class to encapsulate languages other than those that initiated its creation (e.g. for a Definition, we have the language of the definition's lemma and the language of the definition's text), we should explicitly support ISO 639-2 codes as well as ISO 639-3. For Alpheios-supported languages we normalize to ISO 639-3, but for other languages we should support either, and we could validate against the standard lists of 639-3 and 639-2 codes.

kirlat commented 3 years ago

Also, from @balmas:

I don't actually know if we should normalize 'en' to 'eng'. Was this added for the definition languages?

I think we still have more cleanup to do here with the language codes of the Alpheios-supported languages: right now this list has to be updated in too many places when we add support for a new source language (see https://github.com/alpheios-project/documentation/blob/master/development/adding_a_language.md, which is almost certainly out of date). I'm not sure what the best solution is, and I don't think it has to be solved with this PR, but we should keep sight of it.

kirlat commented 3 years ago

As agreed at the check-in, we will use ISO 639-2 codes as the standard within our application. The Language class should be able to store language codes in any of the ISO 639-1, 639-2, or 639-3 formats, because third parties might supply language codes in various formats. It should be able to return a language code in the ISO 639-2 format regardless of the format in which the code is stored internally. It should also compare language codes in different formats correctly; for that, it has to be able to convert between ISO 639-1, 639-2, and 639-3 codes internally.
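To make the requirements above concrete, here is a minimal sketch of how they could fit together. It is not the existing Language class: the name `LanguageSketch`, the `toIso6392()` and `equals()` methods, and the five-row mapping table are all assumptions made for illustration. A real implementation would need the full ISO 639 tables and would have to decide how to treat codes it cannot map (this sketch simply throws).

```javascript
// Hypothetical, deliberately tiny mapping table: [ISO 639-1, ISO 639-2, ISO 639-3].
// A real implementation would load the complete ISO 639 code tables.
const ISO_639_ROWS = [
  ['la', 'lat', 'lat'],
  ['el', 'gre', 'ell'], // Modern Greek: ISO 639-2/B 'gre' differs from ISO 639-3 'ell'
  [null, 'grc', 'grc'], // Ancient Greek has no ISO 639-1 code
  ['en', 'eng', 'eng'],
  ['ar', 'ara', 'ara']
]

// Index every known code (in any of the three formats) to its row.
const ROW_BY_CODE = new Map()
for (const row of ISO_639_ROWS) {
  for (const code of row) {
    if (code) ROW_BY_CODE.set(code, row)
  }
}

// Hypothetical class name; not the actual alpheios-core Language class.
class LanguageSketch {
  constructor (code) {
    const normalized = String(code).trim().toLowerCase()
    const row = ROW_BY_CODE.get(normalized)
    if (!row) throw new Error(`Unknown language code: ${code}`)
    this.sourceCode = normalized // the code as supplied, in whatever format it arrived
    this._row = row
  }

  // Always answer in ISO 639-2, regardless of the format that was stored.
  toIso6392 () {
    return this._row[1]
  }

  // Compare by canonical ISO 639-2 code so that 'en' and 'eng' are equal.
  equals (other) {
    return this.toIso6392() === other.toIso6392()
  }
}

const a = new LanguageSketch('en')
const b = new LanguageSketch('eng')
console.log(a.toIso6392()) // 'eng'
console.log(a.equals(b)) // true
```

Storing the supplied code untouched and converting only on output keeps the original information available, which may matter if the normalization rules change later.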

Is anything missing from the summary above? Are any corrections required?

irina060981 commented 3 years ago

I believe we should also identify the places where we use a language as a string, and later update them to use the Language class. Some places that I know of:

balmas commented 3 years ago

A bit more background:

IETF RFC 4646 (https://www.ietf.org/rfc/rfc4646.txt) specifies use of a 2-character code from ISO 639-1 when it exists; when a language does not have a 2-character code assigned, the 3-character code from ISO 639-2 is used.

Alpheios has traditionally used the ISO 639-2 3-character code as the standard code for any Alpheios-supported languages.
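As a rough illustration of that rule, the sketch below picks the 2-character code when one exists and falls back to the 3-character code otherwise. The function name `primaryLanguageSubtag` and the three sample rows are assumptions made for this example; they are not part of any Alpheios API, and a real lookup would use the complete code tables.

```javascript
// Hypothetical sample rows: iso1 is the ISO 639-1 code (or null), iso2 the ISO 639-2 code.
const SAMPLE_CODES = [
  { iso1: 'la', iso2: 'lat' },
  { iso1: 'en', iso2: 'eng' },
  { iso1: null, iso2: 'grc' } // Ancient Greek has no 2-character code
]

// Returns the code RFC 4646 would prefer for a known language:
// the 2-character code if assigned, otherwise the 3-character code.
// Returns undefined for codes that are not in the sample table.
function primaryLanguageSubtag (code) {
  const wanted = String(code).trim().toLowerCase()
  const entry = SAMPLE_CODES.find(e => e.iso1 === wanted || e.iso2 === wanted)
  if (!entry) return undefined
  return entry.iso1 || entry.iso2
}

console.log(primaryLanguageSubtag('lat')) // 'la'
console.log(primaryLanguageSubtag('grc')) // 'grc'
```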

I think it makes sense to continue to use the ISO 639-2 3-character code internally as our standard, but we should be able to interpret and map from the other variants.

kirlat commented 3 years ago

Thanks for the reference to the document, that's very interesting! I hope the ability to map between the different variants of ISO 639 (which I believe exists) will make us flexible enough to satisfy all possible use cases.