language codes - Githubissues

MortenHofft commented 4 years ago

We are currently using 3-letter language codes. That is not enough to describe all the languages we would like to support/describe. An example is zh-TW (Traditional Chinese, as used in Taiwan). We already have the website translated into Traditional Chinese, and we do not want to lose this option.

So we need a new enumeration for languages (existing is here). It seems natural to look to Crowdin as they make a living from translations.

marcos-lg commented 4 years ago

New enum is here: http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage

Java code here: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/TranslationLanguage.java

It's deployed in DEV

MattBlissett commented 4 years ago

What's this for? (Meaning the API change more than the vocabulary using it.)

We have language codes for interpreting the language of a vernacular name, which will include minority and dead languages.

I don't know if that's the same thing as the languages we translate the portal / registry into.

marcos-lg commented 4 years ago

@MortenHofft requested it to differentiate between language variants such as the Chinese ones (our current Language enum doesn't support that). The vocabularies will also be used to, for example, populate dropdowns in the UI, and the UI uses these variants.

The only reason I put it in gbif-api is for consistency for front-end developers to have this enum in the same endpoint as the others (http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage). Should I move it?

MattBlissett commented 4 years ago

I'm not sure if we should add a second language vocabulary to the v1 API. We already have one, and should consider how it might be extended.

It seems a bit arbitrary to choose Crowdin's list of supported languages. There's a mixture of two and three letter codes, a few without countries, and stuff like Upside Down English and "Quenya" which is Lord of the Rings Elvish.

We'll support these APIs and vocabularies/enumerations for years, so it's worth spending the time to get it right.

@timrobertson100, @mdoering, what do you think?

mdoering commented 4 years ago

Yes, I feel similarly. There has been a prominent open issue in the GBIF API for some time about extending the existing but limited language enumeration: https://github.com/gbif/gbif-api/issues/29

For CoL we need to support a wide array of languages for vernacular names. We decided to drop the GBIF language enum and instead go with a large list of 3-letter ISO codes (>8000) taken from https://iso639-3.sil.org/code_tables/download_tables. These do not fit into an enum anymore.

This does not solve the problem with Simplified and Traditional Chinese though. These are seen as the same language but using different scripts. So you need a locale to distinguish them.
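To illustrate the point about needing a locale, here is a minimal sketch using the JDK's own java.util.Locale, which parses IETF language tags. It is only an illustration with the standard library, not something the GBIF API does; the class and method names are made up for the example.

```java
import java.util.Locale;

class ChineseLocales {
  /** True when two tags share a language code but differ in region. */
  static boolean sameLanguageDifferentRegion(String tagA, String tagB) {
    Locale a = Locale.forLanguageTag(tagA);
    Locale b = Locale.forLanguageTag(tagB);
    return a.getLanguage().equals(b.getLanguage())
        && !a.getCountry().equals(b.getCountry());
  }

  public static void main(String[] args) {
    Locale taiwan = Locale.forLanguageTag("zh-TW");
    Locale mainland = Locale.forLanguageTag("zh-CN");
    // Both locales report the same language; only the region tells them apart.
    System.out.println(taiwan.getLanguage() + " / " + taiwan.getCountry());
    System.out.println(mainland.getLanguage() + " / " + mainland.getCountry());
    System.out.println(sameLanguageDifferentRegion("zh-TW", "zh-CN"));
  }
}
```

A plain language code, whether two or three letters, collapses both variants into one value, which is why the region (or script) has to travel with it.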

marcos-lg commented 4 years ago

I think it wouldn't be so easy to extend the current Language enum to accommodate locales: we'd have to change the default serialization to use the locale instead of the 3-letter code we use now, and that would break code that relies on it.

One solution could be to rename this new enum to Locale and not store the ISO 639-1 and 639-3 codes. Actually, this enum could use the current Language and Country enums, so we only support the languages and countries available in those enums (we could add more languages if needed). So I mean this:

Locale(Language lang, Country country) {}

and this Locale will be serialized as something like Language.getIso2LetterCode-Country.get2LetterCode (e.g.: en-US)
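A rough sketch of how such a pairing and its serialization could look. The Lang and Region enums below are hypothetical, trimmed-down stand-ins for the gbif-api Language and Country enums, and LangLocale is an invented name chosen to avoid clashing with java.util.Locale.

```java
// Hypothetical stand-in for the gbif-api Language enum (only a few values).
enum Lang {
  ENGLISH("en"),
  CHINESE("zh"),
  SPANISH("es");

  private final String iso2;

  Lang(String iso2) {
    this.iso2 = iso2;
  }

  String getIso2LetterCode() {
    return iso2;
  }
}

// Hypothetical stand-in for the Country enum; the constant name is the 2-letter code.
enum Region { US, TW, CN, ES }

final class LangLocale {
  private final Lang lang;
  private final Region region;

  LangLocale(Lang lang, Region region) {
    this.lang = lang;
    this.region = region;
  }

  /** Serializes as language-COUNTRY, e.g. en-US. */
  String serialize() {
    return lang.getIso2LetterCode() + "-" + region.name();
  }

  /** Parses a language-COUNTRY tag; only values present in both enums are accepted. */
  static LangLocale parse(String tag) {
    String[] parts = tag.split("-");
    if (parts.length != 2) {
      throw new IllegalArgumentException("Expected language-COUNTRY, got: " + tag);
    }
    for (Lang candidate : Lang.values()) {
      if (candidate.getIso2LetterCode().equals(parts[0])) {
        // Region.valueOf throws IllegalArgumentException for unknown regions.
        return new LangLocale(candidate, Region.valueOf(parts[1]));
      }
    }
    throw new IllegalArgumentException("Unsupported language: " + parts[0]);
  }
}
```

With this shape, new LangLocale(Lang.CHINESE, Region.TW).serialize() produces zh-TW, and the set of valid values is bounded by whatever the two enums contain.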

timrobertson100 commented 4 years ago

Since we can't fix Language to work for us, I propose we consider marking Language as deprecated with instructions to use a LanguageCode. This is similar to the original proposal but with a naming change and following typical behavior for retiring something still in use.

If we did this, the deprecation notice should state that Language is expected to be removed in a v2 GBIF API, and that LanguageCode can contain a mix of existing Language codes plus the subset of Crowdin language codes needed to meet our foreseen needs, with more added in future releases as necessary.
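As a sketch of what that retirement path could look like in Java: the enum below is a hypothetical two-value stand-in for the real Language enum, and LanguageCode with its of and fromLegacy methods is invented here purely for illustration.

```java
/**
 * Hypothetical, trimmed-down stand-in for the existing enum.
 *
 * @deprecated Expected to be removed in a v2 API; use {@link LanguageCode} instead.
 */
@Deprecated
enum Language {
  ENGLISH("eng"),
  CHINESE("zho");

  private final String iso3;

  Language(String iso3) {
    this.iso3 = iso3;
  }

  String getIso3LetterCode() {
    return iso3;
  }
}

/** Hypothetical replacement that can also carry region-qualified codes such as zh-TW. */
final class LanguageCode {
  private final String code;

  private LanguageCode(String code) {
    this.code = code;
  }

  static LanguageCode of(String code) {
    return new LanguageCode(code);
  }

  /** Migration helper: wraps a value of the deprecated enum. */
  @SuppressWarnings("deprecation")
  static LanguageCode fromLegacy(Language language) {
    return of(language.getIso3LetterCode());
  }

  String getCode() {
    return code;
  }
}
```

Existing callers keep compiling against Language (with a deprecation warning), while new code can hold values the enum cannot express.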

3-letter ISO codes look like a repeat of previous mistakes.

The Locale proposal looks likely to be limited in similar ways to 2 and 3 letter ISO codes (but I recognize the attempt to accommodate requests stated on this thread).

MattBlissett commented 4 years ago

"3-letter ISO codes look like a repeat of previous mistakes."

What were those mistakes? Or, what are our requirements?

- Representing the many languages for which we have vernacular names. This can cover every language (major, minority, differences between countries etc).
  - This rules out an enum, it won't fit. That doesn't matter; we hardly ever refer to the languages in code. The ones we do can have a constant defined.
- Representing the languages we have in the registry, for dataset descriptions etc, which is a much smaller set, but still requires the country (zh-CN, zh-TW).
- Representing the languages the portal is translated to. Aligning with Crowdin would be helpful, otherwise including a mapping from these.
- Serializing the result into something reasonable for users of the checklist, registry and portal APIs.

Locale(LanguageCode, CountryCode) would cover these, except:

- Adding a ScriptCode would allow for when a language is written in different scripts, e.g. kaz-KZ-Cyrl and kaz-KZ-Latn. That might not be necessary, as the situation where it would be used (vernacular names) already accepts multiple values: hbs-SR-Cyrl: голуб hbs-SR-Latn: golub.
- Where a language is written in different ways that do not follow country borders. I mention this only because I see that an IETF language tag could then describe the GBIF/UN-style language: en-GB-oxendict.
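For reference, a small sketch of how script and variant subtags behave in java.util.Locale, which follows the IETF (BCP 47) syntax. Note that BCP 47 orders the subtags as language-script-region, so the standard form is kk-Cyrl-KZ rather than the language-region-script order used in the examples above. The class and method names are invented for the example.

```java
import java.util.Locale;

class ScriptAndVariantTags {
  /** Builds a language-script-region tag in BCP 47 order. */
  static String tag(String language, String script, String region) {
    return new Locale.Builder()
        .setLanguage(language)
        .setScript(script)
        .setRegion(region)
        .build()
        .toLanguageTag();
  }

  public static void main(String[] args) {
    // Same language and country, two scripts.
    System.out.println(tag("kk", "Cyrl", "KZ"));
    System.out.println(tag("kk", "Latn", "KZ"));

    // A variant subtag covers conventions that do not follow country borders.
    Locale oxford = Locale.forLanguageTag("en-GB-oxendict");
    System.out.println(oxford.getLanguage() + " " + oxford.getCountry() + " " + oxford.getVariant());
  }
}
```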

If, alternatively, this is only for a small number of languages we choose to support, then an enum with 8-10 values seems reasonable.

marcos-lg commented 4 years ago

To summarize, and if I understood correctly, it looks like the Locale(LanguageCode, CountryCode) option is preferred? It seems useful to keep the current Language enum for the places where we need the 3-letter ISO codes; the Locale option is really an extension of it that covers a different use case. If at some point we need something this Locale enum does not cover (the two cases Matt mentioned), we'll see how to handle it.

So I think we can do this next:

- I convert this new enum into a Locale(LanguageCode, CountryCode).
- We clean it up a little and remove languages that we presumably won't need.
- I create a PR with these changes and you guys review it or make changes there (I will need help with the previous step).

Please comment if I've missed something or you disagree with something.

MattBlissett commented 4 years ago

This is the kind of thing I had in mind:

public class LanguageCode {
  private final String code2, code3, englishName;
  // Always use three-letter codes to serialize
  public static LanguageCode fromString(String code) ...
  // Validate and cache based on the ISO list Markus posted
}

Then we need a "Locale", except since there's java.util.Locale I think we should pick a different name. How about LanguageRegion?

public class LanguageRegion {
  private final LanguageCode languageCode;
  private final Optional<Country> region;
  // Serialize using the IETF form, i.e. prefer the two-letter language code if it exists.
  // It's possible to create "es_JP" or whatever; I don't think deciding what's valid is this class's issue.
  public static final LanguageRegion EN = ... // if it's useful to have these in code
  public static final LanguageRegion ES = ...
  ... fromString(String code)
  ... fromString(language, region)
}

The current Language can then be deprecated.
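A compilable sketch of the two classes outlined above, under several assumptions: the language table is a three-entry hard-coded sample rather than the full ISO 639-3 list, the region is a plain String standing in for the Country enum, and the serialize and accessor method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class LanguageCode {
  // Cache keyed by both 2- and 3-letter codes. Sample data only.
  private static final Map<String, LanguageCode> BY_CODE = new HashMap<>();

  static {
    register(new LanguageCode("en", "eng", "English"));
    register(new LanguageCode("zh", "zho", "Chinese"));
    register(new LanguageCode(null, "haw", "Hawaiian")); // no 2-letter code
  }

  private final String code2; // null when the language has no 2-letter code
  private final String code3;
  private final String englishName;

  private LanguageCode(String code2, String code3, String englishName) {
    this.code2 = code2;
    this.code3 = code3;
    this.englishName = englishName;
  }

  private static void register(LanguageCode lc) {
    BY_CODE.put(lc.code3, lc);
    if (lc.code2 != null) {
      BY_CODE.put(lc.code2, lc);
    }
  }

  /** Accepts a 2- or 3-letter code, case-insensitively, and validates it against the table. */
  static LanguageCode fromString(String code) {
    LanguageCode lc = BY_CODE.get(code.toLowerCase(Locale.ROOT));
    if (lc == null) {
      throw new IllegalArgumentException("Unknown language code: " + code);
    }
    return lc;
  }

  Optional<String> getCode2() {
    return Optional.ofNullable(code2);
  }

  String getEnglishName() {
    return englishName;
  }

  /** Always serializes as the three-letter code. */
  String serialize() {
    return code3;
  }
}

final class LanguageRegion {
  private final LanguageCode languageCode;
  private final String region; // ISO 3166-1 alpha-2, or null when absent

  private LanguageRegion(LanguageCode languageCode, String region) {
    this.languageCode = languageCode;
    this.region = region;
  }

  static LanguageRegion fromString(String language, String region) {
    String normalized = region == null ? null : region.toUpperCase(Locale.ROOT);
    // No check that the combination is sensible; "es-JP" is accepted.
    return new LanguageRegion(LanguageCode.fromString(language), normalized);
  }

  /** Parses tags such as "zh-TW", "zho_TW" or "haw". */
  static LanguageRegion fromString(String tag) {
    String[] parts = tag.split("[-_]", 2);
    return fromString(parts[0], parts.length > 1 ? parts[1] : null);
  }

  /** IETF-style form: prefer the two-letter language code when one exists. */
  String serialize() {
    String lang = languageCode.getCode2().orElse(languageCode.serialize());
    return region == null ? lang : lang + "-" + region;
  }
}
```

In this sketch LanguageRegion.fromString("zho_tw").serialize() yields zh-TW, while a language without a two-letter code falls back to its three-letter form.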

marcos-lg commented 4 years ago

Ok, so I understand you mean to load all languages at startup from the file Markus posted (as they do in CoL)? And for the LanguageRegion we also have to do the same in order to know all the possible language-country combinations. Probably everything has to be in the same file since the LanguageRegion depends on the LanguageCode, so that file as it is now is not valid for us.
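A minimal sketch of loading such a table at startup. It assumes a tab-separated layout with a header row and the columns Id, Part2B, Part2T, Part1, Scope, Language_Type, Ref_Name, Comment, which is how I understand the SIL download to be structured; check the actual file before relying on the column positions. The class name is invented, and the sample rows are inlined so the example runs without the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class Iso639Table {
  private static final int ID = 0;       // 3-letter code (assumed column position)
  private static final int PART1 = 3;    // 2-letter code, may be empty (assumed)
  private static final int REF_NAME = 6; // English reference name (assumed)

  /** Maps every 3-letter id, and the 2-letter code where present, to the reference name. */
  static Map<String, String> load(BufferedReader reader) {
    Map<String, String> names = new HashMap<>();
    try {
      String line = reader.readLine(); // skip the header row
      while ((line = reader.readLine()) != null) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length <= REF_NAME) {
          continue; // ignore malformed rows
        }
        names.put(cols[ID], cols[REF_NAME]);
        if (!cols[PART1].isEmpty()) {
          names.put(cols[PART1], cols[REF_NAME]);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    String sample =
        "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
            + "eng\teng\teng\ten\tI\tL\tEnglish\t\n"
            + "haw\thaw\thaw\t\tI\tL\tHawaiian\t\n";
    Map<String, String> names = load(new BufferedReader(new StringReader(sample)));
    System.out.println(names.get("en") + ", " + names.get("haw"));
  }
}
```

This only covers the language half; the language-country combinations mentioned above would need a second source, since the ISO 639-3 table carries no region information.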

Also, since they are not enums anymore we'd also have to do some changes in the http://api.gbif.org/v1/enumeration endpoint to accommodate these classes (probably agreed with the front-end developers).

As a long-term solution it looks good, but it requires some time, especially to come up with the file with all the possible combinations.

Since this is blocking the vocabulary from starting the import and curation of vocabularies, I suggest that we move this issue to the gbif-api and I move the TranslationLanguage enum from gbif-api to the vocabulary project and I remove some weird languages. As long as the new future implementation uses the same serialization it's no problem to change the vocabulary to use different classes.

Does this make sense to you?

MattBlissett commented 4 years ago

"Since this is blocking the vocabulary from starting the import and curation of vocabularies, I suggest that we move this issue to the gbif-api and I move the TranslationLanguage enum from gbif-api to the vocabulary project and I remove some weird languages. As long as the new future implementation uses the same serialization it's no problem to change the vocabulary to use different classes."

Yes, that's fine for the moment. It gives more time to consider how the API should handle languages.

marcos-lg commented 4 years ago

I moved the enum and renamed it: https://github.com/gbif/vocabulary/blob/master/model/src/main/java/org/gbif/vocabulary/model/enums/LanguageRegion.java

I also created an endpoint in the vocabulary to retrieve the values: http://api.gbif-dev.org/v1/vocabularyLanguage

It's only deployed in DEV for now.

Changes can still be made if the UI needs them or if more language cleanup is required.

marcos-lg commented 4 years ago

I'm closing this; the discussion can continue in https://github.com/gbif/gbif-api/issues/51

gbif / vocabulary

language codes #32