MortenHofft closed this issue 4 years ago.
New enum is here: http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage
Java code here: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/TranslationLanguage.java
It's deployed in DEV
What's this for? (Meaning the API change more than the vocabulary using it.)
We have language codes for interpreting the language of a vernacular name, which will include minority and dead languages.
I don't know if that's the same thing as the languages we translate the portal / registry into.
@MortenHofft requested it to differentiate between language variants such as the Chinese ones, which our current `Language` enum doesn't support. The vocabularies will also be used to populate, for example, dropdowns in the UI, and the UI uses these variants.
The only reason I put it in gbif-api is consistency: front-end developers get this enum from the same endpoint as the others (http://api.gbif-dev.org/v1/enumeration/basic/TranslationLanguage). Should I move it?
I'm not sure if we should add a second language vocabulary to the v1 API. We already have one, and should consider how it might be extended.
It seems a bit arbitrary to choose Crowdin's list of supported languages. There's a mixture of two and three letter codes, a few without countries, and stuff like Upside Down English and "Quenya" which is Lord of the Rings Elvish.
We'll support these APIs and vocabularies/enumerations for years, so it's worth spending the time to get it right.
@timrobertson100, @mdoering, what do you think?
Yes, I feel similarly. There has been a prominent issue open in the GBIF API for some time about extending the existing but limited language enumeration: https://github.com/gbif/gbif-api/issues/29
For CoL we need to support a wide array of languages for vernacular names. We decided to drop the GBIF `Language` enum and instead go with a large list of three-letter ISO codes (>8,000) taken from https://iso639-3.sil.org/code_tables/download_tables. These no longer fit into an enum.
This does not solve the problem with Simplified and Traditional Chinese, though. They are considered the same language written in different scripts, so you need a locale to distinguish them.
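For comparison, `java.util.Locale` already models exactly this distinction via BCP 47 script subtags (`Hans`/`Hant`); a minimal illustration:

```java
import java.util.Locale;

public class ChineseScripts {
    public static void main(String[] args) {
        // Simplified and Traditional Chinese share the language code "zh";
        // only the script subtag (or a region) tells them apart.
        Locale simplified = new Locale.Builder()
                .setLanguage("zh").setScript("Hans").build();
        Locale traditional = new Locale.Builder()
                .setLanguage("zh").setScript("Hant").build();
        System.out.println(simplified.toLanguageTag());  // zh-Hans
        System.out.println(traditional.toLanguageTag()); // zh-Hant
        // The bare language code is identical for both:
        System.out.println(simplified.getLanguage().equals(traditional.getLanguage())); // true
    }
}
```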
I think it wouldn't be so easy to extend the current `Language` enum to accommodate the locales, because we'd have to change the default serialization to use the locale instead of the 3-letter code as we do now, and this would break code that relies on that.

One solution could be to rename this new enum to `Locale` and not store the ISO 639-1 and 639-3 codes. Actually, this enum could use the current `Language` and `Country` enums, so we'd only support the languages and countries available in those (we could add more languages if needed). So I mean this:

```java
Locale(Language lang, Country country) {}
```

and this `Locale` would be serialized as `Language.getIso2LetterCode()-Country.getIso2LetterCode()` (e.g. `en-US`).
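The serialization idea above can be sketched with stand-in enums. The real gbif-api `Language` and `Country` enums expose `getIso2LetterCode()` as mentioned; the tiny placeholder enums here are illustrative only:

```java
// A minimal sketch of the proposed Locale(Language, Country) wrapper.
// Language and Country are placeholders for the gbif-api enums.
public class LocaleSketch {
    enum Language {
        ENGLISH("en"), SPANISH("es");
        private final String iso2;
        Language(String iso2) { this.iso2 = iso2; }
        String getIso2LetterCode() { return iso2; }
    }

    enum Country {
        UNITED_STATES("US"), SPAIN("ES");
        private final String iso2;
        Country(String iso2) { this.iso2 = iso2; }
        String getIso2LetterCode() { return iso2; }
    }

    // Serialize as <language 2-letter code>-<country 2-letter code>.
    static String serialize(Language lang, Country country) {
        return lang.getIso2LetterCode() + "-" + country.getIso2LetterCode();
    }

    public static void main(String[] args) {
        System.out.println(serialize(Language.ENGLISH, Country.UNITED_STATES)); // en-US
    }
}
```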
Since we can't fix `Language` to work for us, I propose we consider marking `Language` as deprecated, with instructions to use a `LanguageCode`. This is similar to the original proposal but with a naming change, and it follows typical behavior for retiring something still in use.

If we did this, in deprecating we should state that `Language` is expected to be removed in a v2 GBIF API, and `LanguageCode` can contain a mix of existing `Language` codes plus the necessary subset of Crowdin language codes to meet our foreseen needs, adding more in future releases as necessary.
3-letter ISO codes look like a repeat of previous mistakes.

The `Locale` proposal looks likely to be limited in similar ways to 2- and 3-letter ISO codes (but I recognize the attempt to accommodate requests stated in this thread).
> 3-letter ISO codes look like a repeat of previous mistakes.

What were those mistakes? Or, what are our requirements? `Locale(LanguageCode, CountryCode)` would cover these, except:

- A `ScriptCode` would allow for a language written in different scripts, e.g. `kaz-KZ-Cyrl` and `kaz-KZ-Latn`. That might not be necessary, as the situation where it would be used (vernacular names) already accepts multiple values: `hbs-SR-Cyrl: голуб`, `hbs-SR-Latn: golub`.
- Variants like `en-GB-oxendict`.

If, alternatively, this is only for a small number of languages we choose to support, then an enum with 8–10 values seems reasonable.
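As a side note, `java.util.Locale` can already parse script subtags; BCP 47 orders subtags language-script-region, so the tags above would canonically be written with the script before the region (e.g. `kaz-Cyrl-KZ`):

```java
import java.util.Locale;

public class ScriptTags {
    public static void main(String[] args) {
        // BCP 47 canonical order is language-script-region.
        Locale cyrillic = Locale.forLanguageTag("kaz-Cyrl-KZ");
        Locale latin = Locale.forLanguageTag("kaz-Latn-KZ");
        System.out.println(cyrillic.getScript());  // Cyrl
        System.out.println(latin.getScript());     // Latn
        System.out.println(cyrillic.getCountry()); // KZ
    }
}
```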
Summarizing, and if I understood correctly, it looks like the `Locale(LanguageCode, CountryCode)` option is preferred? It looks to me like it's useful to keep the current `Language` enum for the places where we need the 3-letter ISO codes, and the `Locale` option is actually an extension of it to cover a different use case. If at some point we need something that is not covered by this `Locale` enum (the two cases Matt mentioned), we'll see how we do it.

So I think we can do this next:

```java
Locale(LanguageCode, CountryCode)
```

Please comment if I've missed something or you disagree with something.
This is the kind of thing I had in mind:

```java
public class LanguageCode {
    private final String code2, code3, englishName;

    // Always use three-letter codes to serialize.
    // Validate and cache based on the ISO list Markus posted.
    public static LanguageCode fromString(String code) ...
}
```

Then we need a "Locale", except since there's `java.util.Locale` I think we should pick a different name. How about `LanguageRegion`?

```java
public class LanguageRegion {
    private final LanguageCode languageCode;
    private final Optional<Country> region;

    // Serialize using the IETF form, i.e. prefer the two-letter language code if it exists.
    // It's possible to create "es_JP" or whatever; I don't think deciding what's valid is this class's issue.
    private static final LanguageRegion EN = ...; // if it's useful to have these in code
    private static final LanguageRegion ES = ...;

    ... fromString(String code)
    ... fromString(language, region)
}
```
The current `Language` can then be deprecated.
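A minimal sketch of the "prefer the two-letter code" serialization described above, using plain strings in place of the proposed `LanguageCode` type (names and signatures here are illustrative assumptions, not the real implementation):

```java
import java.util.Optional;

public class LanguageRegionSketch {
    // IETF-style serialization: prefer the two-letter language code when one
    // exists, fall back to the three-letter code, and append the region if present.
    static String serialize(String code2, String code3, Optional<String> region) {
        String language = (code2 != null && !code2.isEmpty()) ? code2 : code3;
        return region.map(r -> language + "-" + r).orElse(language);
    }

    public static void main(String[] args) {
        System.out.println(serialize("en", "eng", Optional.of("GB"))); // en-GB
        System.out.println(serialize(null, "hbs", Optional.of("SR"))); // hbs-SR (no 2-letter code)
        System.out.println(serialize("es", "spa", Optional.empty()));  // es (no region)
    }
}
```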
OK, so I understand you mean to load all languages at startup from the file Markus posted (as they do in CoL)? For the `LanguageRegion` we'd also have to do the same in order to know all the possible language-country combinations. Probably everything has to be in the same file, since the `LanguageRegion` depends on the `LanguageCode`, so that file as it is now is not valid for us.

Also, since they are not enums anymore, we'd also have to make some changes in the http://api.gbif.org/v1/enumeration endpoint to accommodate these classes (probably agreed with the front-end developers).

As a long-term solution it looks good, but it requires some time, especially to come up with the file with all the possible combinations.
Since this is blocking the vocabulary project from starting the import and curation of vocabularies, I suggest that we move this issue to gbif-api, and I move the `TranslationLanguage` enum from gbif-api to the vocabulary project and remove some weird languages. As long as the future implementation uses the same serialization, it's no problem to change the vocabulary to use different classes.

Does this make sense to you?
> Since this is blocking the vocabulary from starting the import and curation of vocabularies, I suggest that we move this issue to the gbif-api and I move the `TranslationLanguage` enum from gbif-api to the vocabulary project and I remove some weird languages. As long as the new future implementation uses the same serialization it's no problem to change the vocabulary to use different classes.
Yes, that's fine for the moment. It gives more time to consider how the API should handle languages.
I moved the enum and renamed it: https://github.com/gbif/vocabulary/blob/master/model/src/main/java/org/gbif/vocabulary/model/enums/LanguageRegion.java
I also created an endpoint in the vocabulary service to retrieve the values: http://api.gbif-dev.org/v1/vocabularyLanguage
It's only deployed in DEV for now.
Changes can be made later for UI needs or if more language cleaning is required.
I'll close this; the discussion can be continued in https://github.com/gbif/gbif-api/issues/51
We are currently using 3-letter language codes. That is not enough to describe all the languages we would like to support/describe. An example is zh-TW, Traditional Chinese as used in Taiwan. We already have the website translated into Traditional Chinese, and we do not want to lose this option.
So we need a new enumeration for languages (existing is here). It seems natural to look to Crowdin, as they make a living from translations.
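The collapse can be demonstrated with `java.util.Locale`: both Chinese variants map to the same three-letter ISO 639 code, so a 3-letter enum cannot keep the Traditional Chinese translation distinct.

```java
import java.util.Locale;

public class ThreeLetterCollapse {
    public static void main(String[] args) {
        // zh-TW (Traditional) and zh-CN (Simplified) share one 3-letter code,
        // so the region/script distinction is lost at the 3-letter level.
        String tw = Locale.forLanguageTag("zh-TW").getISO3Language();
        String cn = Locale.forLanguageTag("zh-CN").getISO3Language();
        System.out.println(tw.equals(cn)); // true: the distinction is lost
    }
}
```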