[working-draft] Public domain datasets from the Translation Initiative for COVID-19 in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange)
Except for the Google datasets (which explicitly document the use of BCP-47 (https://tools.ietf.org/html/bcp47), and at the moment show no known potential error), both the work from Translators Without Borders (which may have been imported into some centralized tool before being exported to the TICO-19 repository, so this may actually not be a mistake from TWB) and the datasets provided by Facebook have some non-standard language codes.
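As a first pass at finding these cases, a minimal sketch in plain Python could flag tags that are not even well-formed BCP-47. This is simplified from the full RFC 5646 grammar and also enforces the conventional casing; the sample tags at the end are only illustrative, not taken from the datasets:

```python
import re

# Simplified well-formedness check for BCP-47 tags of the common shape
# language[-script][-region]; the full ABNF in RFC 5646 also allows
# extended language subtags, variants, extensions and private use, and
# tags are case-insensitive (this regex assumes the conventional casing).
_SIMPLE_BCP47 = re.compile(
    r'^[a-z]{2,3}'                 # primary language subtag (ISO 639)
    r'(-[A-Z][a-z]{3})?'           # optional script subtag (ISO 15924, e.g. Latn)
    r'(-(?:[A-Z]{2}|[0-9]{3}))?$'  # optional region subtag (ISO 3166-1 / UN M.49)
)

def looks_like_bcp47(tag: str) -> bool:
    """Rough well-formedness check; real validation should also confirm
    that each subtag exists in the IANA Language Subtag Registry."""
    return bool(_SIMPLE_BCP47.match(tag))

# Hypothetical tags, just to show the kind of output:
for tag in ('pt-BR', 'zh-Hans-CN', 'zh_TW', 'english'):
    print(tag, looks_like_bcp47(tag))
```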
The ideal
One approach here is to also use BCP-47 on the work not yet transformed to HXLTM. The way we encode the HXL attributes also adds ISO 639-3 and ISO 15924, but we need a better starting point.
This applies both to the content of the CSVs and TMXs and to the filenames.
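A minimal sketch of what deriving the HXL attributes from a BCP-47 tag could look like, assuming a `+i_.../+is_...` attribute layout (the exact attribute names should follow the HXLTM specification) and a tiny hand-coded mapping instead of the real ISO 639-3 / ISO 15924 tables:

```python
# Illustrative only: a hand-coded excerpt; a real implementation would rely
# on the IANA Language Subtag Registry and ISO 639-3 tables instead.
ISO_639_1_TO_3 = {'pt': 'por', 'es': 'spa', 'zh': 'zho', 'am': 'amh'}
DEFAULT_SCRIPT = {'pt': 'Latn', 'es': 'Latn', 'zh': 'Hans', 'am': 'Ethi'}

def hxltm_attributes(bcp47_tag: str) -> str:
    """Build HXL attributes for a language column from a BCP-47 tag.

    The +i_/+is_ layout below follows the idea described above (BCP-47
    plus ISO 639-3 plus ISO 15924); the exact attribute names may need
    to be adjusted to match the HXLTM convention.
    """
    parts = bcp47_tag.split('-')
    lang = parts[0].lower()
    # Take an explicit 4-letter script subtag if present, else a default.
    script = next((p.title() for p in parts[1:] if len(p) == 4),
                  DEFAULT_SCRIPT.get(lang, ''))
    # Fall back to the tag itself when it is not in this small excerpt.
    alpha3 = ISO_639_1_TO_3.get(lang, lang)
    attrs = f'+i_{lang}+i_{alpha3}'
    if script:
        attrs += f'+is_{script.lower()}'
    return attrs

print(hxltm_attributes('pt-BR'))    # +i_pt+i_por+is_latn
print(hxltm_attributes('zh-Hant'))  # +i_zh+i_zho+is_hant
```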
About the redundant country names
Some country names seem to be relevant, but some are redundant. This is different from normalizing the language codes, since we cannot simply remove the countries either. Also, the languages that seem to have no country associated with them (the XX suffix) tend to be exactly the ones that may have bigger variation; for example, the translated wordlists from Facebook seem to carry redundant country codes for languages that are spoken mostly in a single country, while omitting them for the ones that actually have more variation.
On this point, we could try to do some checks based on whether the Unicode CLDR would consider the country codes redundant.
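A minimal sketch of such a check, using a tiny hand-copied excerpt of the CLDR likelySubtags data (a real check should load the full likelySubtags dataset shipped with Unicode CLDR):

```python
# Excerpt of CLDR likelySubtags: the most likely full tag for a bare language.
LIKELY_SUBTAGS = {
    'pt': 'pt-Latn-BR',
    'en': 'en-Latn-US',
    'es': 'es-Latn-ES',
    'zh': 'zh-Hans-CN',
}

def region_is_redundant(tag: str) -> bool:
    """Return True when the region subtag adds nothing beyond what CLDR
    would already infer for the bare language (e.g. 'pt-BR' vs. plain 'pt')."""
    parts = tag.split('-')
    lang = parts[0].lower()
    region = next((p for p in parts[1:] if len(p) == 2 and p.isalpha()), None)
    if region is None or lang not in LIKELY_SUBTAGS:
        return False
    likely_region = LIKELY_SUBTAGS[lang].split('-')[-1]
    return region.upper() == likely_region.upper()

print(region_is_redundant('pt-BR'))  # True: CLDR already implies BR for pt
print(region_is_redundant('pt-PT'))  # False: PT is meaningful variation
```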
Also, while I'm citing the Facebook datasets here, since the 3 big collaborators (Google, Facebook, TWB) use different language codes, they may actually each be using what is common inside their own company. So, except for the cases that may be #4, this normalization step would need to be done after datasets from several collaborators are distributed in initiatives like TICO-19 in the future.
Note: this issue is different from #4.