Modify to output BCP-47 labels?

cisnlp / GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023

Apache License 2.0

Hi @amstu2,

Thanks for bringing up the issue.

Do you know of a source that maps all the languages in ISO 639-3 to BCP-47 tags?

Wouldn't it be easier to have a function that maps GlotLID's labels to another format, such as BCP-47, instead of retraining the model with the new labels?

I'm aware of these language tags: https://huggingface.co/datasets/lbourdois/language_tags. Going by the BCP-47 description, when a language we already label with an ISO 639-3 code also has an ISO 639-1 code, we should use the ISO 639-1 code instead. I'm unsure about the country (region) part for now. As for the writing system, it's unclear when the script should be included in the tag and when it should be left out.
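To make the idea concrete, here is a rough sketch of what such a mapping function could look like. It assumes GlotLID labels have the form `__label__eng_Latn` (ISO 639-3 code, underscore, ISO 15924 script). The function name `glotlid_to_bcp47` and both lookup tables are hypothetical and deliberately tiny; a real version would need to build them from the ISO 639-3 code tables and the IANA Language Subtag Registry. One possible answer to the script question is to drop the script when it matches the registry's Suppress-Script entry for that language, but that is only an assumption here. The region part is left out entirely.

```python
# Hypothetical, incomplete tables for illustration only.
# ISO 639-3 -> ISO 639-1, for languages that have a two-letter code.
ISO639_3_TO_1 = {"eng": "en", "deu": "de", "rus": "ru", "srp": "sr", "hin": "hi"}

# Default script per language (modelled on the Suppress-Script field of the
# IANA Language Subtag Registry). Languages without an entry keep their script.
SUPPRESS_SCRIPT = {"en": "Latn", "de": "Latn", "ru": "Cyrl", "hi": "Deva"}

LABEL_PREFIX = "__label__"


def glotlid_to_bcp47(label: str) -> str:
    """Convert a GlotLID-style label to a BCP-47-style tag (no region part)."""
    # Strip the fastText label prefix if it is present.
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]

    # Split "eng_Latn" into language and script; script may be empty.
    lang, _, script = label.partition("_")

    # Prefer the two-letter ISO 639-1 code when one is known.
    lang = ISO639_3_TO_1.get(lang, lang)

    # Omit the script when it is missing or is the language's default script.
    if not script or SUPPRESS_SCRIPT.get(lang) == script:
        return lang
    return f"{lang}-{script}"


if __name__ == "__main__":
    for label in ("__label__eng_Latn", "__label__srp_Cyrl", "__label__quy_Latn"):
        print(label, "->", glotlid_to_bcp47(label))
```

With these tables, `eng_Latn` becomes `en`, while `srp_Cyrl` keeps its script because Serbian has no default script listed. Codes without an ISO 639-1 equivalent pass through unchanged, so the model itself would not need retraining.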

Modify to output BCP-47 labels? #2