amstu2 closed this issue 5 months ago
Hi @amstu2,
Thanks for bringing up the issue.
Do you know of a source that maps all the languages available in ISO 639-3 to BCP-47?
Wouldn't it be easier to have a function that maps the labels of GlotLID to another format, like BCP-47, instead of retraining the model for these mappings?
I'm aware of this list of language tags (https://huggingface.co/datasets/lbourdois/language_tags). Based on the BCP-47 description, when a language for which we already have an ISO 639-3 code also has an ISO 639-1 code, we should use the ISO 639-1 code instead. However, I'm unsure about the country (region) part for now. As for the writing system, it's unclear when the script should be included in the tag and when it should be omitted.
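To make the idea concrete, here is a rough sketch of what such a mapping function could look like. It assumes GlotLID labels have the form `__label__<ISO 639-3>_<ISO 15924 script>` (for example `__label__eng_Latn`). The function name `glotlid_to_bcp47` and both lookup tables are hypothetical and only hold a few illustrative entries; a real version would load them from the IANA Language Subtag Registry or a dataset like the one linked above. The script is dropped only when it matches the language's default ("Suppress-Script") entry, and the region part is left out entirely.

```python
# Illustrative subset only: ISO 639-3 codes that have an ISO 639-1 equivalent.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "deu": "de", "rus": "ru", "srp": "sr", "zho": "zh"}

# Illustrative subset only: default script per language (BCP 47 "Suppress-Script").
# Languages written in several scripts (e.g. sr, zh) have no entry here.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "de": "Latn", "ru": "Cyrl"}


def glotlid_to_bcp47(label: str) -> str:
    """Convert a label such as '__label__eng_Latn' to a BCP-47-style tag."""
    prefix = "__label__"
    if label.startswith(prefix):
        label = label[len(prefix):]

    # Split 'eng_Latn' into language and script; script may be missing.
    lang, _, script = label.partition("_")
    lang = lang.lower()
    script = script.title()  # ISO 15924 codes are title-cased, e.g. 'Latn'

    # Prefer the two-letter code when one exists, otherwise keep ISO 639-3.
    lang = ISO639_3_TO_1.get(lang, lang)

    # Omit the script when it is absent or is the default for this language.
    if not script or SUPPRESS_SCRIPT.get(lang) == script:
        return lang
    return f"{lang}-{script}"


if __name__ == "__main__":
    for example in ["__label__eng_Latn", "__label__srp_Cyrl", "__label__ceb_Latn"]:
        print(example, "->", glotlid_to_bcp47(example))
```

With a lookup like this, the model's labels stay unchanged and the conversion happens as a post-processing step on its predictions.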
Firstly, thanks to the authors for releasing the model on GitHub 🙂!
I'm currently benchmarking different language identification models. As highlighted in Section 5.2 of the GlotLID paper, matching the language code metadata between models and datasets can be difficult.
Given that BCP-47 is backwards compatible with ISO 639-3 and is the IETF's best practice for language codes, would it be possible to have an alternative version of GlotLID that uses BCP-47 labels?
Thanks!