cisnlp / GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
https://arxiv.org/abs/2310.16248
Apache License 2.0

Modify to output BCP-47 labels? #2

Closed by amstu2 5 months ago

amstu2 commented 8 months ago

Firstly, thanks to the authors for releasing the model on GitHub 🙂!

I'm currently benchmarking different language identification models. As highlighted in Section 5.2 of the GlotLID paper, matching the language code metadata between models and datasets can be difficult.

Given that BCP-47 is backwards compatible with ISO 639-3 and is the IETF's best practice for language tags, would it be possible to have an alternative version of GlotLID that uses BCP-47 labels?

Thanks!

kargaranamir commented 7 months ago

Hi @amstu2,

Thanks for bringing up the issue.

Do you know of a source for mapping all the languages available in ISO 639-3?

Wouldn't it be easier to have a function that maps the labels of GlotLID to another format, like BCP-47, instead of retraining the model for these mappings?

I'm aware of these language tags: https://huggingface.co/datasets/lbourdois/language_tags

Based on the BCP-47 description, if a language for which we already have an ISO 639-3 code also has an ISO 639-1 code, we should use the ISO 639-1 code instead. However, I'm unsure about the country (region) part for now. As for the writing system, it's unclear when the script should be included in the tag and when it should be left out.
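
A minimal sketch of such a mapping function, in case it helps the discussion. It assumes GlotLID labels have the form `__label__<iso639-3>_<Script>` (for example `__label__eng_Latn`). The function name `glotlid_to_bcp47` and both lookup tables are hypothetical and hold only a handful of illustrative entries; a real version would need to load the full ISO 639-3 to ISO 639-1 table and the `Suppress-Script` fields from the IANA Language Subtag Registry. The script rule used here (drop the script subtag only when it matches the language's `Suppress-Script`) is one possible answer to the open question above, not a settled decision. No region subtag is produced, since the labels carry no region information and the region is optional in BCP-47.

```python
# Illustrative subsets only (assumption): real tables would be loaded from
# the ISO 639-3 code tables and the IANA Language Subtag Registry.
ISO639_3_TO_1 = {"eng": "en", "deu": "de", "fra": "fr", "rus": "ru", "srp": "sr"}
SUPPRESS_SCRIPT = {"en": "Latn", "de": "Latn", "fr": "Latn", "ru": "Cyrl"}

LABEL_PREFIX = "__label__"


def glotlid_to_bcp47(label: str) -> str:
    """Map a GlotLID-style label to a BCP-47-style tag (hypothetical helper)."""
    # Strip the fastText label prefix if it is present.
    tag = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label

    # Split "eng_Latn" into the language code and the script code.
    lang, _, script = tag.partition("_")
    lang = lang.lower()
    script = script.capitalize()

    # Prefer the two-letter ISO 639-1 code when one exists.
    primary = ISO639_3_TO_1.get(lang, lang)

    # Keep the script unless it is the default (suppressed) script.
    if script and SUPPRESS_SCRIPT.get(primary) != script:
        return f"{primary}-{script}"
    return primary


if __name__ == "__main__":
    for example in ("__label__eng_Latn", "__label__srp_Cyrl", "__label__rus_Cyrl"):
        print(example, "->", glotlid_to_bcp47(example))
```

With this approach the model itself stays untouched: the conversion is applied to the predicted labels after inference, so the tables can be corrected or extended without retraining.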