cisnlp / GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
https://arxiv.org/abs/2310.16248
Apache License 2.0

Which ISO format? #4

Closed: Vitaly-Protasov closed this issue 5 months ago

Vitaly-Protasov commented 5 months ago

Hello, in the description you mention that you use "three-letter ISO codes with script". Which format exactly do you use? Could you please specify the ISO standard you rely on?

kargaranamir commented 5 months ago

Hi @Vitaly-Protasov,

Three-letter ISO codes:

ISO 639-3: https://iso639-3.sil.org

For example, the current GlotLID label for English is eng_Latn, where eng is the ISO 639-3 code for English: https://iso639-3.sil.org/eng.

We have also provided all the ISO 639-3 codes and the names of the languages here: https://github.com/cisnlp/GlotLID/blob/main/assets/language_names.json.

Script: For the script part of the label (for example, Latn, Arab), we use the four-letter ISO 15924 codes: https://en.wikipedia.org/wiki/ISO_15924.
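
To make the label format concrete, here is a minimal sketch of how a label such as eng_Latn could be split into its ISO 639-3 and ISO 15924 parts. The function name split_label and the regular expression are illustrative assumptions, not part of GlotLID. The optional __label__ prefix is the usual fastText convention and is assumed here. The pattern checks only the shape of the label, not whether the codes are actually registered.

```python
import re
from typing import Tuple

# Assumed label shape: three lowercase letters (ISO 639-3), an underscore,
# then four letters with an initial capital (ISO 15924), e.g. "eng_Latn".
# An optional fastText-style "__label__" prefix is tolerated.
LABEL_RE = re.compile(r"^(?:__label__)?([a-z]{3})_([A-Z][a-z]{3})$")


def split_label(label: str) -> Tuple[str, str]:
    """Split a label like 'eng_Latn' into (language_code, script_code).

    Hypothetical helper for illustration; it validates the shape only.
    """
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a <iso639-3>_<iso15924> label: {label!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(split_label("eng_Latn"))
    print(split_label("__label__arb_Arab"))
```

Keeping the two parts separate makes it easy to look up the language name by its three-letter code and to filter predictions by script.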

Vitaly-Protasov commented 5 months ago

@kargaranamir Thanks!

kargaranamir commented 5 months ago

You're welcome.