facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Lookup tables for language labels - Python module #1304

Open RedactedCode opened 1 year ago

RedactedCode commented 1 year ago

I'm using the prebuilt lid.176.ftz model to do simple language ID on short texts (160 chars or fewer) using the Python module.

Is there a lookup table (dictionary) for the labels?

e.g.

{
    "en": "English", 
    "fr": "French",
     ...
}

Some of the labels fastText returns are quite obscure languages & I've had to trawl a lot of ISO-639 docs to establish what they refer to in order to build my own lookup table.

Or have I simply missed something in the docs/API that tells me how to get these?

CarlosGDCJ commented 1 year ago

langcodes seems to solve your problem: it implements BCP 47, which, from what I understand, is backwards compatible with ISO 639. You can use it to build a map from language tags to their English names using Language.get(tag).display_name("en"):

# pip install langcodes language_data
from langcodes import tag_is_valid, Language

tags = ["en", "fr"]  # ... plus whatever other language tags you need
unique_tags = set(tags)
tags_map = {
    tag: Language.get(tag).display_name("en") if tag_is_valid(tag) else "Unknown"
    for tag in unique_tags
}
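
One thing to watch for: the fastText language ID models return labels with a `__label__` prefix (e.g. `__label__en`), so the prefix has to be stripped before a tag is looked up. Below is a minimal sketch of that step. The helper names `label_to_tag` and `build_name_map` are made up for illustration and are not part of fastText or langcodes; the name lookup is passed in as a function, so the langcodes call from the snippet above can be plugged in, as shown in the commented-out usage at the bottom (which assumes `lid.176.ftz` is in the working directory).

```python
from typing import Callable, Dict, Iterable, Optional

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of every label


def label_to_tag(label: str) -> str:
    """Strip the fastText label prefix, if present (hypothetical helper)."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def build_name_map(
    labels: Iterable[str],
    name_for_tag: Callable[[str], Optional[str]],
    fallback: str = "Unknown",
) -> Dict[str, str]:
    """Map each bare language tag to a display name (hypothetical helper).

    name_for_tag returns a name, or None when the tag is not recognised,
    in which case the fallback string is stored instead.
    """
    names = {}
    for label in labels:
        tag = label_to_tag(label)
        names[tag] = name_for_tag(tag) or fallback
    return names


# Usage with fastText and langcodes (requires both packages and the model file):
#
# import fasttext
# from langcodes import Language, tag_is_valid
#
# model = fasttext.load_model("lid.176.ftz")
# names = build_name_map(
#     model.get_labels(),
#     lambda tag: Language.get(tag).display_name("en") if tag_is_valid(tag) else None,
# )
```

Building the map once from `model.get_labels()` covers every label the model can return, so each prediction only needs a dictionary lookup afterwards.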