Open RedactedCode opened 1 year ago
langcodes seems to solve your problem: it implements BCP 47, which, from what I understand, is backwards compatible with ISO 639. You can use it to build a map from language tags to their English names with Language.get(tag).display_name("en"):
# pip install langcodes language_data
from langcodes import tag_is_valid, Language
tags = ["en", "fr", ...]
unique_tags = set(tags)
tags_map = {
    tag: Language.get(tag).display_name("en") if tag_is_valid(tag) else "Unknown"
    for tag in unique_tags
}
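One thing to watch for: the labels fastText returns carry a `__label__` prefix (for example `__label__en`), so the prefix has to come off before the tag goes into a map like `tags_map`. A minimal sketch, assuming that prefix; `label_to_tag`, `label_to_name` and `SAMPLE_NAMES` are hypothetical names of mine, and the tiny hand-written dict only stands in for a map built with langcodes:

```python
PREFIX = "__label__"  # prefix fastText puts in front of each predicted label


def label_to_tag(label: str) -> str:
    """Strip the fastText prefix, leaving the bare language code."""
    return label[len(PREFIX):] if label.startswith(PREFIX) else label


# Hypothetical stand-in for a langcodes-built map (illustration only).
SAMPLE_NAMES = {"en": "English", "fr": "French", "de": "German"}


def label_to_name(label: str, names=SAMPLE_NAMES, default: str = "Unknown") -> str:
    """Look up the English name for a fastText label, with a fallback."""
    return names.get(label_to_tag(label), default)


if __name__ == "__main__":
    for label in ("__label__en", "__label__fr", "__label__xx"):
        print(label, "->", label_to_name(label))
```

In practice you would pass the langcodes-built map as `names` instead of the sample dict, and feed in the labels that come back from the model's predict call.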
I'm using the prebuilt lid.176.ftz model to do simple language ID on short texts (160 chars or fewer) using the Python module.
Is there a lookup table (dictionary) for the labels?
eg
Some of the labels fastText returns are for quite obscure languages & I've had to trawl a lot of ISO-639 docs to establish what they refer to in order to build my own lookup table.
Or have I simply missed something in the docs/API that tells me how to get these?