antonisa / lang2vec

A simple library for querying the URIEL typological database.
Creative Commons Attribution Share Alike 4.0 International

Inconsistency with 'alb' lang in LEARNED_LANGUAGES list #2

Open aoncevay opened 5 years ago

aoncevay commented 5 years ago

I've identified that the language 'alb' doesn't have learned features, but both LEARNED_LANGUAGES and available_learned_languages() lists report the opposite. A simple test:

import lang2vec.lang2vec as l2v

l = 'alb'
try:
    # Fails for 'alb', even though the lists below include it
    l2v.get_features(l, "learned")
except Exception:
    print(False)
print(l in l2v.available_learned_languages())
print(l in l2v.LEARNED_LANGUAGES)

Output:

False
True
True

As far as I know, this is the only inconsistency, unless there is something about the lists that I'm not considering. If I find anything similar, I'll report it here as well.
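
To check whether any other code is affected, the same test could be run over every listed language. This is only a sketch: `find_unloadable` and `fake_fetch` are hypothetical names, not part of lang2vec, and the lookup is passed in as an argument so the helper can be tried without the library installed.

```python
def find_unloadable(languages, fetch):
    """Return the codes in `languages` for which `fetch(code)` raises."""
    failed = []
    for code in languages:
        try:
            fetch(code)
        except Exception:
            failed.append(code)
    return failed


# Stand-in lookup for demonstration only: pretends 'alb' has no learned vector.
def fake_fetch(code):
    if code == "alb":
        raise KeyError(code)
    return [0.0]


print(find_unloadable(["eng", "alb", "fra"], fake_fetch))  # ['alb']

# With lang2vec installed, the real check would presumably look like:
# import lang2vec.lang2vec as l2v
# find_unloadable(l2v.available_learned_languages(),
#                 lambda code: l2v.get_features(code, "learned"))
```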

antonisa commented 5 years ago

Hi, thanks for pointing out the error! I was able to reproduce it, and I traced the issue back to the following:

The URIEL database uses 'sqi' as the identifier for Albanian, so the code maps 'alb' to 'sqi'. However, the learned features use 'alb' for Albanian, so the lookup fails.

So:

f = l2v.get_features('sqi', 'phonology_ethnologue') works

f = l2v.get_features('alb', 'phonology_ethnologue') also works (because both use 'sqi' for the lookup)

but f = l2v.get_features('alb', 'learned') fails.
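
The mismatch can be reproduced with a toy model. Everything here (`LETTER_CODES`, `URIEL_FEATURES`, `LEARNED_FEATURES`, the feature values, and this `get_features`) is made up for illustration and is not the actual lang2vec internals; it only mirrors the behaviour described above, where the alias is rewritten before either table is consulted.

```python
# Toy model only: these names and tables are illustrative, not lang2vec internals.
LETTER_CODES = {"alb": "sqi"}            # alias -> code used by URIEL
URIEL_FEATURES = {"sqi": [1, 0, 1]}      # keyed by the URIEL code
LEARNED_FEATURES = {"alb": [0.1, 0.2]}   # keyed by the code the learned vectors use


def get_features(lang, feature_set):
    code = LETTER_CODES.get(lang, lang)  # 'alb' is rewritten to 'sqi' here
    table = LEARNED_FEATURES if feature_set == "learned" else URIEL_FEATURES
    return table[code]                   # KeyError when the table lacks `code`


print(get_features("sqi", "phonology_ethnologue"))  # [1, 0, 1]
print(get_features("alb", "phonology_ethnologue"))  # [1, 0, 1]
try:
    get_features("alb", "learned")
except KeyError as err:
    print("lookup failed for", err)  # lookup failed for 'sqi'
```

In this model the learned table is never asked for 'alb', because the alias has already been replaced by the time the lookup happens.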

If you built your code from source, you can easily circumvent this. I think removing the "alb": "sqi" mapping from the letter_codes.json file should do it, but then you'd have to make sure to use "alb" for learned features and "sqi" for the others.
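
If the mapping is removed, choosing the right code per feature set could be wrapped in a small helper. This is a sketch under that assumption: `code_for` and the two override tables are hypothetical, not part of lang2vec, and they only cover the Albanian case discussed here.

```python
# Hypothetical helper, assuming the "alb": "sqi" mapping has been removed
# from letter_codes.json so that lang2vec no longer rewrites the code itself.
LEARNED_CODE = {"sqi": "alb"}  # code the learned vectors expect
URIEL_CODE = {"alb": "sqi"}    # code the URIEL feature sets expect


def code_for(lang, feature_set):
    """Pick the language code that matches the requested feature set."""
    if feature_set == "learned":
        return LEARNED_CODE.get(lang, lang)
    return URIEL_CODE.get(lang, lang)


print(code_for("sqi", "learned"))               # alb
print(code_for("alb", "phonology_ethnologue"))  # sqi

# Intended use with the library (not run here):
# l2v.get_features(code_for("alb", "learned"), "learned")
```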

Unfortunately, I cannot push a new version to PyPI for now (due to the size of the library, they asked that updates be infrequent), but I'll try to at least update the source on GitHub.