facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

the new LID Model lid218e.bin doesn't detect Chinese/Japanese correctly. #5438

Open jingkang99 opened 7 months ago

jingkang99 commented 7 months ago

🐛 Bug

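Compared to lid.176.bin, the new model lid218e.bin is not an improvement: it misidentifies the short Chinese and Japanese strings below, which lid.176.bin classifies correctly. Where is the training data? Thanks.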

To Reproduce

wget -q -k https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
wget -q -k https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
wget -q -k https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

# -*- coding: utf-8 -*-
import fasttext

lang_model_small = "lid.176.ftz"
lang_model_large = "lid.176.bin"
lang_model_pnllb = "lid218e.bin"

class LanguageIdentification:
    def __init__(self):
        self.model = fasttext.load_model(lang_model_pnllb)
    def predict_lang(self, text):
        predictions = self.model.predict(text, k=2) # returns top 2 matching languages
        return predictions

LANGUAGE = LanguageIdentification()

print(LANGUAGE.predict_lang("北京大观园"))
print(LANGUAGE.predict_lang("陣陣秋風處處地奔走"))
print(LANGUAGE.predict_lang("音楽イベント"))
print(LANGUAGE.predict_lang("加州继续迎来灾难性降雨"))
print(LANGUAGE.predict_lang("This is a sample text in English"))

print(LANGUAGE.predict_lang("加"))
print(LANGUAGE.predict_lang("風"))

Expected behavior

Wrong output from lid218e.bin:

(('__label__vie_Latn', '__label__bod_Tibt'), array([0.38840234, 0.25088209]))
(('__label__arb_Arab', '__label__bod_Tibt'), array([0.24475221, 0.21423988]))
(('__label__ces_Latn', '__label__slk_Latn'), array([0.51465511, 0.43502358]))
(('__label__eng_Latn', '__label__azb_Arab'), array([0.17568642, 0.16173792]))
(('__label__eng_Latn', '__label__kor_Hang'), array([1.00000989e+00, 1.00869729e-05]))
(('__label__bod_Tibt', '__label__eng_Latn'), array([0.77169538, 0.22752361]))
(('__label__eng_Latn', '__label__bod_Tibt'), array([0.53909516, 0.45798311]))

With lang_model_large = "lid.176.bin", the same inputs are detected correctly:

(('__label__zh', '__label__ja'), array([0.99504185, 0.00500183]))
(('__label__zh', '__label__fr'), array([0.79055512, 0.09689531]))
(('__label__ja', '__label__zh'), array([0.98412997, 0.0159294 ]))
(('__label__zh', '__label__ja'), array([0.97146052, 0.02057556]))
(('__label__en', '__label__bn'), array([0.97566205, 0.0020514 ]))
(('__label__zh', '__label__ja'), array([0.99229074, 0.00777919]))
(('__label__ja', '__label__zh'), array([0.82789904, 0.17213364]))
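
To make the two outputs above easier to compare, the top-1 prediction of each model can be collected per input. The following is only a minimal sketch: `top_label` and `compare_models` are hypothetical helper names, and the only assumption about a model is that it has a `predict(text, k=...)` method returning a `(labels, probabilities)` pair, as in the outputs shown above.

```python
from typing import Dict, Iterable, List, Tuple


def top_label(prediction) -> Tuple[str, float]:
    """Return (label, probability) of the best match from a
    (labels, probabilities) pair, with the '__label__' prefix removed."""
    labels, probs = prediction
    return labels[0].replace("__label__", "", 1), float(probs[0])


def compare_models(models: Dict[str, object], texts: Iterable[str]) -> List[dict]:
    """Run every model on every text and collect the top-1 predictions.

    `models` maps a display name to any object with a
    predict(text, k=...) method (hypothetical interface, modelled on
    the calls in the reproduction script).
    """
    rows = []
    for text in texts:
        row = {"text": text}
        for name, model in models.items():
            row[name] = top_label(model.predict(text, k=1))
        rows.append(row)
    return rows


# Possible usage with the files downloaded above (not executed here):
#   import fasttext
#   models = {
#       "lid.176.bin": fasttext.load_model("lid.176.bin"),
#       "lid218e.bin": fasttext.load_model("lid218e.bin"),
#   }
#   for row in compare_models(models, ["北京大观园", "音楽イベント"]):
#       print(row)
```

Note that the two models use different label schemes (`zh` versus codes such as `eng_Latn`), so the labels have to be compared by eye or through a mapping that is not part of this sketch.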

Environment

colab

quasoft commented 6 months ago

I had the same problem with the new model: it does not detect text in Chinese (or in some other languages that use extended Unicode). I was wondering whether the new model expects input in a specific encoding; I tried UTF-8, which works only with the previous model.

It also fails to detect Chinese on the Hugging Face demo: https://huggingface.co/facebook/fasttext-language-identification

Found another issue for the same problem: https://github.com/facebookresearch/fairseq/issues/5325

Did you find any workaround @jingkang99?
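
One possible stopgap, not confirmed anywhere in this thread, is to look at the Unicode script of the input first and send mostly Han or kana text to the older model. The sketch below uses only the standard library; `cjk_fraction`, `choose_model`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the approach sidesteps the problem in lid218e.bin rather than explaining it.

```python
import unicodedata

# Unicode character-name prefixes treated as Chinese/Japanese script here.
_CJK_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA")


def cjk_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are Han or kana."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for ch in chars
        if unicodedata.name(ch, "").startswith(_CJK_PREFIXES)
    )
    return hits / len(chars)


def choose_model(text: str, threshold: float = 0.5) -> str:
    """Pick a model file name by script (assumed threshold, untuned)."""
    if cjk_fraction(text) >= threshold:
        return "lid.176.bin"
    return "lid218e.bin"


for sample in ["北京大观园", "音楽イベント", "This is a sample text in English"]:
    print(sample, cjk_fraction(sample), choose_model(sample))
```

The same check also helps rule out an encoding problem: if `cjk_fraction` is high, the string reached Python as proper Unicode text, so the misclassification happens inside the model.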