UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
Apache License 2.0

Chinese language variants #45

Open fmichaelkunz opened 3 years ago

fmichaelkunz commented 3 years ago

EasyNMT uses fastText to identify the source language. Some Chinese phrases are misidentified as Chinese variants such as 'yue' or 'wuu', which causes EasyNMT to fail. Could you map the Chinese variant codes 'yue', 'wuu', and 'min' to 'zh'?

nreimers commented 3 years ago

Hi @fmichaelkunz, thanks for pointing this out.

When you know the language, you can always provide the source_lang parameter. In that case, no automatic language detection is performed.

Otherwise, you can override the language detector:

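from easynmt import EasyNMT

model = EasyNMT(...)

def my_lang_detection(text):
    # Run the default fastText detector, then fold Chinese variants into 'zh'
    lang = model.language_detection_fasttext(text)
    if lang in ['yue', 'wuu', 'min']:
        lang = 'zh'
    return lang

# Use only the custom detector for automatic language detection
model._lang_detectors = [my_lang_detection]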
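
If more codes need remapping later, the same idea can be written as a small wrapper around any detector function. This is only a sketch: `with_lang_aliases`, `LANG_ALIASES`, and `fake_detector` are hypothetical names that are not part of EasyNMT, and the stand-in detector exists only so the snippet runs without downloading a model. One caveat worth checking against your data: in ISO 639-3, `min` is the code for Minangkabau, so you may want to drop it from the mapping.

```python
from typing import Callable, Dict, Optional

# Hypothetical alias table using the codes from this thread.
# Note: ISO 639-3 assigns 'min' to Minangkabau; remove it if that matters for your data.
LANG_ALIASES: Dict[str, str] = {"yue": "zh", "wuu": "zh", "min": "zh"}


def with_lang_aliases(
    detect: Callable[[str], str],
    aliases: Optional[Dict[str, str]] = None,
) -> Callable[[str], str]:
    """Wrap a detector so that selected language codes are remapped."""
    mapping = LANG_ALIASES if aliases is None else aliases

    def wrapped(text: str) -> str:
        lang = detect(text)
        # Return the alias if one is defined, otherwise the detected code.
        return mapping.get(lang, lang)

    return wrapped


def fake_detector(text: str) -> str:
    """Stand-in for a real detector, used only for this demonstration."""
    return "yue" if "嘅" in text else "en"


detector = with_lang_aliases(fake_detector)
print(detector("我嘅書"))   # detected as 'yue', remapped to 'zh'
print(detector("my book"))  # 'en' has no alias and passes through
```

With a real model, the wrapped detector would presumably be installed the same way as above, for example model._lang_detectors = [with_lang_aliases(model.language_detection_fasttext)].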