Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model #81

Open regpath opened 2 years ago


The translation results from English to Korean using the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model do not make sense at all:

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-big-en-ko"

src_text = [
    "2, 4, 6 etc. are even numbers.",
    "Yes.",
]

# Load the tokenizer and model from the same checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize, translate, and decode
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

The output is not ['2, 4, 6 등은 짝수입니다.', '그래'] ("2, 4, 6 etc. are even numbers.", "Yeah") as in the model card example, but ['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.'] ("They,we,we are hats. It is reliable.", "It is ATP."), which does not make sense at all.

I tried some more sentences and believe that shipping the correct tokenizer or vocab file would fix this problem. Could you take a look?
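For what it's worth, the failure mode looks like an id/vocab mismatch: if the SentencePiece vocab the tokenizer loads is not the one the model was trained with, the decoded ids map to fluent-looking but unrelated tokens. A minimal sketch of that effect, using toy vocabularies (all names and mappings below are illustrative, not the checkpoint's real files):

```python
# Toy illustration of a tokenizer/vocab mismatch: encode with one
# vocabulary, decode with a differently ordered one. The result is
# plausible-looking but unrelated text, like the garbled Korean above.
encode_vocab = {"Yes": 0, ".": 1, "even": 2, "numbers": 3}
decode_vocab = {0: "ATP", 1: "입니다", 2: "우리는", 3: "모자"}  # ids shifted out of sync

def encode(words, vocab):
    # Map each surface token to its id under the encoding vocab
    return [vocab[w] for w in words]

def decode(ids, vocab):
    # Map each id back to a token under the (mismatched) decoding vocab
    return " ".join(vocab[i] for i in ids)

ids = encode(["Yes", "."], encode_vocab)
print(decode(ids, decode_vocab))  # → ATP 입니다
```

So a diff between the vocab file in this checkpoint and the one used at training time would likely confirm the bug.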