Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

model Helsinki-NLP/opus-mt-en-uk translates some sentences into Russian instead of Ukrainian #66

Open shyrma opened 2 years ago

shyrma commented 2 years ago

Code to reproduce:

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-uk"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Tokenize the English source sentences as one padded batch.
batch = tokenizer(["What are you doing?", "Good news for you."], return_tensors="pt", padding=True)
gen = model.generate(**batch)
result = tokenizer.batch_decode(gen, skip_special_tokens=True)
print(result)

Expected output (Ukrainian): ['Що ти робиш?', 'Гарні новини для тебе.']

Actual output (Russian): ['Что ты делаешь?', 'Хорошая новость для тебя.']

jorgtied commented 2 years ago

It could be that the training data is not very clean and includes quite a bit of Russian. Do you know whether the data in OPUS has many such problems? If you happen to see a lot of noise in any of the underlying data collections, please let me know. Thanks!
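
One rough way to look for this kind of noise is to scan the Ukrainian side of a corpus (or the model's output) for letters that exist in only one of the two alphabets. The sketch below is a minimal heuristic, not part of OPUS-MT-train. The function names `guess_language` and `russian_looking` are hypothetical, and the letter sets are an assumption about what separates the two languages well enough for a first pass.

```python
# Letters found in only one of the two alphabets (lower and upper case).
# Ukrainian-only: і ї є ґ   Russian-only: ы э ъ ё
UKRAINIAN_ONLY = set("\u0456\u0457\u0454\u0491\u0406\u0407\u0404\u0490")
RUSSIAN_ONLY = set("\u044b\u044d\u044a\u0451\u042b\u042d\u042a\u0401")


def guess_language(text: str) -> str:
    """Return 'uk', 'ru', or 'unknown' by counting alphabet-specific letters."""
    uk_hits = sum(ch in UKRAINIAN_ONLY for ch in text)
    ru_hits = sum(ch in RUSSIAN_ONLY for ch in text)
    if uk_hits > ru_hits:
        return "uk"
    if ru_hits > uk_hits:
        return "ru"
    # No distinctive letters, or a tie: the heuristic cannot decide.
    return "unknown"


def russian_looking(lines):
    """Keep only the lines that the heuristic classifies as Russian."""
    return [line for line in lines if guess_language(line) == "ru"]


if __name__ == "__main__":
    # The four sentences quoted in this issue.
    samples = [
        "Що ти робиш?",
        "Гарні новини для тебе.",
        "Что ты делаешь?",
        "Хорошая новость для тебя.",
    ]
    for sentence in samples:
        print(guess_language(sentence), sentence)
```

Two of the four sentences from this issue contain no distinctive letter at all, so they come back as `unknown`. That makes the check useful only for a quick estimate over many lines. A proper language-identification tool would be needed to filter the data reliably.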