shyrma opened 2 years ago
It could be that the training data is not very clean and includes quite a bit of Russian. Would you know if there are lots of problems with the data in OPUS? If you happen to see a lot of noise in any of the underlying data collections then, please, let me know. Thanks!
Code to reproduce (imports added for completeness):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-uk"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
batch = tokenizer(["What are you doing?", "Good news for you."], return_tensors="pt", padding=True)
gen = model.generate(**batch)
result = tokenizer.batch_decode(gen, skip_special_tokens=True)
print(result)
```
Expected output (Ukrainian):
['Що ти робиш?', 'Гарні новини для тебе.']
Actual output (Russian):
['Что ты делаешь?', 'Хорошая новость для тебя.']
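One way to flag this behaviour automatically when checking outputs in bulk is a character-level heuristic: Ukrainian orthography uses the letters і, ї, є, ґ, which do not occur in Russian, while Russian uses ы, э, ъ, ё, which do not occur in Ukrainian. The sketch below is not part of the issue or the model; the function name and thresholds are hypothetical, and short sentences with only shared Cyrillic letters remain ambiguous.

```python
# Hypothetical helper, not from the issue: classify a Cyrillic string as
# likely Ukrainian or Russian by counting script-specific letters.
UK_ONLY = set("іїєґІЇЄҐ")  # letters used in Ukrainian but not Russian
RU_ONLY = set("ыэъёЫЭЪЁ")  # letters used in Russian but not Ukrainian

def likely_lang(text: str) -> str:
    uk = sum(ch in UK_ONLY for ch in text)
    ru = sum(ch in RU_ONLY for ch in text)
    if uk > ru:
        return "uk"
    if ru > uk:
        return "ru"
    return "unknown"  # no distinguishing letters found

print(likely_lang("Что ты делаешь?"))        # ru
print(likely_lang("Гарні новини для тебе."))  # uk
print(likely_lang("Що ти робиш?"))            # unknown (only shared letters)
```

Running this over a sample of model outputs (or over the OPUS training data itself) would give a rough estimate of how much Russian has leaked into the en-uk pair.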