UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
Apache License 2.0
1.18k stars, 118 forks

Mistranslations - repeated output #38

Closed: alexwilson1 closed this issue 3 years ago

alexwilson1 commented 3 years ago

Hey team,

Thank you again for the great library.

Today we translated 'id' (Indonesian) sentences and quite a few of them came out as variants of "I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry" even though they did not mention 'sorry' in the text.

Any idea why this might be happening? Could it be because I'm not splitting the text into sentences before passing it to the 'translate_sentences' function?

Thanks!

nreimers commented 3 years ago

Unfortunately this can happen; it is called hallucination, and there is no easy fix. I found that it happens more often with the Opus-MT models than with the other models, especially when the input is noisy, i.e. not a clean, well-formed sentence.
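Since there is no fix on the model side, one workaround is to flag suspicious outputs after translation and handle them separately. Below is a minimal sketch of such a check, not something EasyNMT provides: the function name `looks_repetitive` and its thresholds are assumptions chosen for illustration, and it only catches the looping kind of hallucination described in this issue.

```python
from collections import Counter


def looks_repetitive(text: str, n: int = 2, min_count: int = 3,
                     min_share: float = 0.3) -> bool:
    """Heuristic check for looping output (hypothetical helper).

    Returns True when the most frequent word n-gram occurs at least
    `min_count` times and makes up at least `min_share` of all n-grams.
    """
    tokens = text.lower().split()
    # Build all consecutive word n-grams of the text.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text is shorter than n words; nothing to judge.
        return False
    top_count = Counter(ngrams).most_common(1)[0][1]
    return top_count >= min_count and top_count / len(ngrams) >= min_share
```

Flagged translations could then be retried with a different model or passed to manual review. The thresholds are guesses and would need tuning on real data, since legitimately repetitive text will also trigger the check.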

alexwilson1 commented 3 years ago

Got it, thank you for letting me know! You are correct: it occurs most frequently when the input is malformed (e.g. a mix of languages or non-standard punctuation).

Perhaps sentence-level language detection could help in some cases, as could normalising punctuation (although doing that for all languages will be difficult). I'll try a couple of things out, but will close the issue for now. Thanks!
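For anyone who wants to try the same two ideas, here is a rough sketch of what the preprocessing might look like. Everything in it is an assumption for illustration: `normalise_punctuation` and `keep_expected_language` are hypothetical helpers, the punctuation map covers only a few common characters, and `detect` stands for whatever language detector you plug in (it is not a specific library call).

```python
import unicodedata
from typing import Callable, Iterable, List

# A few typographic characters mapped to plain ASCII (far from complete).
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def normalise_punctuation(text: str) -> str:
    """Fold compatibility characters, replace typographic punctuation,
    and collapse runs of whitespace (hypothetical helper)."""
    # NFKC turns e.g. full-width "!" and the ellipsis character into ASCII.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_PUNCT_MAP)
    return " ".join(text.split())


def keep_expected_language(sentences: Iterable[str], expected_lang: str,
                           detect: Callable[[str], str]) -> List[str]:
    """Keep only sentences that `detect` labels as `expected_lang`.

    `detect` is any caller-supplied function returning a language code.
    """
    return [s for s in sentences if detect(s) == expected_lang]
```

Sentences that fail the language check could be skipped or routed to a different source language before translation. Whether this actually reduces the repeated output is untested here.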