Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Issue with it-en model #52

Open avostryakov opened 3 years ago

avostryakov commented 3 years ago

In rare cases, sentences translated from Italian to English come out with "(Translated with Google Translate)" appended to the end of the output. For example, the following Italian input produces it in the English output: Gli oggetti ordinati sono arrivati in tempi piu'che rapidi e tutto anche piu'bello dal vivo....perfetto!! Ho fatto alcuni acquistati da Mano Mano mi sono arrivati in tempi brevi e senza alcun problema,auguri grazie di ❤️!!!! grazie. fino ad adesso buoni prodotti... speriamo anche il prossimo!

I guess the training data contains this kind of noise in the translation pairs of some datasets. I suggest removing "(Translated with Google Translate)" from the English sentences in the training data as part of the preprocessing pipeline.
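A minimal sketch of what such a preprocessing filter could look like, assuming the training data is available as (source, target) string pairs. The names `MARKER`, `strip_marker`, and `clean_pairs` are hypothetical and not part of OPUS-MT-train, and the pattern only covers the exact phrase reported here (case-insensitive, tolerant of extra whitespace):

```python
import re

# Assumption: the noise always has the form "(Translated with Google Translate)".
# Other variants of the phrase are not covered by this pattern.
MARKER = re.compile(
    r"\s*\(\s*translated\s+with\s+google\s+translate\s*\)\s*",
    re.IGNORECASE,
)


def strip_marker(text: str) -> str:
    """Remove the marker wherever it occurs and trim surrounding whitespace."""
    return MARKER.sub(" ", text).strip()


def clean_pairs(pairs):
    """Yield (source, target) pairs with the marker stripped from the target.

    Pairs whose target is empty after stripping are dropped.
    """
    for src, tgt in pairs:
        cleaned = strip_marker(tgt)
        if cleaned:
            yield src, cleaned
```

Whether to strip the marker or to drop the affected pairs entirely is a design choice; stripping keeps more data, while dropping would be safer if the marked pairs are themselves low-quality machine translations.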

I'm talking about this model: models/it-en/

jorgtied commented 3 years ago

Thanks a lot for the feedback. I'll try to add a filter. Is it a big and frequent problem? I cannot promise to update all models immediately. Please let me know if you see further issues that we could address in new releases. In the meantime, could you check whether this model has the same problem: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/ita-eng Thanks!

avostryakov commented 3 years ago

> Is it a big and frequent problem?

No, it's rare. And if I remove the "!" from the end of the sentences, the marker disappears from the English output. I'll let you know if we find this issue in other models.

avostryakov commented 3 years ago

@jorgtied We didn't find this problem in the other models we use from the Helsinki-NLP/Tatoeba-Challenge project: por-eng, fr-en, spa-eng, fi-en, da-en, nl-en, so-en, no-en, el-en (maybe nn-en/nb-en, I don't remember exactly now). Only the it-en model is affected.