Closed BramVanroy closed 2 years ago
To be honest I don't know very much about the integration into the transformers library but could the problem be that the input is not split into sentences? The models are trained to translate individual sentences (in rare cases a couple of sentences or maybe three short ones) and not complete texts.
Thanks for the quick reply! This was indeed an explanation that Leandro von Werra of HF suggested. The transformers library does not do sentence splitting as a preprocessing step, so I'll have to adjust for that.
Are all opus-mt models trained on parallel sentences? If so, I'll add a sentence-splitting pre-processing step to the demo.
Yes, they are. We don't have document-level models yet.
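For reference, a sentence-splitting pre-processing step along those lines could look like the sketch below. The regex splitter and the `translate_fn` hook are illustrative only (a real demo would likely use a proper sentence splitter such as nltk or spaCy); they are not part of the actual demo code.

```python
import re

def split_sentences(text):
    # Naive splitter: break after sentence-final punctuation followed by
    # whitespace. Good enough for a sketch; not robust to abbreviations etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_by_sentence(text, translate_fn):
    # Opus-MT models are trained on individual parallel sentences, so
    # translate each sentence separately and rejoin the results.
    return " ".join(translate_fn(s) for s in split_sentences(text))
```

With the Hugging Face translation pipeline, `translate_fn` could be something like `lambda s: translator(s)[0]["translation_text"]`.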
Hello
I put together a quick demo for using the open-source Opus MT models from the Hugging Face hub. I quickly found that for some languages, the model is not translating all sentences. You can reproduce the same issue when loading the models straight in Python, e.g., with these functions
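The exact functions from the demo are not reproduced here, but a minimal sketch of that kind of loading code, assuming the standard `MarianMTModel`/`MarianTokenizer` API from `transformers`, would be:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name="Helsinki-NLP/opus-mt-en-nl"):
    # Load the Opus-MT model and tokenizer from the Hugging Face hub,
    # then generate a translation for each input string.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

if __name__ == "__main__":
    # Downloads the model on first use.
    print(translate(["Grandma is baking cookies! I love her cookies."]))
```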
In the demo, you'll see that the default-selected model `Helsinki-NLP/opus-mt-en-nl` is used to translate the English sentences "Grandma is baking cookies! I love her cookies." Unfortunately, the model only seems to translate the first one, into "Oma bakt koekjes." The second part, "I love her cookies.", is not translated.

I verified that the tokenizer is correctly tokenizing the input, but it seems that `generate` is not producing output for all of the input. It is stopping prematurely. The issue does not occur for, e.g., `Helsinki-NLP/opus-mt-en-fr`.

Any thoughts on this?