Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

Some language-specific models are not translating multi-sentence sequences #60

Closed BramVanroy closed 2 years ago

BramVanroy commented 2 years ago

Hello

I put together a quick demo for using the open-source Opus MT models from the Hugging Face hub. I quickly found that for some language pairs the model does not translate all sentences. You can reproduce the same issue when loading the models straight in Python, e.g., with these functions:

from typing import Optional, Tuple

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer

def load_mt_pipeline(model_name: str) -> Optional[Tuple[PreTrainedModel, PreTrainedTokenizer]]:
    """Load an opus-mt model, download it if it has not been installed yet."""
    try:
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return model, tokenizer
    except Exception:
        # Loading can fail, e.g., when the model name does not exist on the hub
        return None

def translate(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, src_text: str) -> str:
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    # Decode each generated sequence and join them into a single string
    translations = " ".join(tokenizer.decode(tokens, skip_special_tokens=True) for tokens in translated)
    return translations

In the demo, you'll see that the default-selected model Helsinki-NLP/opus-mt-en-nl is used to translate the English sentences "Grandma is baking cookies! I love her cookies." Unfortunately, the model only seems to translate the first sentence, producing "Oma bakt koekjes." The second part, "I love her cookies.", is not translated.
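For reference, this is roughly how the helpers above are called to reproduce the behaviour (model name and input taken from the demo):

pair = load_mt_pipeline("Helsinki-NLP/opus-mt-en-nl")
if pair is not None:
    model, tokenizer = pair
    # Expected: both sentences translated; observed: only "Oma bakt koekjes."
    print(translate(model, tokenizer, "Grandma is baking cookies! I love her cookies."))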

I verified that the tokenizer is correctly tokenizing the input, but it seems that generate stops prematurely and does not produce output for the full input. The issue does not occur for, e.g., Helsinki-NLP/opus-mt-en-fr.

Any thoughts on this?

jorgtied commented 2 years ago

To be honest, I don't know very much about the integration into the transformers library, but could the problem be that the input is not split into sentences? The models are trained to translate individual sentences (in rare cases two, or maybe three short ones), not complete texts.

BramVanroy commented 2 years ago

Thanks for the quick reply! This is indeed what Leandro von Werra of HF suggested as well. The transformers library does not do sentence splitting as a preprocessing step, so I'll have to account for that myself.

Are all opus-mt models trained on parallel sentences? If so, I'll add a sentence-splitting pre-processing step to the demo, roughly as sketched below.
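Something like this is what I have in mind: a minimal sketch that splits the input with NLTK's sent_tokenize (any sentence splitter would do) and then translates the sentences as a batch.

from nltk.tokenize import sent_tokenize  # requires a one-time nltk.download("punkt")

def translate_text(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, text: str) -> str:
    # Split the text into sentences, since the models are trained on single sentences
    sentences = sent_tokenize(text)
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    translated = model.generate(**batch)
    # Decode each translated sentence and glue them back together
    return " ".join(tokenizer.decode(tokens, skip_special_tokens=True) for tokens in translated)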

jorgtied commented 2 years ago

Yes, they are. We don't have document-level models yet.