Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

Weird results when translating English to Finnish (using EasyNMT with opus-mt) #55

Open kauttoj opened 2 years ago

kauttoj commented 2 years ago

While translating English to Finnish using your model via EasyNMT, I noticed something weird. Check this code and the results.

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

text='''Religion and theology is the study of religious beliefs, concepts, symbols, expressions and texts of spirituality.
Programmes and qualifications with the following main content are classified here:
Religious history
Study of sacred books
Study of different religions
Theology
=== Inclusions
Included in this detailed field are programmes for children and young people.'''

print(model.translate(text,target_lang='fi'))

The output is:

'Uskonto ja teologia tutkivat uskonnollisia käsityksiä, käsitteitä, symboleja, ilmaisuja ja tekstejä hengellisyydestä.
Ohjelmat ja tutkinnot, joiden pääsisältö on seuraava:
Uskonnollinen historia
Pyhien kirjojen tutkiminen
Eri uskontojen tutkiminen
Teologia
Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG
Tähän yksityiskohtaiseen kenttään kuuluvat lasten ja nuorten ohjelmat.'

So "=== Inclusions" is translated into "Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG".

What is going on here? Is this a problem with the Opus-MT model or with the EasyNMT integration?

PS. The sample text is from the ESCO ontology.

jorgtied commented 2 years ago

Yes, that looks a bit weird. The model on Hugging Face does not seem to handle that kind of input well. At least a newer OPUS-MT model no longer does that. You can try it here: https://translate.ling.helsinki.fi/ui/memad (it should be from this model: https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-12-08.zip).
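
For reference, fetching and unpacking that zip from Python could look roughly like this (the local file and directory names are just placeholders):

# Rough sketch: download and unpack the newer Tatoeba eng-fin model linked above.
# The local file/directory names are arbitrary placeholders, not part of the model.
import urllib.request
import zipfile

url = "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-12-08.zip"
urllib.request.urlretrieve(url, "opusTCv20210807+bt-2021-12-08.zip")
with zipfile.ZipFile("opusTCv20210807+bt-2021-12-08.zip") as zf:
    zf.extractall("eng-fin-marian")  # raw Marian model files, before any conversion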

kauttoj commented 2 years ago

Thanks for the reply. I was able to solve the problem by using the new Tatoeba model.

In case someone has the same problem, follow these instructions to convert Tatoeba models into Hugging Face format: https://github.com/huggingface/transformers/tree/master/scripts/tatoeba
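
At the time, the linked README boiled down to cloning huggingface/transformers and running the converter script over the wanted language pairs. A rough sketch of that step from Python follows; the script path and flags are taken from that README as it was then and may have changed since, so treat them as assumptions and defer to the link above:

# Rough sketch of invoking the Tatoeba-to-Hugging-Face converter.
# The script path and flags are assumptions based on the linked scripts/tatoeba
# README at the time; see that README for the authoritative instructions.
import subprocess

subprocess.run(
    [
        "python",
        "src/transformers/models/marian/convert_marian_tatoeba_to_pytorch.py",
        "--models", "eng-fin",
        "--save_dir", "converted",
    ],
    cwd="transformers",  # a local clone of https://github.com/huggingface/transformers
    check=True,
)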

Then you can use the model with this code (copied from here):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Path to the locally converted model directory (placeholder, fill in your own)
PATH_TO_CONVERTED_MODEL = "path/to/converted/eng-fin"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_MODEL)
# Initialize the model
model = AutoModelForSeq2SeqLM.from_pretrained(PATH_TO_CONVERTED_MODEL)

# Tokenize text
text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer.prepare_seq2seq_batch([text], return_tensors='pt')

# Perform translation and decode the output
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]

# Print translated text
print(translated_text)
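
To double-check the original failure case, one could push a few of the problematic lines through the converted model in a batch. A small sketch reusing the tokenizer and model objects from above (the exact Finnish output will depend on the model version):

# Sketch: re-run part of the ESCO snippet, including the "=== Inclusions" line,
# through the converted model to confirm the hallucinated credit line is gone.
esco_lines = [
    "Religion and theology is the study of religious beliefs, concepts, symbols, expressions and texts of spirituality.",
    "=== Inclusions",
    "Included in this detailed field are programmes for children and young people.",
]
batch = tokenizer(esco_lines, return_tensors='pt', padding=True)
outputs = model.generate(**batch)
for source, target in zip(esco_lines, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(source, '->', target)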

PS. The conversion worked only for the "eng-fin" model; "fin-eng" failed with a dimension mismatch error: "raise ValueError(f"Hidden size {hidden_size} and configured size {cfg['dim_emb']} mismatched or not 512") KeyError: 'dim_emb'"