Tatoeba models outputting nonsense

Latrolage commented 1 year ago

On the Hugging Face demo (e.g. https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja?text=My+name+is+Wolfgang+and+I+live+in+Berlin), the output doesn't make sense.

I also ran some models locally, using this script:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
input_text = "犬が好きじゃない"
print("Text to translate: "+ input_text)
print("Expected translation: I don't like dogs/I dislike dogs")
for folder in "../opus-2020-06-17-pytorch", "../opus-2021-02-18-pytorch", "Helsinki-NLP/opus-mt-ja-en", "Helsinki-NLP/opus-mt-jap-en":
    print(folder)
    tokenizer = AutoTokenizer.from_pretrained(folder)
    model = AutoModelForSeq2SeqLM.from_pretrained(folder)
    tokenized = tokenizer([input_text], return_tensors='pt')
    out = model.generate(**tokenized, max_length=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Output:

❯ python translate.py
Text to translate: 犬が好きじゃない
Expected translation: I don't like dogs/I dislike dogs
../opus-2020-06-17-pytorch
□ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □ □
../opus-2021-02-18-pytorch
pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain pain mountain mountain mountain mountain mountain mountain mountain mountain mountain eighteenth eighteenth eighteenth eighteenth eighteenth king king king king king king king king king king king king king king king king king eighteenth king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king king
Helsinki-NLP/opus-mt-ja-en
I don't like dogs.
Helsinki-NLP/opus-mt-jap-en
A dog's dogs would desire a dogs a dog would desire.

Only opus-mt-ja-en gave an understandable answer. Any idea what the problem might be? The opus-mt-jap-en model doesn't produce a comprehensible translation either.

The Tatoeba models were converted to PyTorch with python -m transformers.models.marian.convert_marian_to_pytorch --src folder --dest folder-pytorch. I'm not sure how the Hugging Face link loads the model, so I don't know how to replicate that locally.
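For anyone repeating the conversion over several downloaded folders, here is a minimal sketch that wraps the same command. The helper names convert_command and convert are made up for illustration; only the module path and the --src/--dest flags come from the command quoted above.

```python
import subprocess
import sys

# Module path taken from the command quoted in this comment.
CONVERTER = "transformers.models.marian.convert_marian_to_pytorch"


def convert_command(src, dest):
    """Build the argument list for the Marian-to-PyTorch conversion (hypothetical helper)."""
    return [sys.executable, "-m", CONVERTER, "--src", str(src), "--dest", str(dest)]


def convert(src, dest):
    """Run the conversion; raises CalledProcessError if the converter exits non-zero."""
    subprocess.run(convert_command(src, dest), check=True)
```

Calling convert("opus-2021-02-18", "opus-2021-02-18-pytorch") would then run the converter with the same interpreter that runs the script, assuming transformers is installed in that environment.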

droussis commented 1 year ago

This seems to be the case with all of their models that originate from the Tatoeba Challenge. Only the models that are included here seem to work with Hugging Face. Up until a month ago, I hadn't encountered such problems.

ArthurZucker commented 1 year ago

Thanks for reporting, I'll try to check if the tokenizer or the model is wrong.

ArthurZucker commented 1 year ago

Hey! You should use model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-tatoeba-en-ja", revision = "refs/pr/3"). This is indeed related to an update in the library, but a fix has been opened for all of the models online, such as this one: https://huggingface.co/Helsinki-NLP/opus-tatoeba-en-ja/discussions/3
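Put together with the script from the first comment, that could look roughly like the sketch below. The function names load_pinned and translate are hypothetical. The tokenizer is loaded from the default branch, which is an assumption, since only the model is pinned in the line above.

```python
def load_pinned(model_id, revision):
    """Load the tokenizer from the default branch and the model from a specific Hub revision."""
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, revision=revision)
    return tokenizer, model


def translate(text, tokenizer, model, max_length=128):
    """Translate one sentence, mirroring the generate/decode calls from the original script."""
    batch = tokenizer([text], return_tensors="pt")
    out = model.generate(**batch, max_length=max_length)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Usage would be tokenizer, model = load_pinned("Helsinki-NLP/opus-tatoeba-en-ja", "refs/pr/3") followed by translate("My name is Wolfgang and I live in Berlin", tokenizer, model). The PR number differs per model repository, so it has to be looked up in each model's discussions tab.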

Latrolage commented 10 months ago

Are the opus-mt-xx-xx models a different issue? I tried just now on both older and newer versions of transformers and haven't gotten them to work. https://huggingface.co/Helsinki-NLP/opus-mt-jap-en?text=%E7%8A%AC%E3%81%8C%E5%A5%BD%E3%81%8D%E3%81%98%E3%82%83%E3%81%AA%E3%81%84

jorgtied commented 10 months ago

Note that jap is not Japanese
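To avoid picking the wrong model by name, a small check on the language code in the model id can help. This is only a sketch: the helper names are hypothetical, and it assumes simple ids of the form opus-mt-SRC-TGT or opus-tatoeba-SRC-TGT. The codes for Japanese are ja (ISO 639-1) and jpn (ISO 639-3).

```python
JAPANESE_CODES = {"ja", "jpn"}  # ISO 639-1 and ISO 639-3 codes for Japanese


def language_pair(model_id):
    """Split e.g. 'Helsinki-NLP/opus-mt-ja-en' into ('ja', 'en').

    Only handles the simple naming patterns assumed above.
    """
    name = model_id.split("/")[-1]
    for prefix in ("opus-mt-", "opus-tatoeba-"):
        if name.startswith(prefix):
            src, tgt = name[len(prefix):].split("-", 1)
            return src, tgt
    raise ValueError(f"unrecognized model id: {model_id}")


def source_is_japanese(model_id):
    """True only if the source code is one of the Japanese codes; 'jap' is not."""
    return language_pair(model_id)[0] in JAPANESE_CODES
```

With this, source_is_japanese("Helsinki-NLP/opus-mt-jap-en") is False, while the same check on "Helsinki-NLP/opus-mt-ja-en" is True.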

Latrolage commented 10 months ago

That makes more sense. I also tried the opus-2021-02-18 model at https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/jpn-eng, and it seems that my issue there is related to https://github.com/Helsinki-NLP/Tatoeba-Challenge/issues/2#issuecomment-867928524
