Open kauttoj opened 2 years ago
Yes, that looks a bit weird. The model on Hugging Face does not seem to handle that kind of input well. A newer OPUS-MT model, at least, no longer does that. You can try it here: https://translate.ling.helsinki.fi/ui/memad (it should be running this model: https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-12-08.zip)
Thanks for the reply. I was able to solve the problem by using the new Tatoeba model.
In case someone has the same problem, follow these instructions to convert Tatoeba models into Hugging Face format: https://github.com/huggingface/transformers/tree/master/scripts/tatoeba
Then you can use the model with this code (copied from here):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_MODEL)
# Initialize the model
model = AutoModelForSeq2SeqLM.from_pretrained(PATH_TO_CONVERTED_MODEL)
# Tokenize text
text = "Hello my friends! How are you doing today?"
# Call the tokenizer directly; prepare_seq2seq_batch is deprecated in recent transformers releases
tokenized_text = tokenizer([text], return_tensors='pt', padding=True)
# Perform translation and decode the output
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
# Print translated text
print(translated_text)
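PS. Conversion worked only for the "eng-fin" model. The "fin-eng" conversion failed in the hidden-size check: the traceback ends at the line raise ValueError(f"Hidden size {hidden_size} and configured size {cfg['dim_emb']} mismatched or not 512") with KeyError: 'dim_emb', so the script apparently could not find 'dim_emb' in the model config while building that error message.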
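The reported failure is a KeyError even though the quoted line raises a ValueError. A minimal sketch of how that can happen in plain Python: the f-string is evaluated before the ValueError is constructed, so a missing dictionary key surfaces first. The function names, the cfg dictionary, and the "dim-emb" key are assumptions made up for illustration; they are not taken from the conversion script.

```python
def check_hidden_size(hidden_size, cfg):
    """Hypothetical size check whose error message looks up a key that may be absent."""
    if hidden_size != cfg.get("dim-emb"):
        # The f-string is evaluated first; if 'dim_emb' is missing,
        # Python raises KeyError before ValueError is ever created.
        raise ValueError(
            f"Hidden size {hidden_size} and configured size {cfg['dim_emb']} mismatched"
        )


def failure_type(hidden_size, cfg):
    """Return the name of the exception raised by the check, or None if it passes."""
    try:
        check_hidden_size(hidden_size, cfg)
    except Exception as exc:
        return type(exc).__name__
    return None


cfg = {"dim-emb": 512}  # assumed config: hyphenated key only
print(failure_type(512, cfg))   # sizes agree, so the check passes
print(failure_type(1024, cfg))  # sizes differ, but the message lookup fails first
```

If this is what happens in the converter, the KeyError hides the real problem, which would be a hidden-size mismatch for the "fin-eng" model.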
While translating English to Finnish using your model via EasyNMT, I noticed something weird. Check this code and the results.
The output is:
So "=== Inclusions" is translated into "Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG".
What is going on here? Is this a problem with the OPUS-MT model or its EasyNMT implementation?
PS. The sample text is from the ESCO ontology.
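One possible workaround, sketched under assumptions: keep markup-like prefixes such as "===" out of the text sent to the model and translate only the words. The `translate` argument is a hypothetical stand-in for whatever translation function is in use (for example, a small wrapper around the translation library); it is not an EasyNMT API, and `stub_translate` below only tags its input so the example runs without a model.

```python
import re

# Leading run of characters that are neither letters nor digits, e.g. "=== ".
_PREFIX = re.compile(r"^[\W_]+")


def translate_keeping_markup(line, translate):
    """Translate only the text after a leading markup prefix and reattach the prefix."""
    match = _PREFIX.match(line)
    prefix = match.group(0) if match else ""
    body = line[len(prefix):]
    if not body.strip():
        # Nothing but markup or whitespace: return the line unchanged.
        return line
    return prefix + translate(body)


def stub_translate(text):
    """Placeholder for a real translation call; it only marks the text it received."""
    return f"<fi:{text}>"


print(translate_keeping_markup("=== Inclusions", stub_translate))
```

Whether this avoids the odd output for a given model would still need to be checked against real translations.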