Helsinki-NLP / Tatoeba-Challenge


How to load and use the model? #2

Closed jingmouren closed 4 years ago

jingmouren commented 4 years ago

The model can't be loaded with model = MarianMTModel.from_pretrained('Helsinki-NLP/zho-eng') the way 'Helsinki-NLP/opus-mt-en-ROMANCE' (listed on the huggingface/transformers page) can be.

jorgtied commented 4 years ago

This model is not yet integrated into Hugging Face. The models here are the original MarianNMT models, and you can use them with marian-decoder. The conversion to PyTorch and Hugging Face will come in the near future ...

avostryakov commented 4 years ago

@jingmouren Actually, you can use it right now with the following steps:

djstrong commented 3 years ago

A slightly updated version of the code:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# prepare_seq2seq_batch returns plain lists here, so convert them to tensors
batch = tokenizer.prepare_seq2seq_batch(src_texts=["Alice has a cat."])
batch['input_ids'] = torch.tensor(batch['input_ids'])
batch['attention_mask'] = torch.tensor(batch['attention_mask'])

gen = model.generate(**batch)

translated = tokenizer.batch_decode(gen, skip_special_tokens=True)
translated

but for me it didn't generate any text after model conversion.
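For what it's worth, on recent transformers versions prepare_seq2seq_batch is deprecated, and the manual torch.tensor(...) conversion becomes unnecessary if you ask the tokenizer for tensors directly. A minimal sketch of that pattern (model_path is still a placeholder for whatever folder the conversion produced; this does not by itself fix the empty-output problem, which seems to come from the conversion step):

```python
def translate(model_path, texts):
    """Translate texts with a converted Marian model (sketch, not a verified fix)."""
    # Imports live inside the function so the sketch can be defined
    # even where transformers/torch are not installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    # return_tensors="pt" replaces the manual torch.tensor(...) conversion above
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    gen = model.generate(**batch)
    return tokenizer.batch_decode(gen, skip_special_tokens=True)
```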

velocityCavalry commented 3 years ago

> but for me it didn't generate any text after model conversion.

Do you mind sharing the command you used for converting? For some reason, I can't convert opus-mt-he-en no matter what I try. :(

Lauler commented 3 years ago

In theory this should work

python -m transformers.models.marian.convert_marian_to_pytorch --src my_source_folder --dest my_destination_folder

And indeed it does convert the model and write it to the destination folder. After this I am able to load the model in PyTorch with

tokenizer = AutoTokenizer.from_pretrained(my_destination_folder)
model = AutoModelForSeq2SeqLM.from_pretrained(my_destination_folder)

However, similarly to @djstrong, it doesn't actually generate any text. If you go into the destination folder and inspect tokenizer_config.json, you will notice the conversion script has failed to record the source and target languages.
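If the missing language metadata is the only thing the conversion got wrong, one possible workaround is to patch tokenizer_config.json by hand. A sketch, assuming the file uses source_lang/target_lang keys (that key naming is an assumption here; inspect your own converted folder before relying on it):

```python
import json
from pathlib import Path


def patch_langs(dest_folder, src_lang, tgt_lang):
    """Fill in missing source/target language fields in a converted model folder.

    The 'source_lang'/'target_lang' key names are an assumption; check the
    actual keys in your own tokenizer_config.json first.
    """
    cfg_path = Path(dest_folder) / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["source_lang"] = src_lang
    cfg["target_lang"] = tgt_lang
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg
```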

Presumably we need to run convert_marian_tatoeba_to_pytorch.py instead. However, after cloning the Tatoeba-Challenge repository and trying that conversion script, I am just met with the following error:

KeyError: 'pre-processing'

This is because it tried to parse preprocessing details from a README.md file, and presumably something failed there for the language pair I supplied as arguments (eng-swe). I don't really have the energy to troubleshoot this issue. It would be nice if someone could give a working example instead of expecting users to read source code to understand which files need to be in which folders for this thing to actually work.
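The KeyError suggests the script expects a 'pre-processing' field when it parses the language pair's README.md. Before running the converter, you can at least check whether the README for your pair contains such a line. A rough stdlib check (the exact field format is an assumption based on the error message, not on the script's actual parser):

```python
def has_preprocessing_entry(readme_text):
    """Return True if any line of the README mentions a 'pre-processing' field."""
    return any("pre-processing" in line.lower() for line in readme_text.splitlines())
```

If this returns False for your pair's README, the conversion script will likely fail with the same KeyError.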