This model is not yet integrated into Hugging Face. The models here are the original MarianNMT models, and you can use them with the marian-decoder. The conversion to PyTorch and Hugging Face will come in the near future ...
@jingmouren Actually, you can use it right now with the following steps:
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("converted-spa-en")
model = AutoModelWithLMHead.from_pretrained("converted-spa-en")

sents = ["Hola, mundo."]  # example Spanish inputs for the spa-en model
batch = tokenizer.prepare_translation_batch(src_texts=sents)
gen = model.generate(**batch)  # for a plain forward pass: model(**batch)
translated = tokenizer.batch_decode(gen, skip_special_tokens=True)
Slightly updated code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# model_path: path to your converted model folder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

batch = tokenizer.prepare_seq2seq_batch(src_texts=["Alice has a cat."])
# the batch may come back as plain lists, so convert to tensors for generate()
batch['input_ids'] = torch.tensor(batch['input_ids'])
batch['attention_mask'] = torch.tensor(batch['attention_mask'])
gen = model.generate(**batch)
translated = tokenizer.batch_decode(gen, skip_special_tokens=True)
print(translated)
But for me it didn't generate any text after the model conversion.
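If the batch comes back as plain lists, one way to skip the manual torch.tensor(...) conversion is to ask the tokenizer for PyTorch tensors directly. A minimal sketch, assuming your transformers version supports the return_tensors argument on prepare_seq2seq_batch:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_path)  # model_path: your converted folder
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# return_tensors="pt" yields torch tensors, so no manual conversion is needed
batch = tokenizer.prepare_seq2seq_batch(src_texts=["Alice has a cat."], return_tensors="pt")
gen = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))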
Do you mind sharing the command you used for converting? For some reason, I cannot convert opus-mt-he-en :( no matter what I try.
In theory, this should work:
python -m transformers.models.marian.convert_marian_to_pytorch --src my_source_folder --dest my_destination_folder
And indeed it does convert the model and output it to the destination folder. After this, I am able to load the model into PyTorch with
tokenizer = AutoTokenizer.from_pretrained(my_destination_folder)
model = AutoModelForSeq2SeqLM.from_pretrained(my_destination_folder)
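For reference, a minimal generation attempt with the loaded model looks like this (a sketch, reusing the tokenizer and model from above and calling the tokenizer directly):

batch = tokenizer(["Hello world."], return_tensors="pt", padding=True)
gen = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))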
However, similarly to @djstrong, it doesn't actually generate any text. If you go into the destination folder and inspect the file tokenizer_config.json, you will notice that the conversion script failed to record the source and target languages.
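You can check this directly; a minimal sketch, with my_destination_folder as above (the exact field names in the config may differ by transformers version):

import json
import os

with open(os.path.join(my_destination_folder, "tokenizer_config.json")) as f:
    config = json.load(f)
print(config)  # the source/target language entries the script should have written are missing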
Presumably we need to run convert_marian_tatoeba_to_pytorch.py instead. However, after cloning the Tatoeba-Challenge repository and trying to use that conversion script, I am just met with the following error:
KeyError: 'pre-processing'
This is because it tried to parse preprocessing details from a README.md file, and presumably something failed there for the language pair I supplied as arguments (eng-swe). I don't really have the energy to troubleshoot this issue. It would be nice if someone could give a working example instead of expecting users to read source code to understand which files are supposed to be located in which folders for this to actually work.
It can't be loaded with model = MarianMTModel.from_pretrained('Helsinki-NLP/zho-eng'), unlike 'Helsinki-NLP/opus-mt-en-ROMANCE', which is listed on the huggingface/transformers page.
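For comparison, the opus-mt-en-ROMANCE checkpoint mentioned above does load directly from the hub. A minimal sketch; note that this multilingual model expects a >>xx<< target-language token prefixed to each source sentence:

from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-ROMANCE"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

# the prefix selects the target language on the multilingual target side
src = [">>fr<< Alice has a cat."]
batch = tokenizer(src, return_tensors="pt", padding=True)
gen = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))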