Helsinki-NLP / Tatoeba-Challenge


Batch-mode prediction #8

Closed antoine-isnardy-danone closed 1 year ago

antoine-isnardy-danone commented 3 years ago

Hi,

Thank you for providing these tremendous resources. I'm currently trying to leverage the models that were uploaded to Hugging Face (e.g. this one).

Is it expected that tokenization/generation does not work in batch mode? See the example below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
inputs = tokenizer.encode("mango manzana y pera", return_tensors="pt")
inputs

tensor([[34090, 29312, 11, 306, 75, 0]])

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
inputs = tokenizer.encode(["mango manzana y pera"], return_tensors="pt")
inputs

tensor([[1, 0]])


jorgtied commented 3 years ago

I am not sure how compatible the Hugging Face tokenizers are with the SentencePiece unigram models that we provide for the models converted to their interfaces. This would be a question to ask at Hugging Face. Good luck!
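For reference, the usual way to batch with the `transformers` library is to pass the list of sentences to the tokenizer's `__call__` (with `padding=True`) rather than to `encode`, which expects a single string. A minimal sketch, assuming the converted opus-mt-es-en model behaves like other Marian models on the hub:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

# Calling the tokenizer directly on a list (not tokenizer.encode) tokenizes
# each sentence; padding=True pads them to a rectangular tensor.
batch = tokenizer(["mango manzana y pera", "hola mundo"],
                  return_tensors="pt", padding=True)

# Generate translations for the whole batch at once.
generated = model.generate(**batch)
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(translations)
```

Here `batch_decode` strips the padding and special tokens, so `translations` is a plain list with one English string per input sentence.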