Matthieu-Tinycoaching opened 3 years ago
Hi @Matthieu-Tinycoaching, I am also a bit confused. Indeed, a BertTokenizerFast is loaded; however, this tokenizer produces the same tokens as the xlm-roberta-base tokenizer:
```python
from transformers import AutoTokenizer

tok1 = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
tok2 = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "世界你好,你今天好吗?"
print(type(tok1))
print(tok1(text))
print(type(tok2))
print(tok2(text))
```
The input_ids are identical; the only difference is that the BertTokenizerFast also outputs token_type_ids, all zeros (which is also the model's default value for token_type_ids).
So it appears that the BertTokenizerFast class is interchangeable with the XLMRobertaTokenizerFast class here.
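To illustrate why the extra field is harmless, here is a minimal sketch (with made-up token ids, not real tokenizer output): once token_type_ids is dropped, the two encodings are identical, and the model would fall back to its all-zeros default for that input anyway.

```python
# Illustrative encodings only -- the ids below are made up, not real tokenizer output.
enc_bert = {"input_ids": [0, 6, 2], "token_type_ids": [0, 0, 0], "attention_mask": [1, 1, 1]}
enc_xlmr = {"input_ids": [0, 6, 2], "attention_mask": [1, 1, 1]}

def normalize(enc):
    """Drop token_type_ids, which the model defaults to all zeros anyway."""
    return {k: v for k, v in enc.items() if k != "token_type_ids"}

# After normalization the two tokenizer outputs are indistinguishable.
assert normalize(enc_bert) == normalize(enc_xlmr)
```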
Hi @nreimers, thanks for your feedback.
Between the two tokenizers, would XLMRobertaTokenizerFast be faster than BertTokenizerFast, since it doesn't output token_type_ids?
I think this will not make any measurable difference.
Hi,
When I load the tokenizer of the paraphrase-multilingual-MiniLM-L12-v2 model via the Hugging Face model hub, it seems to be a fast BERT tokenizer:
<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
But I am a bit confused, because when I look at the base model, https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384, it clearly states:
Which tokenizer should I use with the paraphrase-multilingual-MiniLM-L12-v2 model, and how do I load it? Thanks!