UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Tokenizer of paraphrase-multilingual-MiniLM-L12-v2 #1010

Open Matthieu-Tinycoaching opened 3 years ago

Matthieu-Tinycoaching commented 3 years ago

Hi,

When I load the tokenizer of the paraphrase-multilingual-MiniLM-L12-v2 model via the Hugging Face model hub:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print(type(tokenizer))

It seems that it's a fast BERT tokenizer: <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>

But I am a bit confused, because the page for the base model, https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384, clearly states:

Please note: This checkpoint uses BertModel with XLMRobertaTokenizer so AutoTokenizer won't work with this checkpoint!
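
If I read that card correctly, the base checkpoint is meant to be loaded with the tokenizer class named explicitly, roughly like this (a sketch based on my reading of the card; I have not run it):

from transformers import AutoModel, XLMRobertaTokenizer

# The base checkpoint pairs a BERT-architecture model with the XLM-R
# sentencepiece tokenizer, hence the explicit tokenizer class:
model = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")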

Which tokenizer should I use then with the paraphrase-multilingual-MiniLM-L12-v2 model, and how should I load it?

Thanks!

nreimers commented 3 years ago

Hi @Matthieu-Tinycoaching, I am also a bit confused. Indeed, a BertTokenizerFast is loaded; however, this tokenizer produces the same tokens as the xlm-roberta-base tokenizer:

from transformers import AutoTokenizer

tok1 = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
tok2 = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "世界你好,你今天好吗?"
print(type(tok1))  # BertTokenizerFast
print(tok1(text))
print(type(tok2))  # XLMRobertaTokenizerFast
print(tok2(text))

The input_ids are identical; the only difference is that the BertTokenizerFast also outputs token_type_ids, all zeros (which is also the model's default value for token_type_ids).
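
You can check this programmatically, continuing from the snippet above (single-sentence input, so the encodings are plain lists):

enc1, enc2 = tok1(text), tok2(text)
assert enc1["input_ids"] == enc2["input_ids"]   # identical token ids
assert set(enc1["token_type_ids"]) == {0}       # BERT-style segment ids, all zero
assert "token_type_ids" not in enc2             # the XLM-R tokenizer omits them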

So it appears that the BertTokenizerFast class is interchangeable with the XLMRobertaTokenizerFast class for this checkpoint.
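
One way to verify this end to end would be to run both encodings through the transformer and compare the outputs (a sketch, assuming the checkpoint loads as a plain AutoModel):

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
tok1 = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
tok2 = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "世界你好,你今天好吗?"
with torch.no_grad():
    out1 = model(**tok1(text, return_tensors="pt"))
    out2 = model(**tok2(text, return_tensors="pt"))

# If the tokenizers are interchangeable, the hidden states should match,
# since the missing token_type_ids default to all zeros inside the model.
print(torch.allclose(out1.last_hidden_state, out2.last_hidden_state))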

Matthieu-Tinycoaching commented 3 years ago

Hi @nreimers thanks for your feedback.

Between the two tokenizers, would XLMRobertaTokenizerFast be faster than BertTokenizerFast, since it doesn't output token_type_ids?

nreimers commented 3 years ago

I think this will not make a difference at all.
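
If you want to measure it yourself, a rough micro-benchmark could look like this (a sketch; absolute numbers depend on your machine):

import timeit
from transformers import AutoTokenizer

tok1 = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
tok2 = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Producing the extra token_type_ids is trivial next to the subword
# segmentation itself, so the two timings should be nearly identical.
sentences = ["世界你好,你今天好吗?"] * 1000
print("BertTokenizerFast:", timeit.timeit(lambda: tok1(sentences), number=10))
print("XLMRobertaTokenizerFast:", timeit.timeit(lambda: tok2(sentences), number=10))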