atreyasha opened this issue 1 year ago
Just to add, the HF tokenizer for `sentence-transformers/msmarco-roberta-base-ance-firstp` does not perform lowercasing:

```python
from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-roberta-base-ance-firstp")
print(tokenizer(["What is this?"]))
# >>> {'input_ids': [[0, 2264, 16, 42, 116, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
# the input_ids decode back to "What is this?" (case preserved)
```
Description
I was experimenting with the `sentence-transformers/msmarco-roberta-base-ance-firstp` model and observed discrepancies in the tokenizer output depending on how the tokenizer was called. It appears that calling `roberta_ance.tokenize` forces lowercasing, whereas calling `roberta_ance.tokenizer` directly preserves case. I confirmed that this is not the case with the base RoBERTa model. Is this intended behaviour?
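For what it's worth, the discrepancy would be consistent with the `do_lower_case` option on sentence-transformers' `models.Transformer` module: when the model's saved configuration enables it, `SentenceTransformer.tokenize` lowercases the input before handing it to the underlying HF tokenizer, while accessing `model.tokenizer` directly skips that step. A minimal sketch of that behaviour, assuming this is the mechanism at play (the `tokenize` function and `fake_tokenizer` stand-in below are illustrative, not the library's actual code):

```python
# Sketch of the lowercasing step applied before the HF tokenizer is called.
# Hypothetical simplification of sentence-transformers' Transformer.tokenize.
def tokenize(texts, hf_tokenizer, do_lower_case):
    # when do_lower_case is enabled, input is lowercased before tokenization
    if do_lower_case:
        texts = [text.lower() for text in texts]
    return hf_tokenizer(texts)

# stand-in for a cased tokenizer: the token stream depends on casing
def fake_tokenizer(texts):
    return [text.split() for text in texts]

print(tokenize(["What is this?"], fake_tokenizer, do_lower_case=True))
# [['what', 'is', 'this?']]
print(tokenize(["What is this?"], fake_tokenizer, do_lower_case=False))
# [['What', 'is', 'this?']]
```

If this is the cause, checking the model's `sentence_bert_config.json` for a `do_lower_case` entry should confirm it.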
Environment

```
sentence-transformers==2.2.2
```