UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

RoBERTa ANCE FirstP forces lowercasing #1831

Open atreyasha opened 1 year ago

atreyasha commented 1 year ago

Description

I was experimenting with the sentence-transformers/msmarco-roberta-base-ance-firstp model and observed a discrepancy in the tokenizer output depending on how the tokenizer is called. See the example below:

from sentence_transformers import SentenceTransformer

# load model
roberta_ance = SentenceTransformer("sentence-transformers/msmarco-roberta-base-ance-firstp")

print(roberta_ance.tokenize(["What is this?"]))
# >>> {'input_ids': tensor([[0, 12196, 16, 42, 116, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
# decodes to "what is this?"

print(roberta_ance.tokenizer(["What is this?"]))
# >>> {'input_ids': [[0, 2264, 16, 42, 116, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
# decodes to "What is this?"

It appears that roberta_ance.tokenize lowercases the input, whereas roberta_ance.tokenizer does not. I confirmed that this is not the case with the base RoBERTa model:

from sentence_transformers import SentenceTransformer

# load model
roberta = SentenceTransformer("roberta-base")

print(roberta.tokenize(["What is this?"]))
# >>> {'input_ids': tensor([[0, 2264, 16, 42, 116, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
# decodes to "What is this?"

print(roberta.tokenizer(["What is this?"]))
# >>> {'input_ids': [[0, 2264, 16, 42, 116, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
# decodes to "What is this?"
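For reference, the decoded strings quoted in the comments above can be reproduced with the tokenizer's decode method (a quick sketch; skip_special_tokens just drops the <s> and </s> markers):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# decode the ids produced by roberta_ance.tokenize (lowercased)
print(tokenizer.decode([0, 12196, 16, 42, 116, 2], skip_special_tokens=True))
# >>> 'what is this?'

# decode the ids produced by roberta.tokenize / the plain tokenizer call
print(tokenizer.decode([0, 2264, 16, 42, 116, 2], skip_special_tokens=True))
# >>> 'What is this?'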

Is this intended behaviour?
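If it helps with triage: the lowercasing presumably comes from the do_lower_case flag on the underlying models.Transformer module, which sentence-transformers reads from the model's sentence_bert_config.json and applies inside tokenize() before calling the HF tokenizer. A minimal sketch to check this (it relies on the private _first_module() helper, so treat it as an assumption about internals rather than public API):

from sentence_transformers import SentenceTransformer

roberta_ance = SentenceTransformer("sentence-transformers/msmarco-roberta-base-ance-firstp")

# SentenceTransformer.tokenize delegates to the first module's tokenize();
# for models.Transformer that method lowercases the input text when
# do_lower_case is set (loaded from sentence_bert_config.json)
transformer = roberta_ance._first_module()
print(transformer.do_lower_case)
# presumably True for this model, which would explain the lowercased ids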

Environment

sentence-transformers==2.2.2

atreyasha commented 1 year ago

Just to add, the Hugging Face tokenizer for sentence-transformers/msmarco-roberta-base-ance-firstp does not perform lowercasing either:

from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-roberta-base-ance-firstp")

print(tokenizer(["What is this?"]))
# >>> {'input_ids': [[0, 2264, 16, 42, 116, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
# decodes to "What is this?"
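A possible workaround in the meantime, assuming do_lower_case is the only source of the lowercasing (a sketch, not a recommendation; if the model was trained on lowercased text, disabling the flag may hurt retrieval quality):

from sentence_transformers import SentenceTransformer

roberta_ance = SentenceTransformer("sentence-transformers/msmarco-roberta-base-ance-firstp")

# disable the module-level lowercasing so tokenize() matches tokenizer()
roberta_ance._first_module().do_lower_case = False

print(roberta_ance.tokenize(["What is this?"]))
# expected: {'input_ids': tensor([[0, 2264, 16, 42, 116, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}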