Open atasoglu opened 3 months ago
Hello!
I believe this is a configuration issue on the side of boun-tabi-LMG/TURNA. Their tokenizer returns `token_type_ids` when it really should not, as the model does not appear to use them. Sentence Transformers assumes that if the tokenizer returns `token_type_ids`, it's because the model requires them, so they get passed to the model.
See e.g. the following script:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("boun-tabi-LMG/TURNA")
tokenizer = AutoTokenizer.from_pretrained("boun-tabi-LMG/TURNA")

inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```
This also raises the same error:

```
TypeError: T5Model.forward() got an unexpected keyword argument 'token_type_ids'
```
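If you only need a plain `transformers` snippet like the one above to run, one generic workaround (a sketch, not specific to TURNA) is to drop the offending key from the encoded inputs before the forward call, since the tokenizer's output behaves like a dict:

```python
# Sketch: a plain dict standing in for the tokenizer's dict-like BatchEncoding.
# In the real snippet you would call inputs.pop("token_type_ids", None) on the
# tokenizer output before model(**inputs).
inputs = {
    "input_ids": [[101, 2023, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],  # the key T5Model.forward() rejects
}
inputs.pop("token_type_ids", None)  # safe even if the key is absent
print(sorted(inputs))  # ['attention_mask', 'input_ids']
```

This only papers over the symptom per call site, though; the `model_input_names` fix below addresses the configuration itself.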
I suspect this is because the configured tokenizer class here is `PreTrainedTokenizerFast`, not e.g. `T5TokenizerFast`. The former assumes that `token_type_ids` is one of the model inputs: https://github.com/huggingface/transformers/blob/0bd58f1ce0573c0e3269de4215a17d318add49b9/src/transformers/tokenization_utils_base.py#L1561
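A more defensive guard (a generic sketch with hypothetical names, not part of either library) is to keep only the keys that the model's `forward()` actually declares, using `inspect.signature`:

```python
import inspect

def filter_inputs_for(forward_fn, encoded):
    """Keep only the tokenizer outputs that forward_fn declares as parameters."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return {k: v for k, v in encoded.items() if k in accepted}

# Stand-in for T5Model.forward(), which accepts no token_type_ids argument.
def forward(input_ids=None, attention_mask=None):
    return input_ids

encoded = {"input_ids": [1], "attention_mask": [1], "token_type_ids": [0]}
filtered = filter_inputs_for(forward, encoded)
print(sorted(filtered))  # ['attention_mask', 'input_ids']
```

Fixing the tokenizer configuration, as below, is still the cleaner solution.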
So, the patch is as follows:
```python
from sentence_transformers import models, SentenceTransformer

t5_model = models.Transformer("boun-tabi-LMG/TURNA")
pooling_model = models.Pooling(t5_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[t5_model, pooling_model])

# Remove token_type_ids from the tokenizer's model input names, as the model does not use it
model.tokenizer.model_input_names.remove("token_type_ids")

embeddings = model.encode(["Merhaba dünya!"])
print(embeddings.shape)
```

which prints:

```
(1, 1024)
```
And now you can use the model or fine-tune it as normal. Hope this helps.
You can also open a discussion at https://huggingface.co/boun-tabi-LMG/TURNA to note that the `model_input_names` for their tokenizer might not be configured correctly, or that they might want to change the tokenizer class (e.g. `T5TokenizerFast` has the correct `model_input_names` here).
It worked! Thank you very much for your detailed answer and thoughtful advice on the tokenizer!
Hi,
I am trying to use boun-tabi-LMG/TURNA, a Turkish T5 model, with sentence-transformers, as it has been specifically pre-trained for Turkish.
While running the code snippet below, I encountered the TypeError shared below.
Out:
Thank you in advance for your assistance and guidance!