UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Adding special tokens to the model #744

Open naserahmadi opened 3 years ago

naserahmadi commented 3 years ago

Hello, I am trying to use `model.tokenizer.add_special_tokens(special_tokens_dict)` to add some special tokens to the model. But after doing that, I get an indexing error (IndexError: index out of range in self) when I try to encode a sentence. How can I learn the vector representations of the new tokens? Something like `model.resize_token_embeddings(len(t))`.
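
For example, a minimal reproduction of the error (the checkpoint and token names here are placeholders, not from the original report):

```python
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; any standard SentenceTransformer model shows the same behavior.
model = SentenceTransformer("bert-base-nli-mean-tokens")
model.tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[E2]"]})

# The tokenizer now emits ids beyond the size of the embedding matrix,
# so encoding a sentence that contains the new tokens fails:
model.encode("[E1] some entity [E2]")  # IndexError: index out of range in self
```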

nreimers commented 3 years ago

You can use this code:

```python
tokens = ["TOK1", "TOK2"]
word_embedding_model = model._first_module()  # your models.Transformer object
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
```
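
After resizing, the new tokens map to rows of the enlarged embedding matrix, so encoding works again. A quick sanity check (the sentence is arbitrary):

```python
# The new embedding rows are randomly initialized; they only become
# meaningful after fine-tuning on data that contains the new tokens.
embedding = model.encode("This sentence mentions TOK1 and TOK2.")
print(embedding.shape)  # e.g. (768,), depending on the base model
```
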
Aatlantise commented 3 years ago

Hello, this method doesn't seem to work for a CrossEncoder:

```python
roberta.auto_model.resize_token_embeddings(len(roberta.tokenizer))
# AttributeError: 'CrossEncoder' object has no attribute 'auto_model'

first_module = roberta._first_module()
# AttributeError: 'CrossEncoder' object has no attribute '_first_module'
```

How would I resize token embeddings for a CrossEncoder?

Aatlantise commented 3 years ago

I used `roberta.model.resize_token_embeddings(len(roberta.tokenizer))`, and the code works.

Is this the correct way to go about it? Thank you!

nreimers commented 3 years ago

Yes, it is correct.
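
Put together, the CrossEncoder variant looks like this (a sketch; the checkpoint name is a placeholder):

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder checkpoint

tokens = ["TOK1", "TOK2"]
cross_encoder.tokenizer.add_tokens(tokens, special_tokens=True)
# CrossEncoder keeps the Hugging Face model under .model (not .auto_model):
cross_encoder.model.resize_token_embeddings(len(cross_encoder.tokenizer))

# Scoring still works; the new embedding rows are random until fine-tuned.
score = cross_encoder.predict([("a query with TOK1", "a passage with TOK2")])
print(score)
```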

BitnaKeum commented 2 years ago

> You can use this code:
>
> ```python
> tokens = ["TOK1", "TOK2"]
> word_embedding_model = model._first_module()  # your models.Transformer object
> word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
> word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
> ```

In my case, I'm using my own model, and this approach raises an error at `model.encode()` (error message: 'Transformer' object has no attribute 'encode'). So I use this code instead:

```python
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("MY MODEL")
tokens = ["TOK1", "TOK2"]
word_embedding_model = model._first_module()  # the models.Transformer module
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))

# Rebuild the SentenceTransformer from its modules so that encode() is available again.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
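
One follow-up worth noting (an assumption about the typical workflow, not part of the answer above): the embedding rows for the new tokens are random until you fine-tune, and if you want to keep the resized model, save it so the enlarged vocabulary and tokenizer travel together:

```python
# "my-model-with-new-tokens" is a placeholder output directory.
model.save("my-model-with-new-tokens")
```
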
kitrak-rev commented 1 year ago

Does SentenceTransformer have any special tokens like [SEP] and [CLS] by default? @nreimers @BitnaKeum
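
For reference, a quick way to check is to inspect the underlying Hugging Face tokenizer (a sketch; the checkpoint is a placeholder, and BERT-style models typically report [CLS], [SEP], [PAD], [UNK], and [MASK]):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
print(model.tokenizer.special_tokens_map)
# e.g. {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#       'cls_token': '[CLS]', 'mask_token': '[MASK]'}
```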