naserahmadi opened this issue 3 years ago
You can use this code:
```python
tokens = ["TOK1", "TOK2"]
word_embedding_model = model._first_module()  # your models.Transformer object
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
```
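To sanity-check the change, you can encode a sentence containing the new tokens (a quick test, assuming `model` is the SentenceTransformer modified above):

```python
# The new tokens should now map to valid ids in the enlarged vocabulary
print(word_embedding_model.tokenizer.convert_tokens_to_ids(["TOK1", "TOK2"]))

# Encoding a sentence with the new tokens should work without index errors
embeddings = model.encode(["A sentence containing TOK1 and TOK2"])
print(embeddings.shape)
```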
Hello, this method doesn't seem to work for a CrossEncoder:
```python
roberta.auto_model.resize_token_embeddings(len(roberta.tokenizer))
# AttributeError: 'CrossEncoder' object has no attribute 'auto_model'

first_module = roberta._first_module()
# AttributeError: 'CrossEncoder' object has no attribute '_first_module'
```
How would I resize token embeddings for a CrossEncoder?
I used `roberta.model.resize_token_embeddings(len(roberta.tokenizer))`, and the code works. Is this the correct way to go about it? Thank you!
Yes, it is correct.
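For a CrossEncoder, the complete sequence would look like this (a minimal sketch; `cross-encoder/stsb-roberta-base` is just an example checkpoint):

```python
from sentence_transformers import CrossEncoder

roberta = CrossEncoder("cross-encoder/stsb-roberta-base")  # example checkpoint

tokens = ["TOK1", "TOK2"]
roberta.tokenizer.add_tokens(tokens, special_tokens=True)
# A CrossEncoder exposes the underlying Hugging Face model as .model
# (not .auto_model), so resize the embedding matrix there
roberta.model.resize_token_embeddings(len(roberta.tokenizer))
```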
> You can use this code:
>
> ```python
> tokens = ["TOK1", "TOK2"]
> word_embedding_model = model._first_module()  # your models.Transformer object
> word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
> word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
> ```
In my case, I'm using my own model, and this approach raises an error at `model.encode()` (error message: `'Transformer' object has no attribute 'encode'`).
So I use this code instead:
```python
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("MY MODEL")
tokens = ["TOK1", "TOK2"]
word_embedding_model = model._first_module()
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
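This works because the bare `models.Transformer` module returned by `_first_module()` has no `encode` method; wrapping it together with a pooling layer back into a `SentenceTransformer` restores `encode`.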
Does SentenceTransformer by default have any special tokens like [SEP] and [CLS]? @nreimers @BitnaKeum
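For what it's worth, the special tokens come from the underlying Hugging Face tokenizer rather than from sentence-transformers itself, so they depend on the base model (BERT-style tokenizers use [CLS]/[SEP]; RoBERTa-style tokenizers use `<s>`/`</s>`). You can inspect them directly:

```python
# Show the special tokens the wrapped tokenizer already defines
print(model.tokenizer.special_tokens_map)
print(model.tokenizer.all_special_tokens)
```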
Hello, I am trying to use `model.tokenizer.add_special_tokens(special_tokens_dict)` to add some special tokens to the model. But after doing that, I received an indexing error (`IndexError: index out of range in self`) when I wanted to encode a sentence. How can I learn the vector representations of the new tokens, something like `model.resize_token_embeddings(len(t))`?
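The `IndexError` usually means the embedding matrix was never resized after the vocabulary grew, so the new token ids point past the end of the embedding table. A minimal sketch of the fix, following the recipe earlier in this thread (`[NEW1]` and `[NEW2]` are placeholder tokens):

```python
special_tokens_dict = {"additional_special_tokens": ["[NEW1]", "[NEW2]"]}
model.tokenizer.add_special_tokens(special_tokens_dict)

# Grow the transformer's embedding matrix to match the enlarged vocabulary
model._first_module().auto_model.resize_token_embeddings(len(model.tokenizer))

# Encoding should no longer raise IndexError
embedding = model.encode("A sentence with [NEW1] in it")
```

The new embedding rows are randomly initialized, so the representations of the added tokens only become meaningful once you fine-tune the model on data that contains them.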