Open divyag11 opened 2 years ago
How do you compute the token embeddings, and how do you measure their quality?
I was trying to get the token embeddings using:
from sentence_transformers import SentenceTransformer
all_model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
query_emb = all_model.encode(queries, output_value="token_embeddings")
sen_emb = all_model.encode(sentences, output_value="token_embeddings")
E.g. query = "return to mars", sentence = "can I return this product?"
Based on the token embeddings of the query and the sentence, I compute the cosine similarity between the 'return' token in the query and the 'return' token in the sentence.
Since the query and sentence above are not similar in meaning, the cosine similarity between the two 'return' tokens should be low. But it gives a high similarity score.
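One thing worth checking in this comparison: cosine similarity is invariant to vector magnitude, so two token vectors pointing in roughly the same direction score high regardless of their norms. A minimal NumPy sketch (toy vectors, not real model outputs) of the computation described above:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity compares direction only; magnitude cancels out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy stand-ins for the two "return" token embeddings
v_query = np.array([0.9, 0.1, 0.2])
v_sentence = np.array([0.8, 0.2, 0.1])

print(cosine_sim(v_query, v_sentence))  # high, despite the different contexts
```

With real token embeddings you would slice the row for "return" out of each sentence's token-embedding matrix and feed the two rows to a function like this.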
I would argue that 'return' in both sentences is still fairly similar in meaning, in the sense that in both cases something returns to its origin.
But nonetheless, sentence embeddings are created by averaging the token embeddings. The norms of the token embeddings are not all the same: important content words have large values, so they impact the average more.
It can therefore be that 'return' has a rather small embedding vector, affecting the overall sentence embedding only slightly.
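The norm effect described above can be illustrated with synthetic vectors (assuming simple mean pooling, as the all-* models use):

```python
import numpy as np

# synthetic token embeddings: one "content" token with a large norm,
# two "function" tokens with small norms
content = np.array([10.0, 0.0])
func1 = np.array([0.0, 0.5])
func2 = np.array([0.1, 0.4])

# mean pooling over the tokens, as in the sentence-embedding step
sentence_emb = np.mean([content, func1, func2], axis=0)
print(sentence_emb)  # direction dominated by the large-norm token
```

A small-norm token can thus be nearly "similar" to its counterpart under cosine similarity while still contributing little to the pooled sentence embedding.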
In SentenceTransformer, I wanted to use token embeddings. We pass output_value="token_embeddings" to get the per-token embeddings. My question is: how do we map each token to its embedding here?
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
sentences = ['how to map token to word embedding']
embeddings = model.encode(sentences, output_value="token_embeddings")
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
docs = [Doc1, Doc2, ...]  # your list of document strings

docs_embeddings = embedding_model.encode(docs)
word_embeddings = embedding_model.encode(docs, output_value="token_embeddings")

# tokenize each doc the same way the model does, so that token i of a doc
# lines up with row i of that doc's token-embedding matrix
tokenizer = embedding_model._first_module().tokenizer
token_ids = []
token_strings = []
for doc in docs:
    ids = tokenizer.encode(doc)
    strings = tokenizer.convert_ids_to_tokens(ids)
    token_ids.append(ids)
    token_strings.append(strings)
Mapping between token_ids, word_embeddings, and token_strings becomes very straightforward when you use this code, and you'll have access to each piece individually in case you need it. Hope this helps!
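To make the mapping concrete: for document i, token_strings[i] and word_embeddings[i] are parallel sequences, so pairing them is a simple zip. A sketch with a dummy embedding matrix standing in for real model output (note the tokenizer also emits special tokens like [CLS] and [SEP], which get embedding rows too):

```python
import numpy as np

# dummy stand-ins: 5 tokens (including special tokens) and a 5 x 4 embedding matrix
strings = ["[CLS]", "return", "to", "mars", "[SEP]"]
emb = np.random.rand(len(strings), 4)

# row i of the token-embedding matrix belongs to token i
token_to_emb = list(zip(strings, emb))
for tok, vec in token_to_emb:
    print(tok, vec.shape)
```

With real outputs you would replace `strings` with token_strings[i] and `emb` with word_embeddings[i] from the code above.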
Hi, I was trying out the sentence-transformers all-MiniLM model. The sentence embeddings are of great quality, and I wanted to use the token embeddings too, but they do not carry that much context. I was expecting them to be on similar lines to BERT token embeddings, but the quality of the token embeddings is not as good as that of the sentence embeddings. Is that expected? Thanks