UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Quality of word embeddings using Sentence Transformer models #1329

Open divyag11 opened 2 years ago

divyag11 commented 2 years ago

Hi, I was trying out a Sentence Transformer model (all-MiniLM). The sentence embeddings are of great quality. I wanted to use the token embeddings too, but they do not capture context nearly as well. I was expecting them to be on par with BERT token embeddings, but the quality of the token embeddings is not as good as that of the sentence embeddings. Is that expected? Thanks

nreimers commented 2 years ago

How do you compute the token embeddings, and how do you measure their quality?

divyag11 commented 2 years ago

I was trying to get the token embeddings using:

from sentence_transformers import SentenceTransformer
all_model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
query_emb, sen_emb = all_model.encode(queries, output_value="token_embeddings"), \
                         all_model.encode(sentences, output_value="token_embeddings")  

E.g.: query = "return to mars", sentence = "can I return this product ?"

Using the token embeddings of the query and the sentence, I compute the cosine similarity between the "return" token in the query and the "return" token in the sentence.

Since the query and the sentence above are not similar in meaning, the cosine similarity between the two "return" tokens should be low. But it gives a high similarity score.
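
Roughly, the check looks like this (a sketch rather than the exact script; it assumes "return" stays a single WordPiece token, and that the token embeddings returned by encode include the [CLS]/[SEP] positions so they line up with the tokenizer output):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
query = "return to mars"
sentence = "can I return this product ?"

# Per-token embeddings, one [num_tokens, dim] tensor per input
query_tok = model.encode(query, output_value="token_embeddings", convert_to_tensor=True)
sent_tok = model.encode(sentence, output_value="token_embeddings", convert_to_tensor=True)

# Find the position of "return" in each tokenized input
query_tokens = model.tokenizer.convert_ids_to_tokens(model.tokenizer.encode(query))
sent_tokens = model.tokenizer.convert_ids_to_tokens(model.tokenizer.encode(sentence))
q_idx = query_tokens.index("return")
s_idx = sent_tokens.index("return")

# Cosine similarity between the two "return" token embeddings
print(util.cos_sim(query_tok[q_idx], sent_tok[s_idx]))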

nreimers commented 2 years ago

I would argue that 'return' in both sentences is still fairly similar in meaning, in the sense that in both cases something returns to its origin.

Nonetheless: sentence embeddings are created by averaging the token embeddings, but the lengths (norms) of the token embeddings are not all the same. Important content words have large values and impact the average more.

So it can be that 'return' has a rather small embedding vector and therefore only a small impact on the overall sentence embedding.
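
You can see this by printing the norm of each token embedding (a quick sketch along these lines, using the same model as above; the token strings come from the model's tokenizer and include [CLS]/[SEP], matching the returned token embeddings):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
sentence = "can I return this product ?"

# Per-token embeddings as a [num_tokens, dim] tensor, special tokens included
token_embs = model.encode(sentence, output_value="token_embeddings", convert_to_tensor=True)
tokens = model.tokenizer.convert_ids_to_tokens(model.tokenizer.encode(sentence))

# Tokens with a larger L2 norm pull the averaged sentence embedding more strongly
for token, emb in zip(tokens, token_embs):
    print(f"{token:>10}  norm = {emb.norm().item():.2f}")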

abdullahfurquan commented 1 year ago

In SentenceTransformer, I wanted to use token embeddings. We pass output_value="token_embeddings" to encode to get the per-token embeddings. My question is: how do we map each token to its embedding here?

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

sentences = ['how to map token to word embedding']
embeddings = model.encode(sentences, output_value="token_embeddings")

Allaa-boutaleb commented 7 months ago

> In SentenceTransformer, I wanted to use token embeddings. We pass output_value="token_embeddings" to encode to get the per-token embeddings. My question is: how do we map each token to its embedding here?

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

docs = ["Doc1", "Doc2"]  # your list of documents

# Sentence-level embeddings and per-token embeddings (one array per doc)
docs_embeddings = embedding_model.encode(docs)
word_embeddings = embedding_model.encode(docs, output_value="token_embeddings")

# Tokenize each doc the same way the model does, to get ids and token strings
token_ids = []
token_strings = []
tokenizer = embedding_model._first_module().tokenizer

for doc in docs:
    ids = tokenizer.encode(doc)  # includes the [CLS]/[SEP] special tokens, like the token embeddings
    strings = tokenizer.convert_ids_to_tokens(ids)
    token_ids.append(ids)
    token_strings.append(strings)

Mapping between token_ids, word_embeddings, and token_strings becomes very straightforward with this code, and you have access to each of them individually in case you need it. Hope this helps!
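
For instance, to line them up per document (a small usage sketch on top of the code above; it assumes each doc fits within the model's max_seq_length, so the tokenizer output and the returned token embeddings have the same length):

# Pair each token string with its embedding, document by document
for doc, tokens, embs in zip(docs, token_strings, word_embeddings):
    print(doc)
    for token, emb in zip(tokens, embs):
        print(f"  {token}: {emb[:3]} ...")  # first few dimensions of this token's embedding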