UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Bad results when looking for similar words and not sentences #435

Open AsmaZbt opened 3 years ago

AsmaZbt commented 3 years ago

Hello! Thank you so much for sharing this beautiful work, and especially for sharing the example applications.

I have tried to modify your example code in semantic_search.py and semantic_search_quora_annoy.py to retrieve similar words rather than similar sentences,

but I got bad results like this:

Using semantic_search.py:

Query: man

Top 5 most similar words in corpus:
A (Score: 0.2989)
food (Score: 0.2878)
a (Score: 0.2809)
man (Score: 0.2708)
is (Score: 0.2699)
[[2158], [2619]]

======================

Query: Someone

Top 5 most similar words in corpus:
A (Score: 0.2989)
food (Score: 0.2878)
a (Score: 0.2809)
man (Score: 0.2708)
is (Score: 0.2699)

and using semantic_search_quora_annoy.py:

Corpus loaded with 100000 sentences / embeddings
Please enter a word: uncertainty
Input word: uncertainty
Results (after 2.949 seconds):
0.757 top
0.652 Are
0.647 permanent
0.647 What
0.646 colour
0.645 Pune
0.645 the
0.644 ?
0.642 What
0.641 rafting

Approximate Nearest Neighbor Recall@10: 40.00
Missing results:
0.723 knows
0.712 are
0.692 emotion
0.692 Why
0.678 the
0.661 are

So I thought that the resulting similar words are the words that occur in the same context in the corpus: because the embeddings are contextual, the words with similar embeddings are the words that occur in the same context, not the same word appearing in other contexts. For example, for semantic_search the corpus is:

corpus = ['A man is eating food.', 'A man is eating a piece of bread.', 'The girl is carrying a baby.', 'A man is riding a horse.', 'A woman is playing violin.', 'Two men pushed carts through the woods.', 'A man is riding a white horse on an enclosed ground.', 'A monkey is playing drums.', 'A cheetah is running behind its prey.' ]

and the nearest words given by my modified code are:

Top 5 most similar words in corpus:
A (Score: 0.2989)
food (Score: 0.2878)
a (Score: 0.2809)
man (Score: 0.2708)
is (Score: 0.2699)
[[2158], [2619]]

The really similar word is 'man', not 'food' or function words like 'A' and 'is'.

So is my analysis correct, or did I make a mistake in my modified code? And how can I get similar words using these tools, please?

Thank you so much.

nreimers commented 3 years ago

Can you post a minimal (self contained) example of your modified code?

BERT and SBERT were trained on complete sentences. Hence, for individual words, I don't expect the best results. There, traditional word embedding models like word2vec or GloVe are better.

Averaged word2vec and GloVe sentence embedding models are also available as pre-trained models. They also work well for single words.
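
For illustration, a minimal sketch of a word-similarity lookup with one of those averaged-GloVe checkpoints; the model name 'average_word_embeddings_glove.6B.300d' and the small vocabulary below are assumptions for the example, not something taken from this thread:

from sentence_transformers import SentenceTransformer, util

# Sketch: an averaged-GloVe model assigns each word its own static vector,
# so a single-word query behaves like a classic word-embedding lookup.
# The model name is an assumption, not confirmed in this thread.
model = SentenceTransformer('average_word_embeddings_glove.6B.300d')

vocabulary = ['man', 'woman', 'someone', 'food', 'horse', 'violin']
query = 'man'

vocab_emb = model.encode(vocabulary, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.pytorch_cos_sim(query_emb, vocab_emb)[0]
for word, score in sorted(zip(vocabulary, scores.tolist()), key=lambda x: -x[1]):
    print(word, "(Score: %.4f)" % score)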

AsmaZbt commented 3 years ago
from sentence_transformers import SentenceTransformer, util
import numpy as np
import nltk
import torch

def get_bert_embeddings(sentences, model, batch_size):
    arr = [model.tokenize(sent) for sent in sentences]
    arr = [list(a) for a in arr]
    embeddings = []
    for i in range(0, len(sentences), batch_size):
        print(arr[i : i + batch_size])
        b_emb = model.encode(arr[i : i + batch_size], output_value='token_embeddings', convert_to_tensor=True, is_pretokenized=True, show_progress_bar=False, batch_size=batch_size)

        embeddings.append(b_emb)

        E = [item for sublist in embeddings for item in sublist]

    total_embedding = torch.cat(E, dim=0)

    return total_embedding

embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
embeddings = get_bert_embeddings(corpus, embedder, batch_size=64)

queries = ['man', 'Someone']
list_words=[]
for sent in corpus:
    tokens = nltk.word_tokenize(sent)
    for token in tokens:
        list_words.append(token)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:

    query_embeddings = get_bert_embeddings(queries, embedder, batch_size=64)
    # query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embeddings, embeddings)[0]
    cos_scores = cos_scores.cpu()

    #We use np.argpartition, to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar words in corpus:")

    for idx in top_results[0:top_k]:
        print(list_words[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
AsmaZbt commented 3 years ago

Thank you so much for answering.

Word embeddings do not capture the different senses of a given word, because they have only one representation per word, while the meaning of a word depends heavily on its context. BERT and SBERT rely on contextual embeddings, so I think BERT or SBERT should give us better results in differentiating the meanings of a word.

We can easily get the nearest sentences in meaning using BERT, so why can't we do the same with words?

Shafi2016 commented 3 years ago

Thanks a lot, nreimers!! AsmaZbt and I are working on the same problem. We found that bert-embedding 1.0.1 (https://pypi.org/project/bert-embedding/) produces better results for finding similar words. We wanted to use sentence-transformers because it gives us access to a variety of models such as RoBERTa, and we would be able to use our own fine-tuned model. bert-embedding 1.0.1 is also trained on sentences, but it still gives better results. With sentence-transformers, for the query [there is uncertainty], I get:

[screenshot: sentence-transformers results for the query]

With bert-embedding I get better similar words for the same query: [screenshot: bert-embedding results]

Shafi2016 commented 3 years ago

Hello nreimers, if we fine-tune SBERT on words, do you think we will get better similar words? And can we fine-tune SBERT on words at all? I fine-tuned SBERT on sentences for semantic search and its performance increased a lot.

nreimers commented 3 years ago

Hi, the tokenization of your input to encode() looks quite strange and quite complicated. Are you sure the right values are passed to encode()?

Why not just do:

 b_emb = model.encode(sentences, output_value='token_embeddings', convert_to_tensor=True, show_progress_bar=False, batch_size=len(sentences))

Or

 b_emb = model.encode(sentences[i:i+batch_size], output_value='token_embeddings', convert_to_tensor=True, show_progress_bar=False, batch_size=batch_size)
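
Put together with the model and corpus from the earlier comment, a self-contained sketch of that simpler call; depending on the library version, the token embeddings may come back as one padded tensor or as a list of per-sentence tensors:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.']

# One entry per sentence; each entry holds one 768-dim vector per wordpiece.
token_embeddings = model.encode(corpus, output_value='token_embeddings',
                                convert_to_tensor=True, show_progress_bar=False,
                                batch_size=len(corpus))

for sentence, emb in zip(corpus, token_embeddings):
    print(sentence, tuple(emb.shape))
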
Shafi2016 commented 3 years ago

Thanks! The change suggested above does not produce any difference; the results are still the same.

AsmaZbt commented 3 years ago

Hi, the tokenization of your input to encode() looks quite strange and quite complicated. Are you sure the right values are passed to encode()?

Why not just do:

 b_emb = model.encode(sentences, output_value='token_embeddings', convert_to_tensor=True, show_progress_bar=False, batch_size=len(sentences))

Or

 b_emb = model.encode(sentences[i:i+batch_size], output_value='token_embeddings', convert_to_tensor=True, show_progress_bar=False, batch_size=batch_size)

That does not change the results. But the shape of b_emb is (9, 15, 768): 9 sentences, 15 is the maximum number of wordpieces, and 768 is the dimension of one wordpiece vector.

So I think we get the nearest neighbors over wordpiece vectors and not over the real tokens.
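
Following up on that observation, a sketch (not from this thread) of one way to pool the wordpiece vectors back to word level, going through the underlying Hugging Face tokenizer and its word_ids() mapping rather than SentenceTransformer.encode; distilbert-base-uncased and the naive whitespace pre-tokenization are assumptions made purely for illustration:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')
model.eval()

sentence = 'A man is eating food.'
words = sentence.replace('.', ' .').split()  # naive whitespace pre-tokenization

# is_split_into_words=True keeps track of which word each wordpiece came from
encoding = tokenizer(words, is_split_into_words=True, return_tensors='pt')
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (num_wordpieces, 768)

# Average the wordpiece vectors belonging to each original word, so that
# nearest-neighbor search can run over real tokens instead of wordpieces.
word_ids = encoding.word_ids()  # wordpiece position -> word index (None for [CLS]/[SEP])
word_vectors = []
for word_idx in range(len(words)):
    positions = [i for i, w in enumerate(word_ids) if w == word_idx]
    word_vectors.append(hidden[positions].mean(dim=0))

word_vectors = torch.stack(word_vectors)  # (num_words, 768)
print(words, tuple(word_vectors.shape))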