UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

High cosine similarity with just one word #536

Open ankitkr3 opened 3 years ago

ankitkr3 commented 3 years ago

Hi guys, thanks for your continuous support and work

I am trying to compute semantic similarity using a RoBERTa-large model, but I am getting unexpectedly high scores. For example:

Ideal text: The early explorers and traders shaped our history by changing the way indians lived and by learning about new land for the U.S. The traders shaped our history by changing indians traditions. For example the indians use to use every part of a buffalo. Then they started to kill buffalo only for their pelts so they could trade them with the traders. The explorers shaped our history by discovering Pikes Peak. If Pike never climbed pikes peak it probably wouldn't be named that. In conclusion, traders and explorers shaped our history.

Compared text: History

Generated score: 30% (cosine similarity)

Expected score: 0-5%
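For reference, a minimal sketch of how such a score might be computed with sentence-transformers (the issue does not show the original code; the model name `stsb-roberta-large` is an assumption, the issue only says "RoBERTa large"):

```python
from sentence_transformers import SentenceTransformer, util

# Model name is an assumption; any RoBERTa-large sentence-embedding model applies.
model = SentenceTransformer("stsb-roberta-large")

ideal_text = (
    "The early explorers and traders shaped our history by changing the way "
    "indians lived and by learning about new land for the U.S. ..."
)
compared_text = "History"

# Encode both texts and take the cosine similarity of their embeddings.
emb_ideal, emb_compared = model.encode([ideal_text, compared_text], convert_to_tensor=True)
score = util.pytorch_cos_sim(emb_ideal, emb_compared).item()
print(f"Cosine similarity: {score:.2f}")  # a single, uncalibrated number
```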

nreimers commented 3 years ago

Hi @ankitkr3, this is not how it works.

The raw score by itself is not meaningful; you usually have to compare scores, i.e. cossim(A, B) vs. cossim(A, C).

Further, if all embeddings lie in the positive part of the vector space, you would expect a cossim score of around 0.5 even for two random points.
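A minimal sketch of what such a comparison looks like in practice (the model name and the alternative candidate sentences are made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-roberta-large")  # model name is an assumption

essay = "The early explorers and traders shaped our history ..."  # text A
candidates = ["History", "Geography", "Cooking recipes"]          # B, C, ...

emb_essay = model.encode(essay, convert_to_tensor=True)
emb_candidates = model.encode(candidates, convert_to_tensor=True)

# Only the ranking of these scores is informative, not their absolute values.
scores = util.pytorch_cos_sim(emb_essay, emb_candidates)[0]
for candidate, score in zip(candidates, scores):
    print(f"{candidate}: {score:.2f}")
```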

ankitkr3 commented 3 years ago

@nreimers Where would I get cossim(A, C) from when I only have two sentences? Please explain a bit.

nreimers commented 3 years ago

As mentioned, the scores by themselves are not meaningful; you cannot say whether 30% is a high or low value. A score only makes sense when you compare it with the scores of other examples.