epfml / sent2vec

General purpose unsupervised sentence representations

Cosine similarity above 1 #78

Closed matthias-herrmann closed 4 years ago

matthias-herrmann commented 5 years ago

Calculating the centroid for some sentence vectors and then calculating the cosine similarity of the centroid and another sentence is returning a cosine similarity above 1. I'm using the pretrained sent2vec_wiki_bigrams model.

from scipy.spatial import distance
distance.cosine(self.centroid, vector)

Is there a way that I get only values between 0 and 1?

mpagli commented 5 years ago

You probably need to normalize your embeddings.
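For reference, here is a minimal sketch of what normalizing and comparing two vectors could look like. The helper names `l2_normalize` and `cosine_similarity` and the toy vectors are made up for illustration and are not part of sent2vec. Note also that `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. 1 minus the cosine similarity, so for real-valued vectors it ranges over [0, 2]. A result above 1 there would simply mean the similarity is negative.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale v to unit length; return it unchanged if its norm is (near) zero."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm < eps:
        return v
    return v / norm

def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1], computed on unit-length copies of a and b."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

# Toy vectors standing in for a centroid and a sentence embedding (hypothetical).
centroid = np.array([1.0, 0.0])
vector = np.array([-1.0, 1.0])

similarity = cosine_similarity(centroid, vector)
# Cosine distance as scipy's distance.cosine defines it: 1 - similarity.
cosine_distance = 1.0 - similarity
```

With these toy vectors the similarity is negative, so the distance comes out above 1. If a score in [0, 1] is needed, one option (an assumption about what is wanted here) is to rescale the similarity as `(similarity + 1) / 2`.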

matthias-herrmann commented 5 years ago

@mpagli I have already done that. Maybe it's the way words that are not part of the model's vocabulary are treated, or there is something else wrong. I need to do some further research on that.

mpagli commented 5 years ago

If the word is not in the vocab, sent2vec will give you an empty vector with 0 norm. This might be the trigger. Maybe check beforehand whether the word is in the vocab, or check whether the norm is zero.
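The norm check could be sketched roughly like this. It works on plain NumPy arrays and does not call sent2vec at all. The helper names `nonzero_rows` and `safe_cosine_similarity`, the `eps` threshold, and the toy embeddings are assumptions made for illustration.

```python
import numpy as np

def nonzero_rows(vectors, eps=1e-12):
    """Keep only the rows whose L2 norm is above eps (drops all-zero embeddings)."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    return vectors[norms > eps]

def safe_cosine_similarity(a, b, eps=1e-12):
    """Return the cosine similarity, or None if either vector has (near) zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < eps or norm_b < eps:
        return None  # similarity is undefined for a zero vector
    # Clip to absorb floating-point rounding just outside [-1, 1].
    return float(np.clip(np.dot(a, b) / (norm_a * norm_b), -1.0, 1.0))

# Toy embeddings (hypothetical); the middle row mimics an out-of-vocabulary sentence.
embeddings = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
usable = nonzero_rows(embeddings)
centroid = usable.mean(axis=0)

zero_case = safe_cosine_similarity(centroid, [0.0, 0.0])
same_direction = safe_cosine_similarity(centroid, [1.0, 1.0])
```

Returning `None` for the zero-norm case is just one possible design choice. It forces the caller to decide how to treat sentences the model could not embed instead of silently getting a misleading number.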