Open JunhaoWang opened 2 months ago
You're right. Cosine similarity can take on negative values, although in practice it is heavily biased towards positive values.
A few interesting experiments on this -
Keep in mind that vector embeddings are a result of computing the probability of the word in a given context. This means that `beautiful` and `ugly`, even though they are opposites, have a medium cosine score, since they're likely to appear in the same context. Completely unrelated phrases like `quantum mechanics cryptography algorithms` and `blue kingfisher eating salmon` have a negative cosine score.
However, in Dataset Generation, since all of the nodes have the same/related context, it is extremely unlikely to have a negative cosine score.
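To make the sign behaviour concrete, here is a minimal sketch of cosine similarity in plain Python. The vectors are hypothetical 3-dimensional toy values chosen for illustration, not real embeddings of any of the phrases above, and `cosine_similarity` is a helper written for this example rather than a function from the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; the result lies in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (assumption: stand-ins for embeddings).
base = [1.0, 2.0, 0.5]

same_direction = cosine_similarity(base, [2.0, 4.0, 1.0])    # parallel vectors
partly_aligned = cosine_similarity(base, [1.0, 0.0, 1.0])    # some shared direction
opposed = cosine_similarity(base, [-1.0, -1.5, 0.2])         # mostly opposite direction

print(same_direction, partly_aligned, opposed)
```

The third pair has a negative dot product, so its score is negative. Any code that assumes scores fall in [0, 1] would mishandle that case, even if it rarely occurs when all nodes share a related context.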
In doc,
This is wrong since cosine similarity can take on negative values