facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Euclidean distance instead of cosine-similarity? #189

Open luca-mastrostefano opened 7 years ago

luca-mastrostefano commented 7 years ago

I'm wondering why, in the paper you linked ("Enriching Word Vectors with Subword Information"), the similarity between words is computed as the cosine similarity of the embedded vectors. Cosine similarity is also used in most of the projects people have built with the vectors you shared with us.

Cosine similarity has been applied very successfully to the term vector space, where the scalar of a vector tells us more about the length of the original document than about its context.

But if the space is shaped by a NN just to discriminate the context given a word, then the scalar of the vectors also helps the NN learn the best representation and generate its output.

Example: the NN can place a set of related words near the origin of the embedded space, simply to map these small weights to a context. The Euclidean distance between these words will be small, but since their directions near the origin are barely constrained, their cosine similarity tells us little about how related they are.
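A minimal numpy sketch of this situation (the two toy vectors are made up purely for illustration, not taken from any trained model): both points sit close to the origin, so their Euclidean distance is tiny, while their cosine similarity says they are very dissimilar.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (ignores their norms)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between u and v (sensitive to their norms)."""
    return np.linalg.norm(u - v)

# Two hypothetical word vectors close to the origin but pointing in
# nearly opposite directions.
a = np.array([0.01, 0.02])
b = np.array([-0.02, -0.01])

print(euclidean_distance(a, b))  # ~0.042 -> "close" by Euclidean distance
print(cosine_similarity(a, b))   # -0.8   -> "dissimilar" by cosine similarity
```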

Do you have any benchmark on this topic that justifies the decision not to take the scalar of the vectors into account, and so to use cosine similarity instead of Euclidean distance?

Thanks

ghost commented 7 years ago

@luca-mastrostefano I am not entirely sure what you mean by "scalar of a vector"; are you referring to the norm of the word vectors?

The reason why the cosine distance works so well in practice is still a bit mysterious, like a lot of things about word vectors. Cosine distance usually works better than other distance measures because the norm of a vector is somewhat related to the frequency with which the word occurs in the training corpus. The direction of the vector, and hence the cosine distance, is unaffected by this, so a common word like "frog" will still be similar to a less frequent word like "Anura", which is its scientific name.
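The norm-invariance is easy to check numerically. In the sketch below the "frog" and "Anura" vectors are invented for illustration (same direction, very different norms, as if one word were far more frequent than the other); they are not taken from a real fastText model.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical vectors: "frog" is frequent (large norm), "Anura" is rare.
# Both point in the same direction.
frog  = np.array([4.0, 2.0, 6.0])
anura = np.array([0.4, 0.2, 0.6])  # frog scaled down by a factor of 10

print(cosine_similarity(frog, anura))         # 1.0  -> identical direction
print(np.linalg.norm(frog - anura))           # ~6.7 -> norms differ a lot

# Rescaling a vector never changes the cosine similarity:
print(cosine_similarity(10.0 * frog, anura))  # still 1.0
```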

The cosine distance is equivalent to the inner product of the normalized word vectors (i.e., the vectors scaled so that their length is 1). You can very loosely interpret the inner product of two word vectors as an approximation of the pointwise mutual information (PMI) between the two words they represent. This is also an avenue researchers took in order to relate neural word vectors to the older word embeddings obtained from a low-rank approximation of a co-occurrence matrix via SVD. That the inner product relates to the PMI between the words is for the most part an empirical result, and there is very little theoretical background behind this finding. The learning algorithm tries to maximize the inner product of words that occur in similar contexts and to minimize it otherwise. Unlike the Euclidean distance, using the inner product as a similarity measure makes the math for the gradient used by the optimization routine pretty nice.
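The equivalence between the cosine and the inner product of the normalized vectors is just algebra, but it is easy to verify with random stand-in vectors (a quick sketch, no fastText model involved):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=300)  # stand-ins for two 300-dimensional word vectors
v = rng.normal(size=300)

def l2_normalize(x):
    """Scale x so that its Euclidean length is 1."""
    return x / np.linalg.norm(x)

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
inner_of_normalized = np.dot(l2_normalize(u), l2_normalize(v))

# Identical up to floating-point error.
assert np.isclose(cosine, inner_of_normalized)
print(cosine, inner_of_normalized)
```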

If you want to learn more about it, I recommend having a look at this paper by Sanjeev Arora et al. and also this one by Omer Levy and Yoav Goldberg who have looked at your question and related topics in detail.