Open ajschumacher opened 9 years ago
oh wow, this is from a while ago...
thinking about doing something now, re: http://planspace.org/20170705-word_vectors_and_sat_analogies/
Law of Cosines approach:
but kind of circular...
ooh! better?
map to a post:
still need to explain why it works in multiple dimensions though?
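(Sketching here how I read the Law of Cosines approach, for my own reference; u and v are just any two word vectors and θ is the angle between them.)

```latex
% Law of Cosines for the triangle with sides u, v, and u - v:
\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,\|u\|\,\|v\|\cos\theta
% Expanding the left side with the dot product,
% \|u - v\|^2 = \|u\|^2 - 2\,u \cdot v + \|v\|^2,
% and cancelling gives
\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}
% Nothing here depends on the number of dimensions: it's all dot products.
```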
Here are some notes from correspondence with Peter Turney (the word vector SAT questions guy, see: https://planspace.org/20170705-word_vectors_and_sat_analogies/).
The cosine of two vectors is equal to the inner product of the normalized vectors (normalizing to unit length). Therefore, in effect, cosine already has normalization built into it. A more fair comparison would be:
(1) Euclidean distance between two vectors that have not been normalized
(2) Euclidean distance between two vectors normalized to unit length
(3) Inner product between two vectors that have not been normalized
(4) Inner product between two vectors normalized to unit length (cosine)
1 vs 2, 3 vs 4, 1 vs 3, 2 vs 4
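For reference, a minimal sketch of the four measures (just numpy; loading the GloVe vectors is omitted, and the function name is made up):

```python
import numpy as np

def four_measures(a, b):
    """The four quantities being compared, for two word vectors a and b."""
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    return {
        "1 euclidean, not normalized": np.linalg.norm(a - b),
        "2 euclidean, normalized": np.linalg.norm(a_hat - b_hat),
        "3 inner product, not normalized": np.dot(a, b),
        "4 inner product, normalized (cosine)": np.dot(a_hat, b_hat),
    }
```

For 1 and 2 the best candidate answer is the one with the smallest value; for 3 and 4, the one with the largest.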
I hadn't run (3), so I added that; here are some pretty typical results (number correct using glove.6B.300d):
(1) 129, (2) 157, (3) 153, (4) 157
So 2 and 4 get nearly identical results (varying by at most one question), which makes sense because they're measuring very nearly the same thing (chord length vs. angle).
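(To spell out the chord/angle connection: for unit vectors the two are monotonically related, so the rankings should agree up to ties and floating point.)

```latex
% For unit vectors \hat{a} and \hat{b} with angle \theta between them:
\|\hat{a} - \hat{b}\|^2
  = \hat{a}\cdot\hat{a} - 2\,\hat{a}\cdot\hat{b} + \hat{b}\cdot\hat{b}
  = 2 - 2\cos\theta
% so the chord length (measure 2) is a decreasing function of the cosine
% (measure 4): whichever candidate one ranks best, the other does too.
```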
I was impressed by how well 3 does, always beating 1 by a substantial margin despite having no normalization. (For some embeddings, it even beats 4!)
The reason the comparison between 1 and 4 is interesting to me is that, depending on the structure of the space, I could imagine 1 doing better. For example, in a space formed by tSNE (like this example) very different clusters can be "in the same direction": in that example, if the origin is the lower left, "barbecue" and "augmented reality" have about the same angle. It's true that tSNE is usually just used for visualization, and in all-positive (count-based, etc.) spaces it makes sense for distance from the origin to mostly encode, say, document length rather than difference in content per se.

But for word embeddings from GloVe or word2vec I didn't have an intuition for what the space should be like. They're not all positive, for one thing. Maybe I shouldn't be surprised, though: the word embedding process is probably roughly fitting gaussians (or at least is not likely to be multimodal in any dimension), and in high dimensions there are so many different directions that no two clusters are likely to end up in the same direction anyway.
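A quick check of that last intuition, with nothing embedding-specific in it, just random Gaussian vectors: their pairwise cosines concentrate around zero as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cosine similarity between pairs of independent Gaussian vectors:
# in high dimensions, random directions are very nearly orthogonal.
for dim in (2, 30, 300):
    a = rng.standard_normal((1000, dim))
    b = rng.standard_normal((1000, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    print(dim, round(float(cos.mean()), 3), round(float(cos.std()), 3))
```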
"I was impressed by how well 3 does, always beating 1 by a substantial margin despite having no normalization. (For some embeddings, it even beats 4!)"
If you think about how the inner product works, this is not so strange. It's the sum of the element-wise products of the two vectors. If the two vectors both give a high value to a certain dimension (element, feature), then the product of those two values will be a really high value. So the inner product is rewarding vectors that agree on which features are really important.
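(Toy illustration with made-up three-dimensional vectors: one shared big feature dominates the sum.)

```python
import numpy as np

# Both vectors put most of their weight on the first feature...
a = np.array([9.0, 1.0, 0.5])
b = np.array([8.0, 0.5, 1.0])

print(a * b)         # element-wise products: [72.   0.5   0.5]
print(np.dot(a, b))  # inner product is their sum: 73.0
```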
"The reason the comparison between 1 and 4 is interesting to me is that depending on the structure of the space, I could imagine 1 doing better."
When you have an experiment where you're changing some variables to see what happens, ideally you only change one variable at a time. If you change two variables simultaneously and the results are quite different, you don't know which of the two variables is responsible. Two things are different between 1 and 4: the presence/absence of normalization, and the type of vector operation (inner product vs. distance). Now that you have 2 and 3, you can tease these two things apart.
interesting connection: people use SLERP (https://en.wikipedia.org/wiki/Slerp) to avoid passing through low-probability places when interpolating
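For reference, a minimal slerp sketch using the standard formula from that Wikipedia page (nothing specific to word vectors):

```python
import numpy as np

def slerp(p0, p1, t):
    """Spherical linear interpolation between vectors p0 and p1, with t in [0, 1]."""
    cos_omega = np.dot(p0, p1) / (np.linalg.norm(p0) * np.linalg.norm(p1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * p0 + t * p1  # (nearly) parallel: plain lerp is fine
    return (np.sin((1 - t) * omega) * p0 + np.sin(t * omega) * p1) / np.sin(omega)
```

Unlike straight linear interpolation, which cuts through the middle where vectors get short, slerp keeps the interpolated point's norm comparable to the endpoints' (for unit vectors it stays on the sphere), which seems to be the "avoid low-probability places" part.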
see also this thing that Travis pointed me to, with pretty neat adjustments for doing things like analogies, interpolations, etc. https://arxiv.org/pdf/1711.01970.pdf
sometimes normalizing (as cosine distance effectively does) can be intuitively justified, as in the case of word counts, where it's natural to think the ratios matter more than the raw counts (as in normalizing by document length)
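(Toy word-count example: a document and a doubled copy of it have identical ratios, so cosine treats them as the same, while raw Euclidean distance only sees the length difference.)

```python
import numpy as np

doc = np.array([3.0, 1.0, 0.0, 2.0])  # word counts for some document
doubled = 2 * doc                      # same text repeated twice

cosine = np.dot(doc, doubled) / (np.linalg.norm(doc) * np.linalg.norm(doubled))
print(cosine)                          # 1.0: the ratios are identical
print(np.linalg.norm(doc - doubled))   # ~3.74: distance sees only the extra length
```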
Hamming's The Art of Doing Science and Engineering, Chapter 9 on n-Dimensional space, page 118 in my copy, has a nice derivation/explanation of cosine distance which might be useful.
3Blue1Brown, "Dot products and duality" (Essence of linear algebra, Chapter 9): https://www.youtube.com/watch?v=LyGKycYT2v0&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=9
I think this might be the right explanation!
Some stuff still in here that would be good for talking about different distances/similarities...
just linking the post directly here that was referenced in b78cad8: https://planspace.org/20210904-dot_product_really_does_give_you_cosine/
Pearson correlation == centered cosine similarity
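Quick numerical check of that (arbitrary made-up vectors; np.corrcoef gives the Pearson correlation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = 0.3 * x + rng.standard_normal(100)

# Cosine similarity after subtracting each vector's mean ("centering")...
xc, yc = x - x.mean(), y - y.mean()
centered_cosine = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# ...matches the Pearson correlation coefficient.
print(centered_cosine)
print(np.corrcoef(x, y)[0, 1])
```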