PrincetonML / SIF_mini_demo

A minimal example of sentence embedding using the Smooth Inverse Frequency (SIF) weighting scheme
MIT License

Negative similarity with rmpc = 1 #1

Open christophsk opened 6 years ago

christophsk commented 6 years ago

I'm unsure how to interpret these results. I'm using glove.6B.300d.txt for the vectors (it's faster); the larger 840B vectors don't change anything here, and I made no other changes to the code. It's curious that removing one principal component gives a score of exactly -1.0. Maybe I missed something in my reading of your paper.


```
principal components to remove: 0
SIF embedding inner product: 3.6800531047
SIF embedding norms: 2.08438143917, 2.10295656208
score: 0.839550039467

principal components to remove: 1
SIF embedding inner product: -0.351619142234
SIF embedding norms: 0.59611628534, 0.589849918348
score: -1.0
```
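
For context, here is a minimal sketch of what I understand the pipeline to be doing (the function and variable names are mine, not the demo's exact API): SIF-weighted averaging, optional removal of the first principal component via TruncatedSVD, then cosine similarity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embedding(sentences, word_vecs, word_freq, a=1e-3, rmpc=1):
    """SIF-weighted average embeddings, optionally with the first
    rmpc principal components removed."""
    emb = []
    for sent in sentences:
        words = [w for w in sent.lower().split() if w in word_vecs]
        # SIF weight a / (a + p(w)) down-weights frequent words
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        vecs = np.array([word_vecs[w] for w in words])
        emb.append(weights.dot(vecs) / len(words))
    emb = np.array(emb)
    if rmpc > 0:
        svd = TruncatedSVD(n_components=rmpc, n_iter=7, random_state=0)
        svd.fit(emb)
        pc = svd.components_                  # shape (rmpc, dim)
        emb = emb - emb.dot(pc.T).dot(pc)     # subtract projection onto the PCs
    return emb

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# sent_emb = sif_embedding(['this is an example sentence',
#                           'this is another sentence that is slightly longer'],
#                          word_vecs, word_freq, rmpc=1)
# print(cosine(sent_emb[0], sent_emb[1]))
```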
NiceMartin commented 6 years ago

@christophsk I have the same result as you. When I set rmpc=0 (no principal component is removed), the similarity is 0.83

And I tried some other examples:

- 'I like football' / 'I like soccer': 0.93 (without PC), -1 (with PC)
- 'This is a bicycle' / 'This is a bike': 0.91 (without PC), -1 (with PC)

christophsk commented 6 years ago

I've run the original SIF code with no problems. I haven't had time to unwind that code yet; however, the weighted_average_sim_rmpc() method used in eval.sim_evaluate_one() is the reference implementation. It seems slightly different from this demo, but until I dig into it, it's hard to tell. Evaluation there is performed using the method in sim_algo.py.

I'm also working through the math of what's happening when rmpc = 1 in this code. There is presumably a reason the result differs so much from rmpc = 0, and I'd like to understand it.

It would be good to put together a set of test vectors so that different embeddings can be evaluated against GloVe. That shouldn't be difficult with the original SIF code.

envibus commented 6 years ago

Hi @christophsk, do you mean you ran the code at https://github.com/PrincetonML/SIF? What similarity score did you get for the example sentences (sentences = ['this is an example sentence', 'this is another sentence that is slightly longer'])? Thanks

YingyuLiang commented 6 years ago

When you have too few data points, this can happen.

In particular, for two data points, after removing their projections onto their first singular vector, they become vectors v and -v, i.e., they have cosine similarity -1.
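
Here is a quick numerical illustration with plain numpy (the two rows stand in for two SIF sentence embeddings; the exact numbers are arbitrary):

```python
import numpy as np

# Two "sentence embeddings" pointing in roughly the same direction.
x = np.array([[1.0, 0.9, 1.1],
              [1.0, 1.1, 0.8]])

# First right singular vector of the (uncentered) 2 x 3 matrix.
_, _, vt = np.linalg.svd(x, full_matrices=False)
pc = vt[0]

# Remove each row's projection onto that singular vector.
resid = x - np.outer(x.dot(pc), pc)

cos = resid[0].dot(resid[1]) / (np.linalg.norm(resid[0]) * np.linalg.norm(resid[1]))
print(cos)  # ~ -1.0: the residuals are v and -v
```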

Cumberbatch08 commented 5 years ago

Hi, many blogs say the SIF method is the best unsupervised method. I trained my own word2vec model, but when I evaluate the similarity of two sentences, the result is either 1 or -1. For the PCA part, I compute it separately for each pair of two sentences. Is there any problem with that?

Cumberbatch08 commented 5 years ago

> When you have too few data points, this can happen.
>
> In particular, for two data points, after removing their projections onto their first singular vector, they become vectors v and -v, i.e., they have cosine similarity -1.

Hi, the odd thing is that I calculate the PCA on the whole corpus and subtract that component from the embeddings, but my results are still almost always 1 or -1. So how many sentences should the PCA be calculated on to get a reasonable estimate?
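
For concreteness, this is roughly what I'm doing (the names here are my own, not from the repo): fit the first principal direction once on the SIF embeddings of the whole corpus, then subtract that same component from any pair before taking cosine similarity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def fit_first_pc(corpus_emb):
    """Estimate the first principal direction from the SIF embeddings
    of the whole corpus (more sentences give a more stable estimate)."""
    svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
    svd.fit(corpus_emb)
    return svd.components_[0]                  # shape (dim,)

def remove_pc(emb, pc):
    """Subtract each row's projection onto the corpus-level component."""
    return emb - np.outer(emb.dot(pc), pc)

# corpus_emb: (n_sentences, dim) SIF embeddings for the whole corpus
# pc = fit_first_pc(corpus_emb)
# pair = remove_pc(corpus_emb[[i, j]], pc)    # any pair, corrected with the same pc
```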