christophsk opened this issue 6 years ago
@christophsk I have the same result as you. When I set rmpc=0 (no principal component is removed), the similarity is 0.83.

I also tried some other examples:
- 'I like football' / 'I like soccer': 0.93 (without PC removal), -1 (with PC removal)
- 'This is a bicycle' / 'This is a bike': 0.91 (without PC removal), -1 (with PC removal)
I've run the original SIF code with no problems. I haven't had time to unwind that code yet; however, the `weighted_average_sim_rmpc()` method used in `eval.sim_evaluate_one()` shows the reference implementation. It seems slightly different, but until I dig into it, it's hard to tell. Evaluation is performed using the method in `sim_algo.py`.

I'm also working through the math of what's happening when `rmpc = 1` in this code. There is a reason for `rmpc = 0`, and I'd like to understand it.

It would be good to put together a set of test vectors so that different embeddings can be evaluated against GloVe. That shouldn't be difficult with the original SIF code.
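While I work through it, here's my reading of the paper's algorithm as a minimal numpy sketch. The names `sif_embeddings`, `word_vecs`, and `word_prob` are mine, not the repo's API, so treat this as a sketch under my assumptions rather than the reference implementation:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3, rmpc=1):
    # sentences: list of token lists; word_vecs: dict word -> vector;
    # word_prob: dict word -> unigram probability p(w).
    # Assumes every sentence has at least one word covered by both dicts.
    # Step 1: weighted average; a / (a + p(w)) down-weights frequent words.
    X = np.stack([
        np.mean([a / (a + word_prob[w]) * word_vecs[w]
                 for w in sent if w in word_vecs and w in word_prob], axis=0)
        for sent in sentences
    ])
    # Step 2 (rmpc > 0): subtract each embedding's projection onto the
    # first rmpc singular vectors, estimated over ALL sentence embeddings.
    if rmpc > 0:
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        pc = vt[:rmpc]                  # shape (rmpc, d)
        X = X - (X @ pc.T) @ pc
    return X
```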
Hi @christophsk, do you mean you ran the code at https://github.com/PrincetonML/SIF? What similarity score did you get for the example (sentences = ['this is an example sentence', 'this is another sentence that is slightly longer'])? Thanks
When you have too few data points, this can happen. In particular, for two data points, after removing their projections onto their first singular vector, they become vectors v and -v, i.e., they have cosine similarity -1.
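Here is a small numpy illustration of that geometry with synthetic vectors (this is just the math, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
common = rng.normal(size=300)                 # shared "common component"
x1 = common + 0.1 * rng.normal(size=300)      # two similar sentence vectors
x2 = common + 0.1 * rng.normal(size=300)
X = np.stack([x1, x2])

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(x1, x2))                            # close to 1 before removal

# First right singular vector of the 2 x d matrix = the removed PC.
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc = vt[0]
X_res = X - np.outer(X @ pc, pc)              # remove each row's projection

# Both residuals lie on the line of the SECOND singular vector, so they
# are scalar multiples of one another: the cosine can only be +1 or -1,
# and it is -1 whenever both rows project with the same sign onto pc
# (the typical case for two similar sentences).
print(cos(X_res[0], X_res[1]))                # approximately -1.0
```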
Hi, many blogs say the SIF method is the best unsupervised method. I trained my own word2vec model, but when I evaluate the similarity of two sentences, the result is always either 1 or -1. For the PCA part, I compute it on each pair of sentences separately. Is there any problem with that?
When you have too few data points, this can happen. In particular, for two data points, after removing their projections onto their first singular vector, they become vectors v and -v, i.e., they have cosine similarity -1.
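The fix is to estimate the component over many sentences at once rather than per pair. A toy sketch of the contrast (synthetic data, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for SIF weighted averages of 1,000 sentences (d = 300),
# all sharing a common component, as real sentence embeddings tend to.
X = rng.normal(size=(1000, 300)) + 3.0 * rng.normal(size=300)

def remove_pc(mat):
    # Subtract each row's projection onto the first singular vector.
    _, _, vt = np.linalg.svd(mat, full_matrices=False)
    return mat - np.outer(mat @ vt[0], vt[0])

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Too few data points: the PC is estimated from just the two sentences
# being compared, so the residuals are forced to be (anti)parallel.
pair = remove_pc(X[:2])
print(cos(pair[0], pair[1]))      # exactly +1 or -1 (typically -1 here)

# PC estimated once over the whole set, then pairs compared.
X_all = remove_pc(X)
print(cos(X_all[0], X_all[1]))    # a meaningful value strictly inside (-1, 1)
```

As far as I can tell, the original code estimates the component over all sentence embeddings in the dataset at once.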
Hi, the funny thing is that I computed the PCA on the whole corpus and got the word embeddings with that component removed, but my results are still almost always 1 or -1. So how many data points is a reasonable number to compute the PCA over?
I'm unsure how to interpret these results. I am using `glove.6B.300d.txt` for the vectors (it's faster); no other changes to the code. The larger 840B vectors don't alter anything here. It's curious that removing one principal component gives a score of exactly -1.0. Maybe I missed something in my reading of your paper.
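In case anyone wants to reproduce this, here is a small loader for the plain-text GloVe format (one token followed by its floats per line) that produces the dict-style vectors used in the sketches above; `load_glove` is a made-up helper, not part of either repo:

```python
import numpy as np

def load_glove(path, dim=300):
    # One token and its `dim` floats per line, space-separated. Taking the
    # last `dim` fields as the vector also handles the rare multi-word
    # tokens in the 840B file.
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])
            vecs[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vecs

# word_vecs = load_glove("glove.6B.300d.txt")
```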