epfml / sent2vec

General purpose unsupervised sentence representations
Other
1.19k stars 256 forks source link

Fix for issue: Weird results with nnSent (#10) #13

Closed annikaberger closed 7 years ago

annikaberger commented 7 years ago

By default the compare operator for the priority queue with pairs first compares the first elements of the pair. If those elements are the same, the second elements of the pair are compared. In your implementation this doesn't make sense, because when predicting nearest neighbor sentences for a query out of the corpora you would like to only rely on the similarity score and not on the real sentences. It's very often the case that the similarity score is the same for the dot product between query and corpora sentences. But that's ok. Only all of them need to be returned.

Thus, I suggest to use a custom comparator which only compares the first elements of the pair, i.e. the similarity scores.

sidak commented 7 years ago

In the example that you shared, don't you think it will still produce an incorrect order on your machine after this change? (For instance, as 0.18209 is not equal to 0.60233, but it is still ranked above it)

0.267163 995202 
0.182761 989931 
0.18209 989938 
0.60233 28851996 
0.581494 19254999 
0.577039 19257290 
0.577039 19134638 
0.577039 19104486 
0.575327 13409779 
0.567145 19095098

Another remark: I am not sure, but I guess the float comparison may be a possible issue.