ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.59k stars 296 forks source link

Minhash LSH Forest Top-K results differs from MinHash LSH #42

Closed pandasMX closed 6 years ago

pandasMX commented 6 years ago

Hi, after I switch from LSH to LSHforest. Although I set the same num_perm for them, I found the results were still quite different. For example, the top-40 I got from LSHforest is totally different from the sorted top 40 I got from LSH (e.g. I got a list of results from one LSH query, and then compute the weighted jaccard similarity between the query item and each of the result. After that i sort the results ordered from most similar to least similar and take the top 40). Could you check the performance for the LSHforest on weighted minhash? Besides that, if I use MinhashLSH instead, can I control the number of result returned for each query? Because our dataset is very large, the number of result is also very large but we only need the top few similar items. Calculating the distance between the query item and each of the returned result is quite a big cost.

ekzhu commented 6 years ago

On the different top-k query results: this is expected. Because the methods are all approximate rather than exact, results may not be 100% accurate. The common work flow of LSH Forest for top-k is to first retrieve "top-r" results from the index, sort the r number of results by exact similarity computation (or using MinHash signatures), and then output the top-k. r is bigger than k (e.g. 10x bigger).

For MinHashLSH, you cannot directly control the number of results returned. When your result sizes are large you may want to use a higher threshold (e.g. 0.9). Have multiple LSH indexes optimized for a series of descending thresholds can help too.

pandasMX commented 6 years ago

Thanks for your suggestions! You mentioned "Have multiple LSH indexes optimized for a series of descending thresholds can help too". Can you give me some more guidance on how can I achieve this? Thanks!

ekzhu commented 6 years ago

Nothing fancy. What I meant is just some version of the following:

lshes = {0.9 : MinHashLSH(threshold=0.9),
              0.8: MinHashLSH(threshold=0.8),
              0.7: MinHashLSH(threshold=0.7),
              0.6: MinHashLSH(threshold=0.6),
              0.5: MinHashLSH(threshold=0.5),}

So your query processing can start with the highest threshold, try the lshes in descending order of thresholds, until you get enough query results.

pandasMX commented 6 years ago

OK, I understand. Thanks!