ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License

Similarity matrix #47

Closed by Navein 6 years ago

Navein commented 6 years ago

Hi, how can I generate a similarity matrix using MinHash LSH? MinHash seems to compute only the Jaccard similarity between two sets, while MinHash LSH returns a list of candidates that meet the similarity threshold that was set. I would like to use the similarity matrix for further clustering, and would like to know whether this is possible with this package.

ekzhu commented 6 years ago

Hi Navein. By similarity matrix, do you mean the pairwise Jaccard similarity score for every pair of sets? If your goal is to have the exact similarity scores, then this package cannot help you.

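If you are okay with approximate Jaccard similarity scores, then you can create a MinHash for each set and compute the score for every pair using the MinHashes. Comparing two MinHashes takes time proportional to the number of hash values, regardless of how large the original sets are, so this should be faster than computing the exact scores if the sets are mostly larger than the number of hash values used in MinHash.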
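The all-pairs approach described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration of the technique, not datasketch's implementation: the helper names (`minhash_signature`, `estimate_jaccard`, `similarity_matrix`), the choice of seeded BLAKE2b as the hash family, and the value of `NUM_PERM` are all assumptions made for the example. It also assumes every input set is non-empty.

```python
import hashlib
from itertools import combinations

NUM_PERM = 128  # number of hash values per signature (assumed value)


def _seeded_hash(seed, token):
    """Return a 64-bit hash of `token` under the hash function picked by `seed`."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=NUM_PERM):
    """Build a MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def similarity_matrix(sets, num_perm=NUM_PERM):
    """Return a symmetric matrix of approximate Jaccard scores for all pairs."""
    signatures = [minhash_signature(s, num_perm) for s in sets]
    n = len(signatures)
    matrix = [[1.0] * n for _ in range(n)]  # each set has similarity 1 with itself
    for i, j in combinations(range(n), 2):
        score = estimate_jaccard(signatures[i], signatures[j])
        matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical example data: three small sets of string tokens.
example_sets = [
    {"apple", "banana", "cherry"},
    {"apple", "banana", "cherry"},
    {"kiwi", "lemon", "mango"},
]
example_matrix = similarity_matrix(example_sets)
```

With datasketch itself, the corresponding steps would be to create a `MinHash(num_perm=...)` per set, call `update()` with the encoded bytes of each element, and call `jaccard()` on each pair of MinHash objects. The resulting matrix still needs one comparison per pair, so the number of comparisons grows quadratically with the number of sets.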