gakhov / pdsa

Probabilistic Data Structures and Algorithms in Python
http://pdsa.readthedocs.io
MIT License
121 stars 19 forks source link

questions on bottom-sketch #30

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Dear Anndrii,

I am writing to ask about the actually implementation of MinHash, there are 3 different implementations, k-min sketch, bottom-k sketch, which is:

a ‘bottom sketch’ strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm, Instead of using m hash functions, all elements from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch S_b (A) of size m. The probability that sketch S_b(A), S_b(B) share a hash value represents the probability of random sampling a shared element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between S_b(A) and S_b(B). The idea is mentioned here: http://www.cohenwang.org/edith/Surveys/minhash.pdf

This idea will increase the variance of the estimator right because only one hash function is used. More hash functions used, it will be more accurate. What to you think.

Thanks,

Jianshu