KoslickiLab / mutation-rate-ci-calculator

This software calculates a confidence interval for the mutation rate from a set of observed containment indices under a simple nucleotide mutation process.
Other
7 stars 1 forks source link

is FracMinHash a LSH? #13

Closed jianshu93 closed 11 months ago

jianshu93 commented 11 months ago

Dear FracMinHash team,

My understanding of FracMinHash is that a much smaller number of hash functions can be used for estimating containment Jaccard and then Jaccard index (after obtaining the cardinality), especially set size are very different, e.g. a genome against a metagenome. However, the most important property of MinHash, is that the probability of having hash collisions is some monotonic function of some distance index, e.g., Jaccard index, so that it can be use for nearest neighbor search. In the case of FracMinHash, does the probability of having collions also a monotonic function for containment index, or more broadly, Jaccard index. That is the whether locality sensitive hashing property is preserved, compare to MinHash.

Thanks,

Jianshu

mahmudhera commented 11 months ago

Hi Jianshu, thanks for your interest! Off the top of my head: I think yes, (the unbiased version of) FMH preserves the locality-sensitive hashing property for the Jaccard, as well as the containment indices. I think this way because we have theoretical proofs that the expected value of these indices is the true value, which is also the case for MinHash.

If you think that is counter-intuitive, or if you have some line of thought to suggest otherwise, please let us know! Will be happy to chat 👍

jianshu93 commented 11 months ago

Hello Mahmudhera,

In the case of Jaccard, I assume the cardinality is obtained via some other probabilistic data structures, e.g., hyperloglog, but not exact methods, which means there are some error there (this error is larger than MinHash error, but not exist if using exact counting method). In the case of LSH, Jaccard is metric, so Minhash (A,B) = MinHash(B, A), this must be true for any locality hashing preserved schemes, otherwise we cannot use if for nearest neighbor search. The expectation of bottom-K implementation is also true Jaccard, but only one hash functions is used, so bottom-K implementation is not preserving locality sensitive hashing. This is why I asked the question.

Thanks,

Jianshu

mahmudhera commented 11 months ago

I am not familiar with bottom-K not preserving LSH (if you could point me to the paper, that would be great!). If that is the case, then perhaps FMH may not preserve LSH, theoretically speaking, as FMH takes the bottom "some percentage" of the hashes.

The cardinalities can be computed using HLL as an implementation. The theoretical definition of FMH, however, does not assume the use of HLL or some approximate counting. If the theoretical definition of FMH does assume HLL, then you are correct, the error has to be accounted for in any subsequent analyses.

jianshu93 commented 10 months ago

Hello,

See this paper for why the bottom-s MinHash is not LSH and thus scaled MinHash (https://ieeexplore.ieee.org/abstract/document/8104099).

Thanks,

Jianshu