Open ctb opened 1 year ago
note: the bound scales with SCALED - so for scaled=1000, 99.99% of genomes with all hashes the same will have fewer than 10*scaled (= 10,000) actual k-mers different!
theory:
the probability of seeing at least one hash in a run of M k-mers with a scaled of SCALED is:
1 - the probability of seeing exactly 0 hashes in a run of M k-mers with a scaled of SCALED,
which is 1 - Poisson(k=0, lambda=M / scaled)
which is 1 - e**(-M/scaled)
or in Python:
1 - math.exp(- (M/scaled))
which (per notebook ;) matches the distribution pretty darned well.
(the small deviations are either due to sampling statistics, or properties of murmurhash)
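a quick way to sanity check the formula without the notebook: the sketch below compares the Poisson expression against the exact value under an idealized model, plus a small seeded simulation. the assumption here (not from the notebook) is that each k-mer lands in the sketch independently with probability 1/scaled; the function names are made up for illustration, and no actual murmurhash is involved.

```python
import math
import random


def p_at_least_one_hash(M, scaled):
    """Poisson approximation from above: 1 - e**(-M/scaled)."""
    return 1 - math.exp(-M / scaled)


def p_at_least_one_hash_exact(M, scaled):
    """Exact value if each k-mer is kept independently with
    probability 1/scaled (idealized hash, an assumption)."""
    return 1 - (1 - 1 / scaled) ** M


def simulate(M, scaled, trials=20000, seed=1):
    """Fraction of simulated runs of M k-mers that contain >= 1 hash."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # any() stops at the first k-mer that gets selected
        if any(rng.random() < 1 / scaled for _ in range(M)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for M in (10, 100, 1000):
        print(M,
              p_at_least_one_hash(M, 100),
              p_at_least_one_hash_exact(M, 100),
              simulate(M, 100))
```

the three columns should agree closely for scaled=100; the gap between the Poisson and exact values shrinks further as scaled grows.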
for genome size 5000, need scaled=50 to guarantee that 99.9% of genomes with all hashes same => < 1% different
for genome size 50000, need scaled=500 to guarantee that 99.9% of genomes with all hashes same => < 1% different
for genome size 500000, need scaled=4996 to guarantee that 99.9% of genomes with all hashes same => < 1% different
for genome size 5e+06, need scaled=49951 to guarantee that 99.9% of genomes with all hashes same => < 1% different
or ...
for a given genome size N, need scaled = N/100 to guarantee that 99.9% of genomes with identical hashes are <1% different
Per: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/fracminhash-runs-simulate.ipynb
I think this is true (read thru notebook for logic) -
for two genomes of length N, and FracMinHash with scaled=100,
if the genomes have all hashes in common,
then the probability that they have more than M different k-mers is ^^^ table above ^^^.
this is independent of genome size.
i.e. for any two genomes with all hashes same / at a scaled of 100, 99.99% of genomes will have fewer than 1000 actual k-mers different.
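to make the "fewer than 1000 k-mers" number concrete, here is a minimal sketch that inverts e**(-M/scaled). it assumes the simple model where M differing k-mers go unnoticed exactly when none of them is picked as a hash, each with independent probability 1/scaled; the function names are hypothetical, not anything from sourmash or the notebook.

```python
import math


def p_hidden(M, scaled):
    """Chance that M differing k-mers contribute zero hashes,
    i.e. the sketches still look identical (Poisson approximation)."""
    return math.exp(-M / scaled)


def max_hidden_kmers(scaled, confidence):
    """M at which p_hidden(M, scaled) equals 1 - confidence.

    Solving e**(-M/scaled) = 1 - confidence gives
    M = scaled * ln(1 / (1 - confidence)).
    """
    return scaled * math.log(1 / (1 - confidence))


if __name__ == "__main__":
    for scaled in (100, 1000, 10000):
        print(scaled, max_hidden_kmers(scaled, 0.9999))
```

under this model the cutoff is scaled * ln(10000), a bit under 10*scaled, which fits the "fewer than 1000 at scaled=100" statement; it depends only on scaled and the chosen confidence, not on genome size.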