runlength statistics - Githubissues

ctb commented 1 year ago

Per: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/fracminhash-runs-simulate.ipynb

in M=100 k-mers, p of finding at least one hash is: 63.43% - scaled=100
in M=200 k-mers, p of finding at least one hash is: 86.51% - scaled=100
in M=300 k-mers, p of finding at least one hash is: 95.11% - scaled=100
in M=400 k-mers, p of finding at least one hash is: 98.18% - scaled=100
in M=500 k-mers, p of finding at least one hash is: 99.32% - scaled=100
in M=600 k-mers, p of finding at least one hash is: 99.72% - scaled=100
in M=700 k-mers, p of finding at least one hash is: 99.90% - scaled=100
in M=800 k-mers, p of finding at least one hash is: 99.96% - scaled=100
in M=900 k-mers, p of finding at least one hash is: 99.98% - scaled=100
in M=1000 k-mers, p of finding at least one hash is: 99.99% - scaled=100
in M=1100 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1200 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1300 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1400 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1500 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1600 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1700 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1800 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=1900 k-mers, p of finding at least one hash is: 100.00% - scaled=100
in M=2000 k-mers, p of finding at least one hash is: 100.00% - scaled=100

I think this is true (read thru notebook for logic) -

for two genomes of length N, and FracMinHash with scaled=100,

if the genomes have all hashes in common,

then the probability that they have more than M different k-mers is ^^^ table above ^^^.

this is independent of genomes size.

i.e. for any two genomes with all hashes same / at a scaled of 100, 99.99% of genomes will have fewer than 1000 actual k-mers different.

ctb commented 1 year ago

note: it scales with SCALED - so for scaled=1000, 99.99% of genomes will have fewer than 10*scaled actual k-mers different!

ctb commented 1 year ago

theory:

the probability of seeing exactly 0 hashes in a run of M k-mers with a scaled of SCALED is:

1 - the probability of seeing 0 hashes in a run of M k-mers with a scaled of SCALED,

which is 1 - Poisson(k=0, lambda=M / scaled)

which is 1 - e**(-M/scaled)

or in Python:

1 - math.exp(- (M/scaled))

which (per notebook ;) matches the distribution pretty darned well.

(the small deviations are either due to sampling statistics, or properties of murmurhash)

ctb commented 1 year ago

for genome size 5000, need scaled=50 to guarantee that 99.9% of genomes with all hashes same => < 1% different for genome size 50000, need scaled=500 to guarantee that 99.9% of genomes with all hashes same => < 1% different for genome size 500000, need scaled=4996 to guarantee that 99.9% of genomes with all hashes same => < 1% different for genome size 5e+06, need scaled=49951 to guarantee that 99.9% of genomes with all hashes same => < 1% different

or ...

for a given genome size N, need scaled = N/100 to guarantee that 99.9% of genomes with identical hashes are <1% different

ctb / 2022-sourmash-sens-spec

runlength statistics #7