bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
MIT License

Question on FracMinHash comparison with Minimizer+MinHash #27

Closed jianshu93 closed 4 months ago

jianshu93 commented 6 months ago

Hi Jim,

I need to compare FracMinHash with Minimizer + MinHash in terms of Jaccard estimation accuracy. The key parameter in both is the sketch size. FracMinHash works in the original sequence (k-mer) space, while Minimizer + MinHash works in minimizer space, which is much smaller: for random minimizers the sampling density is only about 2/(w+1), where w is the minimizer window size, so a 3000 bp fragment, for example, yields only a few hundred minimizer k-mers. As a result, the MinHash sketch size on top of the minimizers can be only a few hundred (200-500 in practice in fastANI).

For FracMinHash there is no prior sampling step, so with the same sketch size of 200 there is no way it can be as accurate as Minimizer + MinHash. But if we use a larger sketch size, say 1000+, in the original sequence space, the comparison is no longer fair, because the minimizer step is itself a kind of sketching step, like MinHash, that extracts the minimum hash value in each window. That is, we would be comparing two stacked sketching algorithms against a single one.

It is very difficult to determine theoretically what the equivalent sketch size of Minimizer + MinHash as a whole is, so that the same size can be used for FracMinHash. I think we first need to prove that the equivalent sketch size of Minimizer + MinHash is bounded within some range; then we can use a sketch size in that range for FracMinHash. Do you have any ideas on this topic? Or is there already a theoretical analysis showing that one is better than the other at the same sketch size?
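The minimizer sampling density mentioned above can be checked empirically. The sketch below is a minimal illustration, not skani's or fastANI's implementation: `kmer_hash`, `minimizer_positions` and `minimizer_density` are hypothetical helper names, blake2b is just a stand-in hash, w is counted in k-mers, and canonical (reverse-complement) k-mers are ignored. For random minimizers the commonly cited expected density is about 2/(w+1).

```python
import hashlib
import random


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (blake2b is only a stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_positions(seq: str, k: int, w: int) -> set:
    """Positions of the k-mers selected as minimizers.

    Each window covers w consecutive k-mers; the leftmost k-mer with the
    smallest hash in the window is selected.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        picked.add(start + window.index(min(window)))
    return picked


def minimizer_density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions that are selected as minimizers."""
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0
    return len(minimizer_positions(seq, k, w)) / n_kmers


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    for w in (5, 10, 24):
        print(w, round(minimizer_density(seq, k=16, w=w), 4), round(2 / (w + 1), 4))
```

On a random sequence the measured density should land close to 2/(w+1); real genomes with repeats or low-complexity regions can deviate from it.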

Thank you,

Jianshu

bluenote-1577 commented 4 months ago

Hi Jianshu,

I believe you're asking how exactly to compare minimizers vs. FracMinHash? I think just setting both to the same density should be okay... but they do have different properties. Interesting discussion... I don't have any additional thoughts.
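One way to try the "same density" suggestion is sketched below. This is an illustrative experiment only, under several assumptions: the FracMinHash sampling fraction is set to 2/(w+1) to match the expected density of random minimizers, all function names (`fracminhash_sketch`, `minimizer_sketch`, `bottom_s_jaccard`, `compare`) are hypothetical, blake2b is a stand-in hash, canonical k-mers are ignored, and the test data is a random sequence with substitutions only.

```python
import hashlib
import random

MAX_HASH = 2 ** 64


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (blake2b is only a stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmer_hashes(seq: str, k: int) -> list:
    return [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def fracminhash_sketch(seq: str, k: int, fraction: float) -> set:
    """Keep every k-mer hash below fraction * 2^64."""
    threshold = int(fraction * MAX_HASH)
    return {h for h in kmer_hashes(seq, k) if h < threshold}


def minimizer_sketch(seq: str, k: int, w: int) -> set:
    """Set of minimizer hashes, one minimum per window of w k-mers."""
    hashes = kmer_hashes(seq, k)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bottom_s_jaccard(a: set, b: set, s: int) -> float:
    """Bottom-s MinHash Jaccard estimate computed from two hash sets."""
    sa, sb = set(sorted(a)[:s]), set(sorted(b)[:s])
    merged = sorted(sa | sb)[:s]
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in sa and h in sb) / len(merged)


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Apply random substitutions at the given per-base rate."""
    return "".join(
        rng.choice("ACGT".replace(c, "")) if rng.random() < rate else c
        for c in seq
    )


def compare(seq_a: str, seq_b: str, k: int = 16, w: int = 10, s: int = 300) -> dict:
    fraction = 2 / (w + 1)  # assumption: match the random-minimizer density
    min_a, min_b = minimizer_sketch(seq_a, k, w), minimizer_sketch(seq_b, k, w)
    return {
        "true": jaccard(set(kmer_hashes(seq_a, k)), set(kmer_hashes(seq_b, k))),
        "fracminhash": jaccard(
            fracminhash_sketch(seq_a, k, fraction),
            fracminhash_sketch(seq_b, k, fraction),
        ),
        "minimizer": jaccard(min_a, min_b),
        "minimizer_minhash": bottom_s_jaccard(min_a, min_b, s),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    a = "".join(rng.choice("ACGT") for _ in range(20000))
    b = mutate(a, 0.01, rng)
    for name, value in compare(a, b).items():
        print(f"{name}: {value:.4f}")
```

The "different properties" show up here: FracMinHash selects each k-mer independently of its neighbours, whereas whether a k-mer becomes a minimizer depends on the other k-mers in its window, so the two estimates need not agree even at equal density.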

Thanks,

Jim