bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
MIT License
159 stars 10 forks source link

skani versus bindash #15

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Hello Jim,

In the main paper you compare skani with Mash, but not bindash, another important breakthrough of MinHash based on B-bit One Permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html), which is about 1000 time faster than Mash, both sketch and dist. This is due to the theoretical breakthrough of MinHash, from time complexity O(d*k) to O(d+k) using 2-universal hashing, where d is the nozero element in set while k is number of MinHashes, in practice, k is like 10^4 or something to have 99.9% ANI accuracy, while d is often, 3x10^6 or so, making k negligible. I understand that MinHash is somehow biased by set size, but in terms of speed, bindash should be the new standard for benchmark instead of the old k-bottom sketch behind Mash, which is 20 years ago.

Thanks,

Jianshu

bluenote-1577 commented 1 year ago

I quickly tested BinDash https://github.com/zhaoxiaofei/bindash on my server.

The sketch command with 20 threads on ~4000 complete refseq genomes took 27 seconds. skani tooks 5 seconds with the same setup.

/usr/bin/time skani sketch -t 20 * User time (seconds): 53.07 System time (seconds): 5.40 Percent of CPU this job got: 1893% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08 Average shared text size (kbytes): 0

/usr/bin/time -v ./bindash sketch --nthreads=20 User time (seconds): 488.31 System time (seconds): 1.53 Percent of CPU this job got: 1803% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:27.15

The timings of dist/triangle mash in the paper are not due to actual computation time, it is mainly due to file io in Mash.

jianshu93 commented 1 year ago

Hello Jim,

I am not talking about sketch, but distance computation, this is where bindash is much faster, for sketching, dashing will be fastest instead of bindash due to SIMD. I believe no matter what MinHash, FracMinhash or its variants, none can be faster than MinHash optimal densification idea in theory. All versus all distance computation, bindash can finish in an hour for 350,000 genomes, using sketch size 12,000, while others took at least several days. If skani does not skip comparisons below 80% ANI (I saw that many comparisons are empty/skipped for large database), where a majority of the comparison came from for large genomic database like all NCBI/RefSeq genomes, I doubt it can be faster than BinDash and even dashing (much slower actually). FastANI does not skip pairs but compute all despite less accurate when ANI is smaller than 78% (this is many pairs, more than 80% of the pairs if you plot a number of ANI pairs versus ANI value density plot for all refseq genomes). Can you please do not skip the comparisons below 80% but compute them, and let users know that they are not reliable, this is how fastANI did and all Mash and bindash is evaluated. Then the speed benchmark is fair, otherwise it is not fair saying that skani is much faster than Mash and fasANI.

Did any reviewer point the bindash paper to you that for high than 85% ANI, Mash/bindash, is as accurate as FastANI (see fastANI figure), for the identity (not alignment ratio), and alignment ratio above 85% ANI is not so important because generally they share enough genomic fragments to compare.

Thanks,

Jianshu

bluenote-1577 commented 1 year ago

Dear Jianshu,

  1. We never claim in the paper that skani is faster than Mash for large scale all to all comparison. See the supplementary figure which shows mash is faster than skani on all to all comparison for Alistipes ihumii. Skani is faster than Mash on indexing, this is not a lie, but Mash is very clearly faster than skani in figure 2. We literally say in the Readme skani is slower than Mash.

  2. We do not want to output unreliable ANIs to the user and change our methodology. We did not output FastAni times in figure 2 except for e coli runtime out of fairness to Fastani. We mention this in the caption.

  3. For high ani mash is not as good as other methods for MAGS. The whole point is that incomplete MAGs make mash unreliable. The point of skani is that it is more robust for incompleteness.

jianshu93 commented 1 year ago

Hello Jim,

Thanks for the explanation on Mash. Still I think bindash should be compared instead of Mash, those 2 have nearly identical results while bindash is much faster. I think this claim in abstract "skani is more accurate than FastANI for comparing incomplete, fragmented MAGs while also being > 20 times faster", makes me feel that no matter what the database is, it is more accurate than and faster than fastANI, this is only true for E.coli genomes or highly similar > 85% identity genome database. No matter what the completeness is below 85% to 75% ANI, it will not be more accurate than fastANI (a simple linear correction will solve fastANI not close to ANIu for 75% to 85% in the figure I mentioned here https://github.com/bluenote-1577/skani/issues/13, which is also part of skani), but only faster. In the case of triangle, skani is much faster than Mash in figure 2 b (f), is strange because all smaller than 80% ANI pairs are not computed for skani, but Mash computed all and bindash for the same step is 1000 times faster than Mash, also computed all pairs. I have no doubt on the fact that for less complete genomes, skani is better. Just several speed comparisons and claims are too general in abstract, without specifying what genomic identity range, or whether all pairs are computed.

Overall, I love it and for highly similar genomes comparisons and search, I use it very often since first release. Memory is a big problem for fastANI, which skani do not have, I feel this is more important than speed for all versus all or triangle, for database that are more distantly related.

Many thanks for all those detailed responses.

Jianshu

bluenote-1577 commented 1 year ago

Hi Jianshu,

I think this claim in abstract "skani is more accurate than FastANI for comparing incomplete, fragmented MAGs while also being > 20 times faster", makes me feel that no matter what the database is, it is more accurate than and faster than fastANI, this is only true for E.coli genomes or highly similar > 85% identity genome database.

  1. We show in the Supplementary Figures that skani has less deviation from ANIm compared to FastANI on four additional data sets of MAGs. We feel like our wording is fair; we never claim skani is more accurate on whole genomes. We show in Fig 2 that it is not necessarily more accurate for whole genomes. We disagree that the wording "... incomplete, fragmented MAGs" makes it seem that skani is more accurate "no matter what the database is...".
  2. We also show that skani triangle is faster than FastANI's matrix mode by > 45x in the supplementary materials. We feel that for the simple and obvious test case of comparing two genomes, our timing results are not misleading. For any computational method, one could quantify every edge-case for which their claims are not true, but we feel like we have been fair in our case, although we understand that differing views on what constitutes an "accurate" benchmarking statement exist.

No matter what the completeness is below 85% to 75% ANI, it will not be more accurate than fastANI (a simple linear correction will solve fastANI not close to ANIu for 75% to 85% in the figure I mentioned here https://github.com/bluenote-1577/skani/issues/13, which is also part of skani), but only faster.

  1. We claim only good results for down to 82% ANI, see the README. See point 2. above for our feelings about quantifying edge-cases in benchmarks.
  2. FastANI does not implement such a linear correction, even if it is possible.

In the case of triangle, skani is much faster than Mash in figure 2 b (f), is strange because all smaller than 80% ANI pairs are not computed for skani, but Mash computed all and bindash for the same step is 1000 times faster than Mash, also computed all pairs.

skani uses a FracMinHash filter to filter out genomes with less than 80% ANI. This filter requires computing k-mer interesctions, so we compute k-mer set intersections for all genomes. This means skani is also running a subroutine similar to Mash or Sourmash in the filter on all-to-all genomes, meaning our timings are perfectly fair.

The longer runtime has nothing to do with skipping comparisons and it has to do with the fact that mash by default loads genomes into memory only as needed, thus mash is limited by file i/o. You can see this in the memory plot in Figure 2, notice how mash doesnt load all genomes into memory at once.

Overall, I love it and for highly similar genomes comparisons and search, I use it very often since first release. Memory is a big problem for fastANI, which skani do not have, I feel this is more important than speed for all versus all or triangle, for database that are more distantly related.

Thanks for the positive notes on our method.