bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
MIT License

Skani accuracy below 85% ANI #13

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Hello Jim,

I ran an all-versus-all test on a collection of genomes we often use to test ANI calculators. It is not large: 300 genomes in total, spanning several genera in the phylum Actinobacteria from the NCBI bacterial genome database, so it covers 75% to 100% ANI fairly evenly. Below is what I saw for (1) FastANI and (2) skani, each compared against orthoANI_usearch.

Above 85% ANI, skani is pretty good and correlates well with orthoANI, similar to FastANI. Below 85%, however, skani's variation increases significantly, while FastANI stays good down to 80%; below that its variation also increases, but it is still good enough to be trusted until around 76%. I attached the figures below.

Above 85% ANI, Mash actually works pretty well according to the FastANI paper, and it is much faster than both FastANI and skani, although it reports no alignment fraction. I understand that minimizers with a large sliding window will be problematic, but for now, with s=24, it is still good, with no significant problem in practice.

I am wondering what could come next to further improve FastANI/skani. FastANI's limiting step is finding homology via minimizers, while skani does this with the new seeding and chaining algorithm. I assume this is also the limiting step in skani, because MinHash- and FracMinHash-based distance/identity estimation will be extremely fast, especially considering recent cutting-edge MinHash algorithms such as b-bit One Permutation MinHash with optimal densification (https://academic.oup.com/bioinformatics/article/35/4/671/5058094).

[Figures: ANIu_vs_FastANI, aniu_vs_skani]

Thanks,

Jianshu

bluenote-1577 commented 1 year ago

Hi Jianshu,

Thanks for the plot. I think this confirms my general feelings about skani vs FastANI on reference genomes: FastANI has a bit less variance (but note that on your plot it seems to be slightly upward biased at 75%). I am not planning on focusing on low ANI (below 85%) for skani, so I think this is fine. If users want a more sensitive homology search, I'm fine with them using a BLAST-based method instead.

In my opinion, there is a use for both pure sketching methods (MinHash/FracMinHash) and "hybrid" alignment/sketching methods (FastANI/skani). The former are fast but, as we showed in our paper, handle incompleteness in MAGs more poorly than the hybrid alignment methods. I think they complement each other.
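As a rough illustration of the incompleteness point, here is a toy sketch in Python. It assumes the standard Mash relation between Jaccard index and distance, D = -(1/k) ln(2j / (1 + j)), and uses integer sets as stand-ins for hashed k-mers. The names `mash_ani`, `complete`, and `partial_mag` are made up for the example, and restricting the comparison to the shared region is only a crude stand-in for what an alignment-based method does, not how skani or FastANI actually work.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two k-mer sets."""
    return len(a & b) / len(a | b)


def mash_ani(j: float, k: int = 21) -> float:
    """ANI implied by the Mash distance D = -(1/k) * ln(2j / (1 + j))."""
    if j <= 0.0:
        return 0.0  # no shared k-mers: treat as maximally distant
    return 1.0 + math.log(2.0 * j / (1.0 + j)) / k


# Toy data (assumption): integers stand in for hashed k-mers.
complete = set(range(10_000))      # a complete reference genome
partial_mag = set(range(5_000))    # same sequence, but only 50% recovered

# Pure sketching view: compare the whole k-mer sets.
whole_genome = mash_ani(jaccard(complete, partial_mag))

# Crude "aligned region only" view: compare only what both genomes cover.
shared_region = mash_ani(jaccard(complete & partial_mag, partial_mag))

print(f"{whole_genome:.4f} {shared_region:.4f}")  # prints 0.9807 1.0000
```

Even though the two toy genomes are identical wherever they overlap, the whole-set Jaccard estimate drops to about 98% ANI purely because half of one genome is missing, while the estimate restricted to the shared region stays at 100%.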

I'll close this issue for now though, since there are no specific action items.

Jim