Question: How does skani handle ambiguous ("N") bases?

bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.

MIT License

166 stars 10 forks source link

Question: How does skani handle ambiguous ("N") bases? #26

Closed nrminor closed 7 months ago

nrminor commented 8 months ago

Hello,

I'm working on a tool that computes a pairwise distance matrix of many SARS-CoV-2 consensus sequences. However, these sequences almost always contain at least a few non-ATGC ambiguous bases that count toward edit distances like Levenshtein. Would these bases also contribute to a lower ANI in skani triangle? Ultimately, I need my tool to ignore characters other than A, T, G, C, and -, so I figured I'd ask how skani handles this before I try to reinvent the wheel.

Thanks for your help! --Nick

bluenote-1577 commented 8 months ago

Hi @nrminor,

skani ignores N bases, but the algorithm itself is somewhat involved, so how N bases affect ANI is a bit weird. It could lower or increase the ANI.

As long as you don't have too many Ns, it should be okay. See https://github.com/bluenote-1577/skani/issues/16 for details.

For non A,C,G,T,N characters, skani just turns them into A's. This may lower the ANI a bit.

BTW: see https://github.com/bluenote-1577/skani/wiki/skani-cookbook#all-to-all-comparison-with-lots-of-small-contigs-eg-viruses-plasmids for parameters for small genomes/viruses. May be of use...

nrminor commented 8 months ago

Hi @bluenote-1577,

Thanks for the quick reply, and good to know that N bases are accounted for somehow. I suppose it makes sense that it may lower or increase ANI depending on whether N's mask what would be matches or mismatches. Fwiw, my tool does away with query sequences with more than 10% ambiguous bases—~3,000 Ns for SARS-CoV-2—and the reference sequence contains no Ns.

When you have a chance, would you mind directing me to where in your source code you handle N's? As an aside, it has been surprisingly difficult to find distance/identity metrics that explicitly account for non-ATGC bases—it seems like a lot of Hamming of Levenshtein distances used in bioinformatics just count them as mismatches!

Thanks again, --Nick

bluenote-1577 commented 8 months ago

Hi Nick

See the code here: https://github.com/bluenote-1577/skani/blob/main/src/seeding.rs#L279. Skani avoids sampling k-mers when it sees an N. You can see that non ACGTN characters are not accounted for in the code, though -- they just get turned into A.

Yeah handling Ns is a weird issue. Ideally you'd like to just skip them but it gets complicated... wish there were more of a standard too.

Thanks,

Jim