Sequences don't fully match themselves

ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation

Apache License 2.0

Hello, thanks for making this tool!

As a sanity check before incorporating it into my pipeline, I compared a collection of viral genomes (10K+ bases each) against themselves. To my surprise, 35% of the sequences did not have a perfect match.

For example, with the file attached below, running fastANI -q vir.fa -r vir.fa -o /dev/stdout gave:

vir.fa     vir.fa     100     3       4

I am seeing 100% identity, but only 3 out of 4 chunks matched. Is that correct? Does that mean the overall match is 100% * 3 / 4 = 75%? How can I distinguish this case from a genome that is actually 25% shorter but matches at 100%? Or am I misinterpreting the results?
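
To make the arithmetic concrete, here is a minimal Python sketch that parses the output row above and computes the matched fraction. It assumes the five columns are query, reference, ANI, matched fragment (chunk) count, and total query fragment count, as I understand the FastANI README. The function name and dictionary keys are hypothetical, and whether ANI multiplied by the matched fraction is a meaningful score is exactly what I am unsure about.

```python
def parse_fastani_line(line):
    """Parse one FastANI output row into a dict (hypothetical helper).

    Assumes whitespace-separated columns and file paths without spaces:
    query, reference, ANI, matched fragments, total query fragments.
    """
    query, reference, ani, matched, total = line.split()
    matched, total = int(matched), int(total)
    return {
        "query": query,
        "reference": reference,
        "ani": float(ani),
        "matched_fragments": matched,
        "total_fragments": total,
        # Fraction of query fragments that mapped to the reference.
        "matched_fraction": matched / total,
    }


row = parse_fastani_line("vir.fa     vir.fa     100     3       4")
print(row["matched_fraction"])               # 0.75
# Naive combined score from the question; not an official FastANI metric.
print(row["ani"] * row["matched_fraction"])  # 75.0
```

Reporting the matched fraction next to the ANI value, rather than folding them into one number, would at least keep "100% identity over 3 of 4 fragments" distinguishable from a lower identity over all fragments.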

I hope my question is clear :)

vir.fa.gz

Sequences don't fully match themselves #132