ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Sequences don't fully match themselves #132

Open fluhus opened 5 months ago

fluhus commented 5 months ago

Hello, thanks for making this tool!

As a sanity test before incorporating it into my pipeline, I aligned a collection of viral genomes (~10K+ bases each) against themselves. To my surprise, 35% of the sequences did not have a perfect match.

For example, with the attached file below, running fastANI -q vir.fa -r vir.fa -o /dev/stdout gave:

vir.fa     vir.fa     100     3       4

I am seeing 100% base identity but 3 out of 4 chunks matched. Is that correct? Does that mean 100% * 3 / 4 = 75% match? How can I distinguish this case from a genome that's actually 25% shorter but matches 100%? Maybe I am misinterpreting the results?

I hope my question is clear :)

vir.fa.gz

valery-shap commented 2 months ago

Hello, @fluhus,

This topic interests me too. I see nearly the same situation with bacterial genomes, especially when fraglen is changed from the default (3000) to 1020.

For fraglen=1020:

ANC_3681.fasta     ANC_3681.fasta     99.9992     3432     3467

For fraglen=3000:

ANC_3681.fasta     ANC_3681.fasta     100     1169     1177
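
To make the numbers in this thread easier to compare, here is a minimal Python sketch that parses the output lines quoted above and computes the fraction of fragments that matched. It assumes the last two columns are the matched fragment count and the total query fragment count, which is how the original post reads them; `AniRow`, `parse_row` and `matched_fraction` are hypothetical helper names, not part of FastANI.

```python
from typing import NamedTuple


class AniRow(NamedTuple):
    query: str
    reference: str
    ani: float     # third column: reported identity, in percent
    matched: int   # assumed: number of fragments that matched
    total: int     # assumed: total number of query fragments


def parse_row(line: str) -> AniRow:
    """Parse one whitespace-separated output line (paths must not contain spaces)."""
    query, reference, ani, matched, total = line.split()
    return AniRow(query, reference, float(ani), int(matched), int(total))


def matched_fraction(row: AniRow) -> float:
    """Fraction of query fragments that matched (hypothetical helper)."""
    return row.matched / row.total


# The three output lines quoted in this thread.
rows = [
    "vir.fa vir.fa 100 3 4",
    "ANC_3681.fasta ANC_3681.fasta 99.9992 3432 3467",
    "ANC_3681.fasta ANC_3681.fasta 100 1169 1177",
]

for line in rows:
    row = parse_row(line)
    print(f"{row.query}\tANI={row.ani}\tmatched fraction={matched_fraction(row):.4f}")
```

Whether multiplying the identity by this fraction gives a meaningful "percent match" is the open question here; the sketch only makes the two numbers explicit side by side.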