ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

FastANI gives different results depending on genomes in the reference list #46

Closed donovan-h-parks closed 5 years ago

donovan-h-parks commented 5 years ago

Hello,

Thank you for FastANI. We are using it regularly in our work. I have run into some unexpected behaviour where FastANI does not appear to give consistent results. I have a query genome Q and the reported ANI to a given reference genome R changes depending on what genomes I have in the reference list.

That is, fastANI -q Q.fna -r R.fna -o single.tsv

Gives a different result to: fastANI -q Q.fna --rl references.lst -o multiple.tsv

single.tsv gives: Q.fna R.fna 97.0547 1150 1325

The relevant line in multiple.tsv gives: Q.fna R.fna 97.0342 1152 1325

Why are the reported ANI and the number of alignment fragments different? The results change slightly as I modify the genomes in references.lst. Is this the expected behaviour? If so, it would be helpful to note this heuristic quality of FastANI in the README, since these small differences do change assignments in a small number of cases when processing large genome databases, which leads to confusion.
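For anyone who wants to check for the same discrepancy, here is a minimal sketch that compares two FastANI output files pair by pair. It assumes the output is tab-delimited with five columns (query, reference, ANI, mapped fragments, total query fragments), which matches the lines quoted above. The function names `read_fastani` and `diff_runs` are hypothetical helpers, not part of FastANI.

```python
import csv


def read_fastani(path):
    """Map (query, reference) -> (ani, mapped_fragments, total_fragments).

    Assumes a tab-delimited file with at least five columns per row.
    """
    results = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 5:
                continue  # skip blank or malformed rows
            results[(row[0], row[1])] = (float(row[2]), int(row[3]), int(row[4]))
    return results


def diff_runs(single_path, multiple_path, tol=0.0):
    """Return pairs present in both files whose ANI or fragment counts differ."""
    single = read_fastani(single_path)
    multiple = read_fastani(multiple_path)
    diffs = []
    for pair in sorted(single.keys() & multiple.keys()):
        a, b = single[pair], multiple[pair]
        if abs(a[0] - b[0]) > tol or a[1:] != b[1:]:
            diffs.append((pair, a, b))
    return diffs


if __name__ == "__main__":
    # single.tsv and multiple.tsv are the output files from the two commands above.
    for pair, a, b in diff_runs("single.tsv", "multiple.tsv"):
        print(pair, a, b)
```

Running this on the single.tsv and multiple.tsv files described above should list the Q.fna / R.fna pair, since both the ANI and the mapped fragment count differ between the two runs.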

cjain7 commented 5 years ago

Let me try to reproduce this to see why it is happening. May I ask how many genomes you have in references.lst?

donovan-h-parks commented 5 years ago

Probably around 50. I'm away from the office for a week, but can send you the genomes if you aren't able to reproduce the issue.

On Fri, Jul 12, 2019, 3:05 PM Chirag Jain, notifications@github.com wrote:

Let me try to reproduce this to see why it is happening. May I ask how many genomes you have in references.lst?


donovan-h-parks commented 5 years ago

Hello. Any luck in reproducing this issue? It is a bit of a concern on our end since we process large volumes of genomes and use strict cutoffs to make decisions. As such, we do run into cases where the small differences between these two modes of running FastANI lead to different results.

cjain7 commented 5 years ago

Hey, yes, I'm able to reproduce it... looking into it.

cjain7 commented 5 years ago

Hi, I've made a couple of fixes for this issue. Could you retry FastANI with the latest code on the master branch? I'll mark a new version if it works out fine.

donovan-h-parks commented 5 years ago

Hello. I don't have an easy way to compile this code. The system I am on is still running gcc 4.6.3. Can you provide me with a Linux binary to test?

cjain7 commented 5 years ago

Here you go. fastANI.zip

donovan-h-parks commented 5 years ago

I can confirm I am getting the same ANI and AF when doing a single comparison or when doing multiple comparisons via a reference list. The new result does differ slightly from both the previous values I was getting though: 97.0536 1150 1325

cjain7 commented 5 years ago

Thanks for the update! Yes, it will differ, mainly because FastANI v1.1 dropped high-frequency k-mers (the top 0.001%) in the reference database to optimize for speed. I removed this optimization because it would also contribute to inconsistent results when comparing against one versus a thousand genomes.
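To illustrate why a database-wide frequency cutoff makes the result for one reference depend on the other references, here is a toy sketch. It is not FastANI's actual code: the sequences, the k-mer size, the exaggerated `drop_fraction`, and the `kmers` and `retained_kmers` helpers are all hypothetical and chosen only to make the effect visible on tiny inputs.

```python
from collections import Counter


def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def retained_kmers(target, database, k, drop_fraction):
    """k-mers of `target` left after dropping the most frequent k-mers in `database`.

    Frequency here is the number of database genomes containing the k-mer.
    The cutoff is computed over the whole database, so it changes with its contents.
    """
    counts = Counter()
    for genome in database:
        counts.update(kmers(genome, k))
    n_drop = int(len(counts) * drop_fraction)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    dropped = set(ranked[:n_drop])
    return kmers(target, k) - dropped


if __name__ == "__main__":
    R = "ACGTTGCA"               # toy reference genome
    others = ["TTGTTG", "CTTGA"]  # toy extra genomes sharing the k-mer TTG

    alone = retained_kmers(R, [R], k=3, drop_fraction=0.12)
    in_list = retained_kmers(R, [R] + others, k=3, drop_fraction=0.12)

    print(sorted(alone))
    print(sorted(in_list))
    print(sorted(alone - in_list))  # k-mers of R lost only because of the other genomes
```

In this toy setup, R keeps all of its k-mers when it is the only reference, but loses a shared k-mer once the other genomes are added, so any score computed from the retained k-mers can shift even though Q and R are unchanged.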