ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

fastANI skips reference genomes #58

Closed: AlmogAngel closed this issue 4 years ago

AlmogAngel commented 4 years ago

Hi, thank you for this wonderful tool.

I've been using fastANI for a long time now and I've noticed a bug that I find hard to explain or demonstrate, but I will do my best :) :

When I run fastANI with multiple threads and long reference and query lists, it seems to skip some reference genomes (they do not appear in the result file at all).

For example, when I compare one genome vs. my entire reference database (~117K genomes) with -t 1, I get back 2946 hits with ANI >= 95 (same species).

However, when I take multiple genomes (~3000) to compare against my reference, the same genome from the previous example gets only 2780 hits with ANI >= 95, and I couldn't find the remaining 166 anywhere in the results.

To validate that they indeed have an ANI value, I ran the same genome again against the 166 missing hits (-t 1) and got back the expected ANI results (~98 ANI).

In addition, I tried splitting my reference dataset into files of 5K genomes each, but the problem remained.
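For reference, "hits" here means output rows whose third column (the ANI value) is at least 95; such a count can be taken with a one-liner along these lines (one_vs_all.out is a placeholder file name):

    # Count output rows whose ANI value (column 3) is at least 95.
    awk '$3 >= 95' one_vs_all.out | wc -l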

I will be glad to provide more information if needed, thanks!

cjain7 commented 4 years ago

Is this occurring with FastANI v1.3?

AlmogAngel commented 4 years ago

> Is this occurring with FastANI v1.3?

No, I'm still working with v1.1 because my primary results are based on that version.

I am running the same analysis with v1.3 to check if it solves the problem.

cjain7 commented 4 years ago

Cool, I fixed some bugs in 1.3; hopefully this will resolve it.

HAugustijn commented 4 years ago

I encounter the same problem while using version 1.3. Every time I run fastANI, I get a different amount of output.

AlmogAngel commented 4 years ago

> Cool, I fixed some bugs in 1.3; hopefully this will resolve it.

Unfortunately, v1.3 did not solve the bug.

Results summary:

version 1.1:
- 1 genome vs. 117K genomes with 1 thread -> 2946 hits (>= 95 ANI)
- ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2780 hits (>= 95 ANI)

version 1.3:
- 1 genome vs. 117K genomes with 1 thread -> 2962 hits (>= 95 ANI)
- ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2944 hits (>= 95 ANI)

I also ran the same genome vs. the 18 hits still missing above, and they all got ANI 97-99. So far, the most reliable results seem to come from using v1.3 with 1 thread.

cjain7 commented 4 years ago

Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. I also have some data from #57.

cjain7 commented 4 years ago

I have one follow-up question: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)?

AlmogAngel commented 4 years ago

> Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. I also have some data from #57.

I'm afraid you cannot reproduce the problem with a small set of genomes; it occurs when using a large number of genomes in the query/reference.

> I have one follow-up question: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)?

What do you mean by the chunk size parameter? To chunk my reference, I run the bash command split -l 5000 on the reference list file, then run all the chunks in parallel; once they finish, I merge the results.
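For context, the chunked workflow looks roughly like this (the file names here are placeholders, not my actual paths):

    # Split the reference list (one genome path per line) into chunks of 5000.
    split -l 5000 ref_list.txt ref_chunk_

    # Launch one fastANI run per chunk in parallel, 10 threads each.
    for chunk in ref_chunk_*; do
        fastANI --ql query_list.txt --rl "$chunk" -o "out_${chunk}.txt" -t 10 &
    done
    wait

    # Merge the per-chunk results back into a single file.
    cat out_ref_chunk_*.txt > combined_results.txt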

cjain7 commented 4 years ago

Got it! I had misinterpreted the meaning of 5K genome chunks earlier.

cjain7 commented 4 years ago

I'm still trying to reason about why this could be happening. For now, my guess is that an integer overflow is happening in the code: I may be using an int type for storing something, and that value could be exceeding 2 billion with large genome counts.
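As a rough sanity check of the overflow hypothesis, one could sum the total number of bases across all reference genomes and compare it against the signed 32-bit limit; a sketch (ref_list.txt is a placeholder for the reference list file):

    # Sum the bases over every genome listed in ref_list.txt and compare
    # the total against the signed 32-bit limit (2,147,483,647).
    total=0
    while read -r genome; do
        bases=$(grep -v "^>" "$genome" | tr -d '\n' | wc -c)
        total=$((total + bases))
    done < ref_list.txt
    echo "Total bases: $total (int32 max: 2147483647)"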

In your run with 117K genomes, can you give me some more statistics to narrow down the problem:

1) What is the total count of contigs in the 117K genomes? For example, genome 1 may have 5 contigs, genome 2 may have 30 contigs, and so on. I'm looking for the total sum (i.e., 5 + 30 + ...); a quick way to count this is sketched at the end of this comment.

2) Can you also share what happens if you run with 2 or 4 threads? You mentioned that you get the best output with a single thread. I want to know how unreliable the results become with more threads. I'm wondering if there is a bug in the multi-threading.

3) Please send the complete output logs associated with your experiments: 1 genome vs. 117K genomes with 1 thread, and 1 genome vs. 117K genomes with >1 thread. If the ANI output files are not too big, I'd also like those. Please feel free to dump all of these in a single folder and send it to me by email if that's more convenient.

Please make sure you're using v1.3 for the above. Thanks again for your help!
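For item (1), a quick way to obtain the total contig count across all genomes in a list file might be (ref_list.txt is again just a placeholder name):

    # Count FASTA header lines (">") in each genome and sum them.
    total_contigs=0
    while read -r genome; do
        n=$(grep -c "^>" "$genome")
        total_contigs=$((total_contigs + n))
    done < ref_list.txt
    echo "Total contigs: $total_contigs"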

AlmogAngel commented 4 years ago

> I'm still trying to reason about why this could be happening. For now, my guess is that an integer overflow is happening in the code: I may be using an int type for storing something, and that value could be exceeding 2 billion with large genome counts.
>
> In your run with 117K genomes, can you give me some more statistics to narrow down the problem:
>
> 1. What is the total count of contigs in the 117K genomes? For example, genome 1 may have 5 contigs, genome 2 may have 30 contigs, and so on. I'm looking for the total sum (i.e., 5 + 30 + ...).
> 2. Can you also share what happens if you run with 2 or 4 threads? You mentioned that you get the best output with a single thread. I want to know how unreliable the results become with more threads. I'm wondering if there is a bug in the multi-threading.
> 3. Please send the complete output logs associated with your experiments: 1 genome vs. 117K genomes with 1 thread, and 1 genome vs. 117K genomes with >1 thread. If the ANI output files are not too big, I'd also like those. Please feel free to dump all of these in a single folder and send it to me by email if that's more convenient.
>
> Please make sure you're using v1.3 for the above. Thanks again for your help!

I got the results; I'm sending them to you via email.

cindy-tu commented 4 years ago

Hello,

Thank you for the wonderful tool!

I don't know if it helps, but I just want to report that I am experiencing the same/similar thing with 4 threads, and provide more data for your reference.

I did 2 trials using the exact same command with the exact same input: 1x500 comparisons repeated for 500 queries, with the results concatenated. I am using v1.3. Unfortunately, I don't have the log files anymore.

fastANI -q query.fasta --rl ref_files.txt --fragLen 1500 -o query.out -t 4
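The per-query setup described above can be reproduced with a loop along these lines (query_files.txt and the output paths are placeholders, not the exact names I used):

    # Run each query against the ~500 references, then concatenate the results.
    mkdir -p per_query_results
    while read -r query; do
        fastANI -q "$query" --rl ref_files.txt --fragLen 1500 \
                -o "per_query_results/$(basename "$query").out" -t 4
    done < query_files.txt
    cat per_query_results/*.out > combined_trial.txt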

Both the 1st and 2nd trials gave 89,694 rows of results (the discrepancy was much greater with version 1.1). Below I provide the differences between the two files.

Here are rows found in trial 1 but not in trial 2

header: query seq, ref seq, ANI value, orthologous fragment count, total fragment count

cpxv-jagkre08_1       vacv-lister_btn       97.6994 65      144
cpxv-marlei07_1       mpxv-pch      95.7457 126     141
cpxv-marlei07_1       vacv-lister   96.6591 122     141
**myxv-coomandook_13    myxv-coomandook_13    100       101     108**
myxv-coomandook_13    myxv-cornwall 99.8764 100     108
myxv-olympic_dam_2015   myxv-tol08_18 99.8012 101       108
myxv-perthshire_1818  myxv-port_augusta_2014        99.798  100       107
vacv-agr_mva_572pre   vacv-agr_mva_572pre   100       32      110
vacv-agr_mva_572pre   vacv-dryvax-dpp10     99.1093 56      110
**vacv-l-ivp_14   varv-ben68    96.2336   117     129**
vacv-lc16m8   cmlv-151v     97.1137 96      126
vacv-lc16m8   cpxv-k780     97.9983 116     126
vacv-lc16m8     cpxv-kos_2015 97.9747 49        126
vacv-lister   varv-gin69    96.291    117     126
varv-bgd74_nur        varv-eth72_16 99.879  123       124
varv-ner69    vacv-cva      96.6914 82      124
varv-ner69    vacv-dryvax-dpp21     96.5086 94      124
varv-tza65    vacv-dryvax-dpp11     96.3506 115     123
varv-tza65    vacv-ihdw1    96.2555 117     123
varv-tza65    varv-pak_1969 99.8979 122     123
varv-zaf65_102        varv-jpn46_yam        99.8268 38      124

Here are rows found in trial 2 but not in trial 1

**cpxv-humber07_1       vacv-bra_serro_2      97.4106 119     134**
cpxv-jagkre08_1       vacv-lister_btn       97.7175 123     144
cpxv-marlei07_1       mpxv-pch      96.3041 102     141
cpxv-marlei07_1       vacv-lister   96.56     64      141
**mpxv-sle      vacv-tp5      96.9057 116     132**
myxv-coomandook_13    myxv-cornwall 99.9796 101     108
myxv-olympic_dam_2015   myxv-tol08_18 99.7989 50        108
myxv-perthshire_1818  myxv-port_augusta_2014        99.6614 32      107
vacv-agr_mva_572pre   vacv-agr_mva_572pre   100       108     110
vacv-agr_mva_572pre   vacv-dryvax-dpp10     99.1229 108     110
vacv-lc16m8   cmlv-151v     96.7615 119     126
vacv-lc16m8   cpxv-k780     97.6604 118     126
vacv-lc16m8   cpxv-kos_2015 98.0925 122     126
vacv-lister     varv-gin69    95.4958   32      126
varv-bgd74_nur        varv-eth72_16 99.8782   64      124
varv-ner69    vacv-cva      96.3883 114     124
varv-ner69    vacv-dryvax-dpp21     96.2411 119     124
varv-tza65    vacv-dryvax-dpp11     96.4069 114     123
varv-tza65    vacv-ihdw1    96.3677 67      123
varv-tza65    varv-pak_1969 99.9018 117     123
varv-zaf65_102        varv-jpn46_yam        99.8853 124     124

Rows marked with ** are additional comparisons found in one file but not the other.

Additionally, while the ANI values may not differ that much for some rows, there are greater differences in the orthologous fragment counts. E.g., vacv-lister -> varv-gin69 has its fragment count drop from 117 to 32 in trial 2.
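For anyone who wants to reproduce this comparison, the per-trial differences above can be extracted with something like the following (trial1.out and trial2.out are placeholder names for the two concatenated output files):

    # List rows that appear in only one of the two trials (comm needs sorted input).
    sort trial1.out > trial1.sorted
    sort trial2.out > trial2.sorted
    comm -23 trial1.sorted trial2.sorted   # rows only in trial 1
    comm -13 trial1.sorted trial2.sorted   # rows only in trial 2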

cjain7 commented 4 years ago

> > Cool, I fixed some bugs in 1.3; hopefully this will resolve it.
>
> Unfortunately, v1.3 did not solve the bug.
>
> Results summary:
>
> version 1.1:
> - 1 genome vs. 117K genomes with 1 thread -> 2946 hits (>= 95 ANI)
> - ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2780 hits (>= 95 ANI)
>
> version 1.3:
> - 1 genome vs. 117K genomes with 1 thread -> 2962 hits (>= 95 ANI)
> - ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2944 hits (>= 95 ANI)
>
> I also ran the same genome vs. the 18 hits still missing above, and they all got ANI 97-99. So far, the most reliable results seem to come from using v1.3 with 1 thread.

@AlmogAngel, I've made a fix in the code today. I am curious to know whether that change resolves this issue. Would it be possible for you to download the latest code from the master branch and run the above experiment again? Thank you!

AlmogAngel commented 4 years ago

> @AlmogAngel, I've made a fix in the code today. I am curious to know whether that change resolves this issue. Would it be possible for you to download the latest code from the master branch and run the above experiment again? Thank you!

Unfortunately, the new code performed worse.

Here are my analysis results:

1) 1 genome vs. 117K genomes with different numbers of threads (v1.3):

   # of threads   # of hits
   1              5546
   2              5546
   4              5546
   7              5546
   10             5546
   15             5546
   17             5546
   20             5519
   90             4401

2) I tried again with 20 threads, since I suspected it to be the cutoff at which the bug appears. However, this time I got 5546 hits instead of 5519.

3) I ran the same analysis with 90 threads, this time with the new code you just uploaded to the master branch, and got 3639 hits this time.

cjain7 commented 4 years ago

I see.

What happens when you keep the number of threads fixed and run:

- 1 genome vs. 117K genomes
- 3000 queries (including the same genome) vs. 117K genomes?

bkille commented 4 years ago

I too am experiencing this behavior, and I've noticed something odd as well. I am doing all-pairs ANI calculations using a file of 47 genomes. When running with 1 thread, I get the expected results every time. If I run with 60 threads, I am at most missing 1 line from the final output file; same if I run with 2 threads. However, if I run with 30 threads, I start missing 100 or more lines from the output most of the time.
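A sweep over thread counts like the one described above can be scripted roughly as follows (genomes.txt and the output names are placeholders):

    # Run the same all-pairs comparison at several thread counts and
    # report the number of output rows for each run.
    for t in 1 2 30 60; do
        fastANI --ql genomes.txt --rl genomes.txt -o "out_t${t}.txt" -t "$t"
        echo "threads=$t rows=$(wc -l < "out_t${t}.txt")"
    done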

cjain7 commented 4 years ago

@bkille, is it possible for you to share the data and your scripts?

bkille commented 4 years ago

@cjain7, sorry for the delayed response.

mers49.zip

I'm running fastANI --rl mers_files.txt --ql mers_files.txt -o fastani_out.txt with different thread counts.

(PS: is there a way to tell fastANI to compute each pair only once from a single input list? It would save me half the time.)

cjain7 commented 4 years ago

@bkille

I tested fastANI on the genomes you shared above on two different clusters I have access to. I varied the thread count over 1, 2, 4, 8, 16, 32, 64, and 128. In each case, I'm getting consistent output.

$ wc -l */output.txt
   2304 t_128/output.txt
   2304 t_16/output.txt
   2304 t_1/output.txt
   2304 t_2/output.txt
   2304 t_32/output.txt
   2304 t_4/output.txt
   2304 t_64/output.txt
   2304 t_8/output.txt

Maybe you can try running on a different computer at your end... I'm not sure what's going on. A similar issue was reported in #37.

cjain7 commented 4 years ago

I also checked random pairs of genomes in each output file (corresponding to different thread counts); I'm seeing consistent output values at my end.

$ grep -E "Wadi-Ad-Dawasir_1_2013.*Al-Hasa_17_2013.fna" t_*/output.txt

t_128/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna  /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_16/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna   /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_1/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna    /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_2/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna    /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_32/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna   /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_4/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna    /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_64/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna   /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10
t_8/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna    /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna  99.7004 10  10

PS: I'm using the latest version of FastANI, directly cloned from the master branch.

$ ./FastANI/fastANI -v
version 1.3

cjain7 commented 4 years ago

See #67; this issue should be fixed by the code change 902ce0abe9624c8d52fa1b94767064ca596ce379 (cc @cindy-tu, @AlmogAngel, @HAugustijn).