Closed AlmogAngel closed 4 years ago
Is this occurring with FastANI v1.3?
No, I'm still working with v1.1 because my primary results are based on that version.
I am running the same analysis with v1.3 to check if it solves the problem.
Cool, I fixed some bugs in 1.3; hopefully that will resolve this.
I encounter the same problem with version 1.3. Every time I run fastANI, I get a different number of output rows.
Unfortunately, v1.3 did not solve the bug.
Results summary:
version 1.1:
- 1 genome vs. 117K genomes with 1 thread -> 2946 hits (>= 95 ANI)
- ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2780 hits (>= 95 ANI)
version 1.3:
- 1 genome vs. 117K genomes with 1 thread -> 2962 hits (>= 95 ANI)
- ~3000 queries (including the same genome) vs. 117K (split into 5K-genome chunks) with 10 threads per run -> 2944 hits (>= 95 ANI)
I also ran the same genome vs. the remaining 18 hits missing above, and they all got ANI 97-99. So far, the most reliable results come from v1.3 with 1 thread.
Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. I also have some data from #57
One follow-up question: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)?
I am afraid that you cannot reproduce the problem with a small set of genomes, it occurs when using large number of genomes in query/reference.
What do you mean by the chunk size parameter?
To chunk my reference, I use the bash command split -l 5000 on the reference list file, then run all the chunks in parallel; once finished, I merge the results.
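In other words, the workflow looks roughly like the sketch below. The file names (ref_list.txt, chunk_*, query.fasta) are illustrative stand-ins, not the actual files from this run, and the fastANI call is shown only as a comment:

```shell
# Stand-in reference list with 12 entries instead of 117K genomes
seq 1 12 | sed 's/^/genome_/' > ref_list.txt
# Chunk the list, 5 entries per file -> chunk_aa chunk_ab chunk_ac
split -l 5 ref_list.txt chunk_
# Each chunk would then be run in parallel, e.g.:
#   for c in chunk_*; do fastANI -q query.fasta --rl "$c" -o "out_$c" -t 10 & done; wait
# and the per-chunk outputs concatenated. The split/merge itself is lossless:
cat chunk_* > merged.txt
```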
Got it! I had misinterpreted the meaning of 5K genome chunks earlier.
I'm still trying to reason about why this could be happening... for now, I'm guessing there may be an integer overflow in the code. I may be using an int type to store something, and that value could exceed 2 billion with large genome counts.
In your run with 117K genomes, can you give me some more statistics to narrow down the problem:
1) What is the total count of contigs in 117K genomes? For example, genome 1 may have 5 contigs, genome 2 may have 30 contigs and so on.. I'm looking for the total sum: (i.e., 5 + 30 + ...)
2) Can you also share what happens if you run with 2 or 4 threads? You mentioned that you get the best output with a single thread. I want to know how unreliable the results become with more threads; I'm wondering if there is a bug in the multi-threading.
3) Please send the complete output log associated with your experiments (1 genome vs. 117K genomes with 1 thread, and 1 genome vs. 117K genomes with >1 threads). If the ANI output file is not too big, I'd also like those. Please feel free to dump all of these in a single folder and send them to me by email if that's more convenient.
Please make sure you're using v1.3 for the above. Thanks again for your help!
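For point 1, the total contig count can be summed over a reference list with a quick shell loop; a minimal sketch, using tiny made-up FASTA files (g1.fna, g2.fna, genome_list.txt are stand-ins) and counting one contig per '>' header:

```shell
# Two stand-in genomes: g1 has 2 contigs, g2 has 1
printf '>c1\nACGT\n>c2\nACGT\n' > g1.fna
printf '>c1\nACGT\n' > g2.fna
printf 'g1.fna\ng2.fna\n' > genome_list.txt
# Sum FASTA headers across all listed genomes
total=0
while read -r f; do
  total=$((total + $(grep -c '^>' "$f")))
done < genome_list.txt
echo "$total"   # total contig count across all genomes
```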
I got the results, sending you via email.
Hello,
Thank you for the wonderful tool!
I don't know if it helps, but I just want to report that I am experiencing the same/similar thing with 4 threads, and provide more data for your reference.
I did 2 trials using the exact same command with the exact same input: 1x500 comparisons repeated for 500 queries, with the results concatenated. I am using v1.3. Unfortunately, I don't have the log files anymore.
fastANI -q query.fasta --rl ref_files.txt --fragLen 1500 -o query.out -t 4
Both the 1st and 2nd trials gave 89,694 rows of results (the discrepancy was much greater with version 1.1). Below I provide the differences between the two files.
Here are rows found in trial 1 but not in trial 2
header: query seq, ref seq, ANI value, orthologous fragment count, total fragment count
cpxv-jagkre08_1 vacv-lister_btn 97.6994 65 144
cpxv-marlei07_1 mpxv-pch 95.7457 126 141
cpxv-marlei07_1 vacv-lister 96.6591 122 141
**myxv-coomandook_13 myxv-coomandook_13 100 101 108**
myxv-coomandook_13 myxv-cornwall 99.8764 100 108
myxv-olympic_dam_2015 myxv-tol08_18 99.8012 101 108
myxv-perthshire_1818 myxv-port_augusta_2014 99.798 100 107
vacv-agr_mva_572pre vacv-agr_mva_572pre 100 32 110
vacv-agr_mva_572pre vacv-dryvax-dpp10 99.1093 56 110
**vacv-l-ivp_14 varv-ben68 96.2336 117 129**
vacv-lc16m8 cmlv-151v 97.1137 96 126
vacv-lc16m8 cpxv-k780 97.9983 116 126
vacv-lc16m8 cpxv-kos_2015 97.9747 49 126
vacv-lister varv-gin69 96.291 117 126
varv-bgd74_nur varv-eth72_16 99.879 123 124
varv-ner69 vacv-cva 96.6914 82 124
varv-ner69 vacv-dryvax-dpp21 96.5086 94 124
varv-tza65 vacv-dryvax-dpp11 96.3506 115 123
varv-tza65 vacv-ihdw1 96.2555 117 123
varv-tza65 varv-pak_1969 99.8979 122 123
varv-zaf65_102 varv-jpn46_yam 99.8268 38 124
Here are rows found in trial 2 but not in trial 1
**cpxv-humber07_1 vacv-bra_serro_2 97.4106 119 134**
cpxv-jagkre08_1 vacv-lister_btn 97.7175 123 144
cpxv-marlei07_1 mpxv-pch 96.3041 102 141
cpxv-marlei07_1 vacv-lister 96.56 64 141
**mpxv-sle vacv-tp5 96.9057 116 132**
myxv-coomandook_13 myxv-cornwall 99.9796 101 108
myxv-olympic_dam_2015 myxv-tol08_18 99.7989 50 108
myxv-perthshire_1818 myxv-port_augusta_2014 99.6614 32 107
vacv-agr_mva_572pre vacv-agr_mva_572pre 100 108 110
vacv-agr_mva_572pre vacv-dryvax-dpp10 99.1229 108 110
vacv-lc16m8 cmlv-151v 96.7615 119 126
vacv-lc16m8 cpxv-k780 97.6604 118 126
vacv-lc16m8 cpxv-kos_2015 98.0925 122 126
vacv-lister varv-gin69 95.4958 32 126
varv-bgd74_nur varv-eth72_16 99.8782 64 124
varv-ner69 vacv-cva 96.3883 114 124
varv-ner69 vacv-dryvax-dpp21 96.2411 119 124
varv-tza65 vacv-dryvax-dpp11 96.4069 114 123
varv-tza65 vacv-ihdw1 96.3677 67 123
varv-tza65 varv-pak_1969 99.9018 117 123
varv-zaf65_102 varv-jpn46_yam 99.8853 124 124
Rows marked with ** are additional comparisons found in one file but not the other.
Additionally, while the ANI values may not differ much for some rows, there are greater differences in the orthologous fragment counts. E.g., vacv-lister -> varv-gin69 has its fragment count drop from 117 to 32 in trial 2.
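For reference, per-trial difference lists like the ones above can be produced mechanically with sort + comm; a sketch using fabricated two-row files (trial1.out/trial2.out are placeholder names, not the real fastANI outputs):

```shell
# Two fake concatenated result files sharing one common row
printf 'a b 99.1 10 10\nc d 98.0 9 10\n' > trial1.out
printf 'a b 99.1 10 10\ne f 97.5 8 10\n' > trial2.out
sort trial1.out > t1.sorted
sort trial2.out > t2.sorted
comm -23 t1.sorted t2.sorted   # rows found in trial 1 but not trial 2
comm -13 t1.sorted t2.sorted   # rows found in trial 2 but not trial 1
```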
@AlmogAngel, I've made a fix in the code today. I am curious to know whether that change resolves this issue. Would it be possible for you to download the latest code from the master branch and run the above experiment again? Thank you!
Unfortunately, the new code performed worse.
Here are my analysis results:
1) 1 genome vs. 117K with different numbers of threads (v1.3):
# of threads    # of hits
1               5546
2               5546
4               5546
7               5546
10              5546
15              5546
17              5546
20              5519
90              4401
2) I tried again with 20 threads, since I suspected it to be the cutoff at which the bug appears. However, this time I got 5546 hits instead of 5519.
3) I made the same run with 90 threads, this time with the new code you just uploaded to the master branch, and got 3639 hits.
I see. What happens when you keep the number of threads fixed and compare 1 genome vs. 117K genomes against 3000 queries (including the same genome) vs. 117K?
I too am experiencing this behavior, and I've noticed something odd as well. I am doing all-pairs ANI calculations using a file of 47 genomes. When running with 1 thread, I get the expected results every time. If I run with 60 threads, I am at most missing 1 line from the final output file; same if I run with 2 threads. However, if I run with 30 threads, I start missing 100 or more lines from the output most times.
@bkille, would it be possible to share the data and your scripts?
@cjain7, sorry for the delayed response.
mers49.zip
Running fastANI --rl mers_files.txt --ql mers_files.txt -o fastani_out.txt with different thread counts.
(PS is there a way to tell fastANI to do pairwise calculations from one input list? It would save me half the time)
@bkille
I tested fastANI on the genomes you shared above, on two different clusters I have access to. I varied the thread count across 1, 2, 4, 8, 16, 32, 64 and 128. In each case, I'm getting consistent output.
$ wc -l */output.txt
2304 t_128/output.txt
2304 t_16/output.txt
2304 t_1/output.txt
2304 t_2/output.txt
2304 t_32/output.txt
2304 t_4/output.txt
2304 t_64/output.txt
2304 t_8/output.txt
Maybe you can try running on a different computer at your end... I'm not sure what's going on. A similar issue was reported in #37
I also checked random pairs of genomes in each output file (corresponding to different thread counts), I'm seeing consistent output values at my end.
$ grep -E "Wadi-Ad-Dawasir_1_2013.*Al-Hasa_17_2013.fna" t_*/output.txt
t_128/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_16/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_1/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_2/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_32/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_4/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_64/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
t_8/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna /home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna 99.7004 10 10
PS: I'm using the latest version of FastANI directly cloned from master branch.
$ ./FastANI/fastANI -v
version 1.3
See #67, this issue should be fixed by the code change 902ce0abe9624c8d52fa1b94767064ca596ce379 (cc @cindy-tu , @AlmogAngel, @HAugustijn)
Hi, thank you for this wonderful tool.
I've been using fastANI for a long time now, and I've noticed a bug which I find hard to explain or demonstrate, but I will do my best :) :
When I use fastANI with multi-threading and long reference and query lists, it seems to skip some reference genomes (they do not appear in the result file at all).
For example, when I use one genome vs. all my reference database (~117K genomes) with -t 1 I get back 2946 hits with ANI >= 95 (same species).
However, when I take multiple genomes (~3000) to compare with my reference, the same genome from the previous example gets only 2780 hits with ANI >= 95, and I couldn't find the remaining 166 anywhere in the results.
To validate that they do indeed have ANI values, I ran the same genome again against the 166 missing hits (-t 1) and got back the appropriate ANI results (~98 ANI).
In addition, I tried splitting my reference dataset into files of 5K genomes, but the problem remains.
I will be glad to provide more information if needed, thanks!