ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Results missing from output #37

Status: Closed (luke-dt closed this 4 years ago)

luke-dt commented 5 years ago

I'm performing a large all vs all ANI calculation, and it appears that some results that should be there are missing from the output. For example, when comparing bacteria A vs. bacteria B, the resulting ANI is 87%, but when the converse comparison is made (B vs. A), the result is entirely missing from the output file. Here's a piece of a heatmap illustrating this:

[screenshot: heatmap excerpt]

Notice the blue dots (values equal to 0) dispersed asymmetrically in areas of otherwise high ANI.

cjain7 commented 5 years ago

Thanks for sharing these results. While experimenting with FastANI, we also noticed a few rare cases where the difference was significant. When we analyzed them further, it turned out that those particular cases had a significant amount of contamination (the analysis is available in the supplementary document of our paper). I'm wondering if that's true here as well. Are these genomes publicly available?

luke-dt commented 5 years ago

Thanks for sharing! Unfortunately, most of these genomes are not publicly available. They are environmental isolates, and while the potential for contamination is not insignificant, I have some cases where I am confident that the assemblies are not contaminated. While looking into this a little more, I also realized that I'm having a consistency issue with FastANI. I ran FastANI 3 times (all vs. all) with my ~220 genomes, which resulted in ~27,000 significant comparisons being reported. Of those, ~180 comparisons show some variation between runs. As an example, here is a self comparison that happened to show variation between runs:

Delftia_acidovorans_NBRC_14950 vs Delftia_acidovorans_NBRC_14950

ani      | f.aligned | f.total | percent.aligned
98.3451  | 510       | 2167    | 23.534841
100.0000 | 2161      | 2167    | 99.723120
100.0000 | 2161      | 2167    | 99.723120

This was done with a publicly available type strain assembly: https://www.ncbi.nlm.nih.gov/assembly/GCF_001598795.1
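
For reference, a minimal sketch of this kind of repeat-and-diff consistency check (genomes.txt and the run_*.out file names are illustrative, not the exact commands used above):

for i in 1 2 3; do
    fastANI --ql genomes.txt --rl genomes.txt -t 8 -o run_${i}.out
done

# Sort each run's output and diff them; any lines reported below vary between runs.
sort run_1.out > run_1.sorted
sort run_2.out > run_2.sorted
sort run_3.out > run_3.sorted
diff run_1.sorted run_2.sorted
diff run_1.sorted run_3.sorted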

cjain7 commented 5 years ago

I could not reproduce the inconsistency issue with the publicly available strain you suggested. I did 100 comparisons and got consistent output each time.

I'm using the latest source code on the master branch of the repository. Below I've pasted my terminal log. Do you suggest a different recipe to reproduce your issue?

[jainc2@gry-compute076 Issue-37]$ ls
GCF_001598795.1_ASM159879v1_genomic.fna

[jainc2@gry-compute076 Issue-37]$ for ((i=1;i<=10;i++)); do echo "GCF_001598795.1_ASM159879v1_genomic.fna" >> list; done
[jainc2@gry-compute076 Issue-37]$ cat list
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna
GCF_001598795.1_ASM159879v1_genomic.fna

[jainc2@gry-compute076 Issue-37]$ ../fastANI --ql list --rl list -t 16 -o 100_runs.out &> LOG

[jainc2@gry-compute076 Issue-37]$ grep -c "100" 100_runs.out
100
[jainc2@gry-compute076 Issue-37]$ grep -c "2161" 100_runs.out
100
[jainc2@gry-compute076 Issue-37]$ grep -c "2167" 100_runs.out
100
[jainc2@gry-compute076 Issue-37]$ head -n 3 100_runs.out
GCF_001598795.1_ASM159879v1_genomic.fna GCF_001598795.1_ASM159879v1_genomic.fna 100 2161 2167
GCF_001598795.1_ASM159879v1_genomic.fna GCF_001598795.1_ASM159879v1_genomic.fna 100 2161 2167
GCF_001598795.1_ASM159879v1_genomic.fna GCF_001598795.1_ASM159879v1_genomic.fna 100 2161 2167

luke-dt commented 5 years ago

Thanks for looking into this! I think the problem I'm having comes from the scale at which I'm running FastANI (on the order of hundreds of genomes). Using the publicly available assembly I posted earlier, I ran 16384 (128^2) pairwise self comparisons using FastANI v1.1. The following was run across 128 nodes:

fastANI -q ${sample} --rl ${sampleList} --threads 8 --output ${outdir}/${ARRAY_TASK_ID}_ani.temp

sampleList is a text file containing the file path to the .fasta file repeated 128 times. I then cat-ed all the outputs into one file:

cat *.temp >> allOut.tsv
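
A minimal sketch of the kind of sbatch array script described here (the #SBATCH directives, queries.txt, and the way ${sample} is derived are assumptions; ${sampleList} and ${outdir} are assumed to be set as in the command above):

#!/bin/bash
#SBATCH --array=1-128
#SBATCH --cpus-per-task=8

# Each array task runs one query against the full reference list and
# writes a per-task .temp file that is concatenated afterwards.
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" queries.txt)
fastANI -q ${sample} --rl ${sampleList} --threads 8 --output ${outdir}/${SLURM_ARRAY_TASK_ID}_ani.temp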

The resulting output contains 16160 rows (missing 224 observations), with 2161 fragments aligned appearing 16146 times. The other 14 rows show significantly fewer aligned fragments:

[screenshot: output rows with fewer aligned fragments]

I realize this is a very small percentage compared to the rest of the comparisons, but it's an issue when doing large numbers of comparisons since the erroneous values appear to pop up at random. Do you have any suggestions on dealing with this issue?

Thank you!

cjain7 commented 5 years ago

Hi, I tried reproducing this issue with 128x128 comparisons but failed. This makes it tough to figure out what's going on.

luke-dt commented 5 years ago

Hi, I returned to this after putting it on the back burner for a little bit. Reinstalling the module on our cluster seems to have fixed the problem. I'm still not sure what was going on before, since we just reinstalled the same version.

Thanks for all the help!

luke-dt commented 4 years ago

I wanted to reopen this because I'm seeing some weird behavior with fastANI that is similar to what I was seeing before (and a bit like #58). I'm trying to run fastANI across many nodes so that an all_v_all comparison becomes many one_v_alls, which I can then cat together. However, depending on how I run the program, I get different results. I did all of the following experiments with the NCBI reference genome linked above; list.txt is just this genome repeated 128 times:

1) All vs All, 8 threads
I run the following command, which gives me the expected 16384 hits:

fastANI --ql list.txt --rl list.txt -o ${outdir}/output_128v128.txt -t 8 --matrix

2) One vs All, 8 threads
I run the following command across 128 nodes, which gives me 16282 hits (missing 102). I tried this multiple times and there is some variability; it's not always 102, and has been as high as 150.

fastANI -q ${sample} --rl list.txt -o ${outdir}/output_${SLURM_ARRAY_TASK_ID}.txt -t 8

3) One vs All, 1 thread
I run the following command across 128 nodes, which gives me the expected 16384 hits:

fastANI -q ${sample} --rl list.txt -o ${outdir}/output_${SLURM_ARRAY_TASK_ID}.txt -t 1

This is using v1.3.
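
For what it's worth, a hypothetical way to quantify the missing hits after concatenating the per-node outputs (combined.txt and the *_pairs.txt names are illustrative; this assumes the standard tab-delimited query/reference columns in the fastANI output):

cat output_*.txt > combined.txt
wc -l < combined.txt                                     # expect 16384 for 128 x 128 comparisons

# Build the full set of expected query/reference pairs and count those absent from the output.
awk -F'\t' '{print $1"\t"$2}' combined.txt | sort > reported_pairs.txt
while read -r q; do
    while read -r r; do printf '%s\t%s\n' "$q" "$r"; done < list.txt
done < list.txt | sort > expected_pairs.txt
comm -23 expected_pairs.txt reported_pairs.txt | wc -l   # number of missing comparisons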

cjain7 commented 4 years ago

Could you try using the latest code on the master branch? The instructions to download and compile the code are listed in the README. I fixed a similar issue a while ago.

luke-dt commented 4 years ago

I downloaded and compiled from the master branch, and I'm having the same issue.

cjain7 commented 4 years ago

Thanks for letting me know. I'll try to reproduce this problem at my end. Will get back to you soon.

cjain7 commented 4 years ago

reproduce.zip

Please check the above archive to see my scripts and output. I am unable to reproduce the above behavior at my end. Do you have any suggestions for me, or any changes I can make to reproduce the issue?

cat one2all/*.txt | wc -l = 16384
cat all2all/*.txt | wc -l = 16384

I'm running the code from master branch.

luke-dt commented 4 years ago

I looked through your results and couldn't find any issues with your implementation. The only difference I can think of is the physical system that I'm running FastANI on (I use a computer cluster managed by SLURM).

I'm still seeing some issues, but running with <8 threads seems to have resolved them. I don't see a huge performance increase from using >8 threads anyway:

threads |  hits | real time (min) | cpu time (min)
1       | 91770 |      1189.2     |      1238.9
2       | 91770 |       658.2     |      1349.1
4       | 91770 |       548.4     |      1546.4
8       | 91266 |       287.7     |      1700.5
12      | 91406 |       266.1     |      2088.8

cjain7 commented 4 years ago

I can redo my analysis on SLURM. But you should make sure that you are requesting appropriate CPU resources in your SLURM commands. For example, if you request only one core, multi-threading becomes useless.

cjain7 commented 4 years ago

Here is an example (using 28 threads): https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Note the --cpus-per-task and OMP_NUM_THREADS parameters.
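
For illustration, a minimal sbatch sketch along those lines, adapted to the one-vs-all command used earlier in this thread (the directives, the output file name, and the assumption that ${sample} is set elsewhere are mine, not taken from the linked page):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # reserve one core per thread FastANI will use

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fastANI -q ${sample} --rl list.txt -t $SLURM_CPUS_PER_TASK -o output_one_vs_all.txt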

luke-dt commented 4 years ago

Thanks for looking into this! It's not urgent, since limiting the number of threads seems to have resolved the issue for me.

And thanks for the link! The 1 thread job was just because I was curious what would happen.

cjain7 commented 4 years ago

See #67; this issue should be fixed by commit 902ce0abe9624c8d52fa1b94767064ca596ce379.