Closed apcamargo closed 4 years ago
Thanks for sharing this problem.
I've noticed multiple issues that highlight this problem (see also #37 and #58); unfortunately, I've failed to reproduce it on our compute clusters. I'm willing to invest time in this problem, but I need help reproducing this behavior on my end for debugging.
Are you able to provide more details (e.g., Mac/Linux, GCC version, input data files)?
I got this issue with both the Conda version (I believe they use GCC 7.*) and a statically compiled version on my personal computer (master branch, Ubuntu 16.04, GCC 7.5.0). I executed the runs on a cluster running SUSE Linux Enterprise Server 15.
By the way, I hit a bug while compiling FastANI on my PC and submitted a PR fixing it: https://github.com/ParBLiSS/FastANI/pull/68
I don't think I can share this specific dataset because it isn't mine. But I'll try to replicate the issue with my own genomes so I can send you the data. I can't promise that I'll be able to do that in the next few days, though.
Just to illustrate the extent of the inconsistency: I executed the all-versus-all comparison eight times and each run had ~16 comparisons that were not found in any of the other ones. I also noticed that this greatly influenced the definition of species (using an algorithm similar to the one used by GTDB).
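A check like the one described above could be sketched as follows. This is only an illustration, not the script used in the thread: the function name `pairs_in_one_run_only` is hypothetical, and it assumes the default tab-separated fastANI output in which the first two columns are the query and reference genomes.

```shell
# Hypothetical helper: print every query/reference pair that occurs in
# exactly one of the given fastANI output files, prefixed by that file.
# Assumption: columns 1 and 2 are the query and reference genome paths.
pairs_in_one_run_only() {
    awk -F'\t' '
        # Count each pair at most once per file.
        !seen[FILENAME, $1, $2]++ {
            pair = $1 "\t" $2
            runs[pair]++
            only_in[pair] = FILENAME
        }
        END {
            for (pair in runs)
                if (runs[pair] == 1)
                    print only_in[pair] "\t" pair
        }
    ' "$@" | sort
}
```

Calling it as `pairs_in_one_run_only run_*.txt` (file names are placeholders) would list, per run, the comparisons that no other run reported.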
Got it. Since one of the issues filed previously involved SLURM, I'm curious whether you are using SLURM too.
Is this a locally owned cluster? I'm wondering if you could arrange a temporary account for me (perhaps for a week).
Yes, I'm using SLURM.
Unfortunately it is a big shared cluster and I have no control over it, otherwise I'd be happy to give you access to it.
Thanks! I guess the bug might be related to SLURM. When you get a chance, can you send me your SLURM job script/commands and the job output log while running:
ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
/usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv #please add an extra command for me
cc'ing @luke-dt
#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --account=fnglanot
#SBATCH --qos=genepool
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=haswell
ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
srun --cpus-per-task=64 --ntasks=1 /usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv
Here's the log: slurm-31935719.txt
I executed the same command twice in a different cluster that uses PBS. For some reason it took much longer for FastANI to finish, but the outputs were different anyway.
wc -l fastani_output_1.txt fastani_output_2.txt
3130 fastani_output_1.txt
3149 fastani_output_2.txt
Here is the script that I run with sbatch:
#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --mem=30G
#SBATCH --cpus-per-task=4
#SBATCH --output=slurm_out/fastani/z_fastani_%A.out
#SBATCH --error=slurm_out/fastani/z_fastani_%A.out
module load fastani/1.3.1a
basedir="$PWD"
outdir="${basedir}/d03_species_analysis/fastani"
fastANI --ql "${outdir}/genomepaths.txt" \
--rl "${outdir}/genomepaths.txt" \
-o "${outdir}/fastani_out.txt" \
-t "${SLURM_CPUS_PER_TASK}"
log file for 4 threads (analysis worked), log file for 8 threads (comparisons missing)
After executing with 4 cores, I got consistent outputs. However, some comparisons are still missing: the output from the 4-core run has 3149 lines, whereas a file that I built by aggregating multiple executions has 3271 lines.
Here are ANI vs. % aligned plots for these two files:
It seems that most of the missing comparisons are from pairs with high ANI and low % aligned.
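One way to inspect which comparisons are missing, and what their ANI and % aligned look like, is sketched below. The function name `missing_from_run` is hypothetical, and the sketch assumes the default five-column fastANI output (query, reference, ANI, bidirectional fragment mappings, total query fragments), with % aligned taken as column 4 divided by column 5.

```shell
# Hypothetical helper: list pairs that are present in an aggregated file
# but absent from a single run, with their ANI and % aligned.
#   $1 = output of the single run, $2 = aggregated output
# Assumption: % aligned = 100 * column 4 / column 5.
missing_from_run() {
    awk -F'\t' '
        { key = $1 "\t" $2 }
        # First file: remember every pair the single run reported.
        FILENAME == ARGV[1] { have[key] = 1; next }
        # Second file: report pairs the single run did not contain.
        !(key in have) && $5 > 0 {
            printf "%s\t%s\t%s\t%.1f\n", $1, $2, $3, 100 * $4 / $5
        }
    ' "$1" "$2"
}
```

Sorting the result on the last column would make it easy to see whether the missing pairs cluster at low % aligned.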
Thanks! I'm able to reproduce inconsistent output at my end on a cluster with SLURM, which is good! Will reach out if I need more info.
I replicated a single publicly available genome and did an all-to-all comparison among the copies. For a few pairs, I do see <100% ANI reported in an inconsistent manner. Please give me some time to investigate.
You're welcome!
For further context: to build the first figure I executed FastANI with 64 cores in a PBS cluster and aggregated the results into a single file. For lines in which the first and second genomes were the same but a different ANI was reported, I chose the one with the highest % aligned (which usually corresponded to the lowest ANI).
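The aggregation rule described above (keep, for each pair, the row with the highest % aligned) could look roughly like this. It is a sketch rather than the actual script used for the figure; `aggregate_best_aligned` is a hypothetical name, and % aligned is assumed to be column 4 divided by column 5 of the default fastANI output.

```shell
# Hypothetical helper: merge several fastANI outputs and keep, for each
# query/reference pair, the row with the highest aligned fraction.
# Assumption: aligned fraction = column 4 / column 5.
aggregate_best_aligned() {
    awk -F'\t' '
        $5 > 0 {
            key = $1 "\t" $2
            frac = $4 / $5
            # Keep the first row seen, or any later row that aligns more.
            if (!(key in best) || frac > best[key]) {
                best[key] = frac
                row[key] = $0
            }
        }
        END { for (key in row) print row[key] }
    ' "$@" | sort
}
```

For example, `aggregate_best_aligned run_*.txt > aggregated.txt`, where the file names are placeholders.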
Hi @apcamargo , @luke-dt ,
Thanks again for your help! There was a bug in my code associated with file I/O. I've committed the fix to the master branch. When you get a chance, please run the code again and let me know if it also fixes the issue on your end. I will create a new fastANI version after I hear from you.
Hi guys (@apcamargo , @luke-dt) let me know if you were able to check.
Hi @cjain7! I just submitted the job and I'll let you know when I get the results.
The SLURM cluster I have access to is under maintenance, so I executed fastANI on a PBS cluster with -t 120.
$ sha256sum dereplicated_mags_ani_raw_*
05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965 dereplicated_mags_ani_raw_1.txt
05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965 dereplicated_mags_ani_raw_2.txt
05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965 dereplicated_mags_ani_raw_3.txt
05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965 dereplicated_mags_ani_raw_4.txt
05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965 dereplicated_mags_ani_raw_5.txt
The bug seems to be fixed! Thank you @cjain7!
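The checksum comparison above can be turned into a pass/fail check, for instance as below. This is a minimal sketch: `all_outputs_identical` is a hypothetical name, and it assumes GNU `sha256sum` is available and that identical runs also write their rows in the same order.

```shell
# Hypothetical helper: succeed only if all given files are byte-identical,
# judged by the number of distinct SHA-256 digests.
# Assumption: row order is stable across runs; otherwise sort the files first.
all_outputs_identical() {
    n_digests=$(sha256sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$n_digests" -eq 1 ]
}
```

Used as `all_outputs_identical dereplicated_mags_ani_raw_*.txt && echo consistent`, it would print `consistent` only when every run produced the same file.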
Good to know. Thanks! Closing the issue now. I'll create v1.4; please use that going forward.
Hey @cjain7
Even though the results are now consistent across runs, I noticed that there are still many comparisons missing from the output. I know that fastANI won't report comparisons of genomes with low % of alignment, but some of the missing comparisons were present in previous runs. Is this behaviour expected?
Yeah, I think (or at least I hope) that the output will be consistent from now onwards. Those cases you mention are probably borderline cases that cleared the ~80% cutoff by a small margin due to the previous bug.
The strange thing is that the number of genomes in the output (520) is less than the total number of genomes (522), meaning that there are two genomes that are not being compared with themselves (certainly more than 80%)
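To find out which genomes lack a self-comparison, something along these lines might help. It is only a sketch: `genomes_without_self_hit` is a hypothetical name, and it assumes that fastANI writes genome paths to its output exactly as they appear in the input list.

```shell
# Hypothetical helper: print genomes from the input list that have no
# self-comparison row (query == reference) in the fastANI output.
#   $1 = genome list (one path per line), $2 = fastANI output
# Assumption: paths in the output match the paths in the list verbatim.
genomes_without_self_hit() {
    awk -F'\t' '
        # Output file (read first): remember genomes compared with themselves.
        FILENAME == ARGV[1] { if ($1 == $2) self[$1] = 1; next }
        # Genome list: print entries that never had a self-comparison.
        !($0 in self)
    ' "$2" "$1"
}
```

An empty result would suggest that every genome is present and that the discrepancy comes from downstream processing instead.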
Can you check if they have the same file names?
Please create a new issue with more information (e.g., log files, input command etc.) if you would like me to look further.
I was just preparing a bug report and I noticed that the bug was in the script I was using to process the output. Sorry for the trouble!
Hello, I'm having the same inconsistency problem, but I'm not running FastANI via SLURM. I'm running it on an Ubuntu machine and installed it using the compiled version from the master branch.
~/Downloads/FastANI/fastANI --ql filespaths.txt --rl filespaths.txt -t 7 -o ~/Documents/miriam/fastani_results.txt
The output table has different ANI values for the same compared genomes.
Hi @cjain7!
I'm using FastANI to compare a set of approximately 500 MAGs. To do that, I'm executing:
Across multiple runs I observed that the output varies significantly. For instance, in some cases a comparison of a genome with itself would (a) have a low aligned fraction (~40%), (b) have ~100% of the genome aligned, or (c) wouldn't even show in the output (presumably due to low coverage of the alignment). I've also seen different genomes with high ANI between them (~98%) sometimes appear in the output and sometimes not.
In all my 1 vs. 1 comparisons the output was consistent. The discrepant results appeared only when comparing two lists (in this case, the same list was used as both query and reference).
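The self-comparison symptom described above (a genome aligned to itself with a low aligned fraction) can be screened for with a short filter. This is a sketch under assumptions: `weak_self_hits` is a hypothetical name, the threshold is an arbitrary parameter, and the aligned fraction is taken as column 4 divided by column 5 of the default fastANI output.

```shell
# Hypothetical helper: report self-comparisons whose aligned fraction is
# below a threshold.
#   $1 = minimum aligned fraction (e.g. 0.9), remaining args = output files
# Assumption: aligned fraction = column 4 / column 5.
weak_self_hits() {
    min_frac=$1
    shift
    awk -F'\t' -v min_frac="$min_frac" '
        $1 == $2 && $5 > 0 && $4 / $5 < min_frac {
            printf "%s\t%s\t%.3f\n", FILENAME, $1, $4 / $5
        }
    ' "$@"
}
```

Self-comparisons that are absent from the output altogether will not show up here; they need a separate check against the input genome list.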
Here are the outputs of two independent runs: dereplicated_mags_ani_raw_1.txt dereplicated_mags_ani_raw_2.txt
EDIT: I performed a new test using the master branch. The results are still inconsistent and comparisons are missing from all the outputs I'm obtaining.