Inconsistent output across runs of all-versus-all ANI computation

apcamargo commented 4 years ago

Hi @cjain7!

I'm using FastANI to compare a set of approximately 500 MAGs. To do that, I'm executing:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt

Across multiple runs I observed that the output varies significantly. For instance, in some cases a comparison of a genome with itself would (a) have a low aligned fraction (~40%), (b) have ~100% of the genome aligned, or (c) wouldn't even show in the output (presumably due to low coverage of the alignment). I've also seen different genomes with high ANI between them (~98%) sometimes appear in the output and sometimes not.

In all my 1 vs. 1 comparisons the output was consistent. The discrepant results appeared only when comparing two lists (in this case, the same list was used as both query and reference).

Here are the outputs of two independent runs: dereplicated_mags_ani_raw_1.txt, dereplicated_mags_ani_raw_2.txt

EDIT: I performed a new test using the master branch. The results are still inconsistent and comparisons are missing from all the outputs I'm obtaining.
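
A quick way to see which comparisons differ between two runs is to load both outputs and diff them by genome pair. The sketch below is a minimal example, not part of FastANI. It assumes the tab-delimited output layout described in the FastANI README (query, reference, ANI, mapped fragments, total query fragments); the function names `read_fastani` and `compare_runs` and the tolerance value are made up for illustration.

```python
def read_fastani(lines):
    """Map (query, reference) -> (ANI, mapped fragments, total fragments).

    `lines` can be an open file or any iterable of strings.
    """
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        query, reference = fields[0], fields[1]
        table[(query, reference)] = (float(fields[2]), int(fields[3]), int(fields[4]))
    return table


def compare_runs(run_a, run_b, tol=1e-4):
    """Return pairs only in run A, pairs only in run B, and shared pairs whose ANI differs."""
    only_a = sorted(set(run_a) - set(run_b))
    only_b = sorted(set(run_b) - set(run_a))
    changed = sorted(
        pair
        for pair in set(run_a) & set(run_b)
        if abs(run_a[pair][0] - run_b[pair][0]) > tol
    )
    return only_a, only_b, changed
```

With the two attached files, this could be called as `compare_runs(read_fastani(open("dereplicated_mags_ani_raw_1.txt")), read_fastani(open("dereplicated_mags_ani_raw_2.txt")))`.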

cjain7 commented 4 years ago

Thanks for sharing this problem.

I've noticed multiple issues that highlight this problem (also see #37 and #58); unfortunately, I've failed to reproduce it on our compute clusters. I'm willing to invest time in this problem, but I need help so that I can reproduce the behavior at my end for debugging.

Are you able to provide more details (e.g., Mac/Linux, GCC version, input data files)?

apcamargo commented 4 years ago

I got this issue with both the Conda version (I believe they use GCC 7.*) and a statically compiled version built on my personal computer (master branch, Ubuntu 16.04, GCC 7.5.0). I executed the runs on a cluster running SUSE Linux Enterprise Server 15.

By the way, I hit a bug while compiling FastANI on my PC and submitted a PR fixing it: https://github.com/ParBLiSS/FastANI/pull/68

I don't think I can share this specific dataset because it isn't mine. But I'll try to replicate the issue with my own genomes so I can send you the data. I can't promise that I'll be able to do that in the next few days, though.

Just to illustrate the extent of the inconsistency: I executed the all-versus-all comparison eight times and each run had ~16 comparisons that were not found in any of the other ones. I also noticed that this greatly influenced the definition of species (using an algorithm similar to the one used by GTDB).
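
The per-run count of comparisons that appear in no other run can be obtained with plain set arithmetic. This is a minimal sketch under the assumption that the first two tab-separated columns of each output line are the query and reference genomes; `pairs_in` and `unique_to_each_run` are hypothetical helper names.

```python
def pairs_in(lines):
    """Collect the (query, reference) pairs from FastANI output lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:  # ignore blank or truncated lines
            pairs.add((fields[0], fields[1]))
    return pairs


def unique_to_each_run(runs):
    """Given {run name: set of pairs}, return {run name: pairs seen in no other run}."""
    result = {}
    for name, pairs in runs.items():
        # Union of the pairs reported by every run except the current one.
        others = set().union(*(p for n, p in runs.items() if n != name))
        result[name] = pairs - others
    return result
```

Printing `len(v)` for each value of the returned dictionary gives one number per run, which is the quantity quoted above.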

cjain7 commented 4 years ago

Got it. Since one of the previously filed issues involved SLURM, I'm curious whether you are using SLURM too?

cjain7 commented 4 years ago

Is this a locally owned cluster? I'm wondering if you can arrange a temporary account for me (perhaps for a week)?

apcamargo commented 4 years ago

Yes, I'm using SLURM.

Unfortunately it is a big shared cluster and I have no control over it, otherwise I'd be happy to give you access to it.

cjain7 commented 4 years ago

Thanks! I guess the bug might be related to SLURM. When you get a chance, can you send me your SLURM job script/commands and the job output log from running:

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
/usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv #please add an extra command for me

cjain7 commented 4 years ago

cc'ing @luke-dt

apcamargo commented 4 years ago

#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --account=fnglanot
#SBATCH --qos=genepool
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=haswell

ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
srun --cpus-per-task=64 --ntasks=1 /usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv

Here's the log: slurm-31935719.txt

apcamargo commented 4 years ago

I executed the same command twice in a different cluster that uses PBS. For some reason it took much longer for FastANI to finish, but the outputs were different anyway.

wc -l fastani_output_1.txt fastani_output_2.txt
   3130 fastani_output_1.txt
   3149 fastani_output_2.txt

luke-dt commented 4 years ago

Here is the script that I run with sbatch:

#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --mem=30G
#SBATCH --cpus-per-task=4
#SBATCH --output=slurm_out/fastani/z_fastani_%A.out
#SBATCH --error=slurm_out/fastani/z_fastani_%A.out

module load fastani/1.3.1a

basedir="$PWD"
outdir="${basedir}/d03_species_analysis/fastani"

fastANI --ql ${outdir}/genomepaths.txt \
        --rl ${outdir}/genomepaths.txt \
        -o ${outdir}/fastani_out.txt \
        -t ${SLURM_CPUS_PER_TASK}

log file for 4 threads (analysis worked)
log file for 8 threads (comparisons missing)

apcamargo commented 4 years ago

After executing with 4 cores I got consistent outputs. However, some comparisons are missing: the output from the 4-core execution has 3149 lines, whereas a file that I built by aggregating multiple executions has 3271 lines.

Here are ANI vs. % aligned plots for these two files:

[Figure: index2]

[Figure: index]

It seems that most of the missing comparisons are from pairs with high ANI and low % aligned.
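
To inspect the missing comparisons directly instead of reading them off a plot, one could list every pair that is in the aggregated file but absent from the single run, together with its ANI and % aligned. The sketch assumes that % aligned is the ratio of column 4 (mapped fragments) to column 5 (total query fragments), as the FastANI README describes the alignment fraction; `parse` and `missing_pairs` are illustrative names.

```python
def parse(lines):
    """Map (query, reference) -> (ANI, percent of query fragments aligned)."""
    rows = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        ani = float(fields[2])
        aligned = 100.0 * int(fields[3]) / int(fields[4])
        rows[(fields[0], fields[1])] = (ani, aligned)
    return rows


def missing_pairs(aggregated, single_run):
    """Pairs present in `aggregated` but not in `single_run`, with ANI and % aligned."""
    return sorted(
        (pair, ani, aligned)
        for pair, (ani, aligned) in aggregated.items()
        if pair not in single_run
    )
```

Sorting the returned list by its third element would show whether the missing pairs cluster at low % aligned.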

cjain7 commented 4 years ago

Thanks! I'm able to reproduce inconsistent output at my end on a cluster with SLURM, which is good! Will reach out if I need more info.

I replicated a single publicly available genome and ran an all-to-all comparison among the copies. For a few pairs, I do see <100% ANI reported in an inconsistent manner. Please give me some time to investigate.

apcamargo commented 4 years ago

You're welcome!

For further context: to build the first figure I executed FastANI with 64 cores on a PBS cluster and aggregated the results into a single file. For lines that reported the same query/reference pair with a different ANI, I chose the one with the highest % aligned (which usually corresponded to the lowest ANI).
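
The aggregation rule described here (keep, for each pair, the record with the highest % aligned) might look like the following. It is a sketch under assumptions, not the script that was actually used: the column layout is taken from the FastANI README, the aligned fraction is computed as column 4 divided by column 5, and `aggregate` is a made-up name.

```python
def aggregate(runs):
    """Merge several runs, keeping the record with the highest aligned fraction per pair.

    `runs` is an iterable of runs; each run is an iterable of output lines.
    Returns {(query, reference): (ANI, aligned fraction)}.
    """
    best = {}
    for lines in runs:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            key = (fields[0], fields[1])
            ani = float(fields[2])
            aligned = int(fields[3]) / int(fields[4])
            # Replace the stored record only when this one aligns more of the query.
            if key not in best or aligned > best[key][1]:
                best[key] = (ani, aligned)
    return best
```

Ties keep the record that was seen first, so the result can depend on the order in which the runs are passed.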

cjain7 commented 4 years ago

Hi @apcamargo , @luke-dt ,

Thanks again for your help! There was a bug in my code associated with file I/O. I've committed the fix to the master branch. When you get a chance, please run the code again and let me know if it also fixes the issue at your end. I will create a new fastANI version after I hear from you.

cjain7 commented 4 years ago

Hi guys (@apcamargo , @luke-dt) let me know if you were able to check.

apcamargo commented 4 years ago

Hi @cjain7! I just submitted the job and I'll let you know when I get the results.

apcamargo commented 4 years ago

The SLURM cluster I have access to is under maintenance, so I executed fastANI on a PBS cluster with -t 120.

$ sha256sum dereplicated_mags_ani_raw_*

  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_1.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_2.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_3.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_4.txt
  05e10c23bcb57acd99d87680b35f374e5e07f3c70e8c48063e7cabb746b2f965  dereplicated_mags_ani_raw_5.txt

The bug seems to be fixed! Thank you @cjain7!
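
Identical checksums show that the files match byte for byte, including line order. If line order could differ between otherwise equivalent runs (an assumption, not something observed in this thread), an order-independent digest is a complementary check. The sketch below uses only Python's standard `hashlib`; `file_digest` and `content_digest` are hypothetical helper names.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, equivalent to what sha256sum reports."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large outputs do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_digest(lines):
    """SHA-256 over the sorted lines, so the result ignores line order."""
    digest = hashlib.sha256()
    for line in sorted(lines):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()
```

Two runs are equivalent up to ordering when `content_digest(open(a))` equals `content_digest(open(b))`.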

cjain7 commented 4 years ago

Good to know. Thanks! Closing the issue now. I'll create v1.4; please use that going forward.

apcamargo commented 4 years ago

Hey @cjain7

Even though the results are now consistent across runs, I noticed that there are still many comparisons missing from the output. I know that fastANI won't report comparisons of genomes with low % of alignment, but some of the missing comparisons were present in previous runs. Is this behaviour expected?

cjain7 commented 4 years ago

Yeah, I think (or at least I hope) that the output will be consistent from now onwards. The cases you mention are probably borderline cases that cleared the ~80% cutoff by a small margin due to the previous bug.

apcamargo commented 4 years ago

The strange thing is that the number of genomes in the output (520) is less than the total number of genomes (522), meaning that there are two genomes that are not being compared with themselves (certainly more than 80%).
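
A check of this kind can be scripted by comparing the genome list against the self-comparisons found in the output. The sketch assumes that the query and reference columns contain the paths exactly as they appear in `genome_list.txt`; `genomes_missing_self_hit` is an illustrative name.

```python
def genomes_missing_self_hit(genome_paths, output_lines):
    """Return the genomes from the input list that have no self-comparison in the output."""
    self_hits = set()
    for line in output_lines:
        fields = line.rstrip("\n").split("\t")
        # A self-comparison has the same path in the query and reference columns.
        if len(fields) >= 2 and fields[0] == fields[1]:
            self_hits.add(fields[0])
    return sorted(set(genome_paths) - self_hits)
```

The genome list can be read with `[line.strip() for line in open("genome_list.txt") if line.strip()]`.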

cjain7 commented 4 years ago

Can you check if they have the same file names?
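
One way to run that check is to group the listed paths by basename and report any name that occurs more than once. This is a minimal sketch; `duplicate_basenames` is a made-up helper, and it only inspects the path strings, not the file contents.

```python
import os
from collections import defaultdict


def duplicate_basenames(paths):
    """Return {file name: [paths]} for every file name shared by more than one path."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.basename(path)].append(path)
    return {name: found for name, found in groups.items() if len(found) > 1}
```

An empty dictionary means that every genome in the list has a distinct file name.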

cjain7 commented 4 years ago

Please create a new issue with more information (e.g., log files, input command etc.) if you would like me to look further.

apcamargo commented 4 years ago

I was just preparing a bug report and I noticed that the bug was in the script I was using to process the output. Sorry for the trouble!

Valentin-Bio commented 1 year ago

Hello, I'm having the same inconsistency problem, but I'm not running FastANI via SLURM. I'm running it on an Ubuntu machine and installed it using the compiled version from the master branch.

~/Downloads/FastANI/fastANI --ql filespaths.txt --rl filespaths.txt -t 7 -o ~/Documents/miriam/fastani_results.txt

The output table has different ANI values for the same compared genomes.

ParBLiSS / FastANI

Inconsistent output across runs of all-versus-all ANI computation #67