MrOlm / drep

Rapid comparison and dereplication of genomes
263 stars 37 forks

Extremely slow on cluster #215

Closed schmittel closed 8 months ago

schmittel commented 11 months ago

Hi there,

I'm having an issue where it takes ~4 days to dereplicate 1500 bacterial assemblies. I have many batches of ~1500 assemblies each, so overall this is going to take far too long. Given your knowledge of the different programs run by dRep and their efficiencies, could you offer any advice for optimizing the cluster jobs that I am submitting? Here are the parameters I am currently working with:

# Number of nodes and MPI tasks per node:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
# Enable Hyperthreading:
#SBATCH --ntasks-per-core=2
# for OpenMP:
#SBATCH --cpus-per-task=20

Do you have any suggestions for adjustments that might be specifically optimal for dRep?

Here's my dRep command:

dRep dereplicate \
        --processors 40 \
        --genomes "${input_dir}/${genome}.txt" \
        --genomeInfo "${genome_info_dir}/${genome}.csv" \
        --completeness 50 \
        --contamination 10 \
        --S_algorithm ANImf \
        --S_ani 0.95 \
        --run_tertiary_clustering \
        --SkipMash \
        --cov_thresh 0.4 \
        "${output_dir}/${genome}"

Many thanks

MrOlm commented 11 months ago

Hi @schmittel. I see. The issue really comes down to the combination of --SkipMash and --S_algorithm ANImf. When you skip Mash, dRep has to run 1500 x 1500 = 2,250,000 genome comparisons. So even if each comparison is relatively quick, the run is going to take a long time (4 days is about what I would expect).
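To make the scaling concrete, here is a small sketch of the arithmetic only. It counts all-vs-all comparisons as n * n for a few batch sizes; the 500 and 1000 batch sizes are illustrative and do not come from the thread, and the snippet says nothing about how dRep schedules the work internally.

```shell
# All-vs-all comparison count grows with the square of the batch size.
# Batch sizes 500 and 1000 are illustrative; 1500 is the size from this thread.
for n in 500 1000 1500; do
    echo "${n} genomes -> $((n * n)) comparisons"
done

# The figure quoted above for a 1500-genome batch.
total=$((1500 * 1500))
echo "total for 1500 genomes: ${total}"
```

Because the count is quadratic, halving a batch cuts the number of comparisons to roughly a quarter.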

In this case the main thing to do is to switch to --S_algorithm fastANI. It's about 10 times faster than ANImf, so your run should take about 10% as long. You could also remove --run_tertiary_clustering in this scenario, since with --SkipMash it probably isn't impacting things much anyway.
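Applied to the command from the original post, the two suggestions might look like the sketch below. Every flag and variable is taken from that post; the only changes are the --S_algorithm value and the dropped --run_tertiary_clustering. The `s_algorithm` variable and the PATH check are additions for illustration, and the sketch assumes fastANI is installed where dRep can find it. If the roughly 10x speedup holds, a 4-day (96-hour) run would come down to somewhere around 10 hours, though that is an estimate rather than a measured result.

```shell
# Secondary clustering algorithm: fastANI instead of ANImf.
s_algorithm="fastANI"

# Run only if dRep is available; input_dir, genome_info_dir, output_dir and
# genome are the variables from the original job script.
if command -v dRep >/dev/null 2>&1; then
    dRep dereplicate \
        --processors 40 \
        --genomes "${input_dir}/${genome}.txt" \
        --genomeInfo "${genome_info_dir}/${genome}.csv" \
        --completeness 50 \
        --contamination 10 \
        --S_algorithm "${s_algorithm}" \
        --S_ani 0.95 \
        --SkipMash \
        --cov_thresh 0.4 \
        "${output_dir}/${genome}"
else
    echo "dRep not found on PATH" >&2
fi
```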

Best, Matt