Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

RepeatModeler parallelization not working? #173

Closed TaehyungKwon closed 1 year ago

TaehyungKwon commented 2 years ago

Dear all,

I used RepeatModeler successfully years ago and recently installed the conda version (2.0.2). However, the new version of RepeatModeler (more precisely, RepeatScout) seems to be stuck on a single thread, even though I set the -pa parameter. This is the command line for my RepeatModeler run:

RepeatModeler \
    -pa $(( (n_thread)/4 )) \
    -engine ncbi \
    -database ${out_prefix}

I am submitting this job on a Slurm cluster. Is there something I am missing?
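For context, a minimal sbatch wrapper around this command might look like the sketch below; the resource values and the n_thread/out_prefix assignments are illustrative placeholders, not values from my actual job:

#!/bin/bash
#SBATCH --ntasks=1                 # one multi-threaded process
#SBATCH --cpus-per-task=32         # illustrative core count
#SBATCH -t 72:00:00

n_thread=${SLURM_CPUS_PER_TASK}    # derive the thread budget from the allocation
out_prefix=my_genome_db            # placeholder: name used with BuildDatabase

RepeatModeler \
    -pa $(( (n_thread)/4 )) \
    -engine ncbi \
    -database ${out_prefix}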

Thanks, Taehyung

ypchan commented 2 years ago


I also have this problem. Did you solve it? Thanks, @TaehyungKwon

TaehyungKwon commented 2 years ago

@ypchan Not yet. I just ran it with a single thread.

ChristophePatterson commented 1 year ago

Hi @TaehyungKwon, I have also been trying to get RepeatModeler2 to work on an HPC cluster that uses Slurm. I have found a solution that works for me, although I don't know why it works: I resolved the lack of threading under Slurm by adding the following lines to my Slurm script.

unset OMP_PROC_BIND
unset OMP_PLACES

Before adding these lines my estimated runtime for round 6 was >300 hours. I am now able to run the full RepeatModeler2 pipeline on a 1.4 Gb draft genome in just over 24 hours.
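You can confirm whether your batch environment exports these variables in the first place with a quick check like this (just a grep over the environment, nothing RepeatModeler-specific):

env | grep -E '^OMP_(PROC_BIND|PLACES)='    # prints the values if the scheduler set them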

My whole Slurm script is as follows (you would also need to run BuildDatabase before this code):

#!/bin/bash

#SBATCH -c 32                 # CPU cores for the job
#SBATCH --mem=200G            # memory required, in units of K, M or G, up to 250G
#SBATCH --gres=tmp:400G       # $TMPDIR space required on each compute node, up to 400G
#SBATCH -t 72:00:00           # time limit (hh:mm:ss)

module load bioinformatics
module load repeatmodeler/2.0.3

# Lift the OpenMP thread-affinity settings inherited from the scheduler
unset OMP_PROC_BIND
unset OMP_PLACES

cd /nobackup/tmjj24/H_titia_genome/repeatmodeler/

# The database must already have been created with BuildDatabase
RepeatModeler -database H_titia_trial_run -pa 32 -LTRStruct

cd ~

Hope this helps!

rmhubley commented 1 year ago

Currently RepeatModeler is parallelized under a multi-threaded model, not a distributed-computing model. We have a project underway to refactor the code to run under both models. @ChristophePatterson, your SLURM job simply allocates a single node with 32 CPUs in your cluster to run your RepeatModeler job. How you went from 300 hours to 24 simply by allocating a node through SLURM versus running it interactively on a login node (with 32 CPUs) is strange, and probably points to a deeper problem with the SLURM run. Do you have the log files generated from both runs?

@TaehyungKwon, your observation is correct: not all stages of the RepeatModeler pipeline are parallelizable. For instance, the RepeatScout tool (Price et al.) cannot be parallelized due to processing dependencies, so we simply run it in the main program thread and wait for the results. Similarly, RECON is not multi-threaded. However, other tools in the package are, and we pass the value given in -pa along to them.

In addition, one of the most time-consuming steps is the all-vs-all sequence alignment performed in rounds 2-6. This step can be broken into batches and run in parallel, and that is where the -pa parameter has the largest effect: it sets how many batches run at the same time.

There is one more parallelizable step that we plan to implement in the next release, which should improve the runtime of the later rounds (5, 6, etc.): pre-masking the samples prior to the all-vs-all alignment. Look for that improvement in the 2.0.4 release.
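As a rough shell illustration of this batching idea (a generic sketch, not RepeatModeler's actual implementation; align_batch stands in for the underlying search-engine call):

# Given batch files batch_01.fa ... batch_08.fa (any real split must respect
# FASTA record boundaries), run up to 4 alignments at once; this is the role -pa plays:
ls batch_*.fa | xargs -P 4 -I{} sh -c 'align_batch "{}" > "{}.out"'
cat batch_*.fa.out > all_vs_all.out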

cinnamon259 commented 1 year ago

@rmhubley was this issue with SLURM jobs resolved? I am running into the same issue: round 6 is taking 400+ hours on a 1.1 Gb genome.

Thanks,

Cinnamon

mylena-s commented 1 year ago

Sorry for the intrusion; I also run RepeatModeler with Slurm, and I managed to run it in parallel by including the following lines in the Slurm header (otherwise it generated a single process). In this case I ran it with -threads 40 or a little more:

#SBATCH --ntasks=10

#SBATCH --cpus-per-task=4

Hopefully it will help somebody with the same problem.
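For reference, a minimal script along these lines could look like the sketch below (resource values are illustrative; for a single multi-threaded process, as rmhubley describes above, --ntasks=1 with a larger --cpus-per-task may be the more conventional layout):

#!/bin/bash
#SBATCH --ntasks=10             # 10 tasks x 4 CPUs = 40 CPUs allocated in total
#SBATCH --cpus-per-task=4
#SBATCH -t 72:00:00             # illustrative time limit

unset OMP_PROC_BIND             # the workaround suggested earlier in this thread
unset OMP_PLACES

RepeatModeler -database my_db -threads 40 -LTRStruct   # my_db is a placeholder database name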