Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Low performance running RepeatMasker on an HPC cluster #276

Open manighanipoor opened 2 months ago

manighanipoor commented 2 months ago

I am running RepeatMasker on a snake genome with a de novo TE library (created by RM2) on an HPC cluster using 20 CPUs, but the speed is very low and I encounter node failures. I contacted HPC support and they believe this happens because of system overhead due to the high number of batches. I have since found that RepeatMasker performs better if I increase the "-frag" option to 1000000, since that reduces the number of batches. Do you think this would affect TE identification sensitivity or accuracy?

Cheers, Mani

rmhubley commented 2 months ago

We use clusters at UCSC, Texas Tech, and the University of Arizona and I haven't seen an issue with batch overhead, but perhaps your cluster has some restrictive quotas that are interfering with the runs. With any cluster I would recommend making sure you are running on a disk local to the machine for speed, breaking your sequence up into batches of 50MB (or larger), and running the batches independently through RepeatMasker on different nodes, leaving the -frag parameter at its default. We have a Nextflow script that does this for you on Slurm-based clusters ( https://github.com/Dfam-consortium/RepeatMasker_Nextflow ).

If you change the -frag parameter, you increase the size of the window in which the GC background value is determined. That value is used to select the appropriate scoring matrix for aligning the consensus sequences, so if you increase it too much you will probably lose some lower-scoring annotations in your output.
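For illustration only, here is a minimal sketch of that batching pattern as a standalone Nextflow script (this is not the official RepeatMasker_Nextflow pipeline; the process name and the params.genome / params.library parameter names are invented for this example). It splits the genome into roughly 50MB FASTA chunks with Nextflow's splitFasta operator and runs each chunk as an independent task, which a Slurm executor can then place on different nodes:

nextflow.enable.dsl = 2

// Run RepeatMasker on a single FASTA chunk. -pa sets the number of
// parallel search jobs and -lib supplies the custom TE library.
process REPEATMASKER_CHUNK {
    cpus 4

    input:
    path chunk

    output:
    path "*.out", optional: true  // RepeatMasker may not emit .out for repeat-free chunks

    script:
    """
    RepeatMasker -pa ${task.cpus} -lib ${params.library} ${chunk}
    """
}

workflow {
    // Split the genome into ~50MB FASTA files; each becomes its own task.
    chunks = Channel.fromPath(params.genome)
                    .splitFasta(size: 50.MB, file: true)
    REPEATMASKER_CHUNK(chunks)
}

With a Slurm executor configured, each chunk runs as a separate job, which is the "independent batches on different nodes" pattern described above.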

manighanipoor commented 1 month ago

Thanks,

How can we configure the RepeatMasker_Nextflow script to run batches on different nodes? It doesn't seem to be preconfigured for that. Should I ask HPC support to do that?

Cheers, Mani

rmhubley commented 1 month ago

That's exactly what it's meant to do; we regularly run it on hundreds of nodes. There is a "--cluster" option that currently accepts either "local" or one of several cluster names that we use. You will need to edit the RepeatMasker_Nextflow.nf file and configure it for your needs. For instance, look in the script for where quanah is defined:

///////
/////// CUSTOMIZE CLUSTER ENVIRONMENT HERE BY ADDING YOUR OWN
/////// CLUSTER NAMES OR USE 'local' TO RUN ON THE CURRENT
/////// MACHINE.
///////
// No cluster...just local execution
if ( params.cluster == "local" ) {
...
} else if ( params.cluster == "quanah" || params.cluster == "nocona" ) {
  thisExecutor = "slurm"
  thisQueue = params.cluster
  thisOptions = "--tasks=1 -N 1 --cpus-per-task=${proc} --exclude=cpu-23-1"
  thisAdjOptions = "--tasks=1 -N 1 --cpus-per-task=2 --exclude=cpu-23-1"
  ucscToolsDir = "/lustre/work/daray/software/ucscTools"
  repeatMaskerDir = "/lustre/work/daray/software/RepeatMasker-4.1.2-p1"
  thisScratch = false
}

You would modify this block to accept the name of your cluster and set its parameters there. Nextflow supports quite a few cluster job managers; the above example uses Slurm. Once you have made your changes, you simply use the "--cluster myclustername" option when you run.
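As a sketch only, a block added for a hypothetical cluster named "mycluster" might look like the following; the queue name, Slurm options, and installation paths are all placeholders to replace with your site's actual values:

} else if ( params.cluster == "mycluster" ) {
  thisExecutor = "slurm"                                  // Nextflow executor to use
  thisQueue = "normal"                                    // your Slurm partition name
  thisOptions = "--tasks=1 -N 1 --cpus-per-task=${proc}"  // Slurm options for the main RepeatMasker jobs
  thisAdjOptions = "--tasks=1 -N 1 --cpus-per-task=2"     // Slurm options for the smaller 2-CPU tasks
  ucscToolsDir = "/path/to/ucscTools"                     // where the UCSC tools live on your cluster
  repeatMaskerDir = "/path/to/RepeatMasker"               // your RepeatMasker installation
  thisScratch = false                                     // or a node-local scratch path, if available
}

You would then invoke the pipeline with "--cluster mycluster".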

manighanipoor commented 1 month ago

Thanks for your help.

manighanipoor commented 1 month ago

You mentioned: "With any cluster I would recommend making sure you are running on a local disk (local to the machine) for speed."

Just wondering, how can we make it run on a local disk on the cluster?

rmhubley commented 1 month ago

This depends on your cluster's architecture. Most often the individual compute nodes have a hard drive (or SSD) attached, though the administrator may have decided not to make it accessible to jobs running on the node. On many clusters the admins have set up a scratch area on those local drives, where files can be copied and processes can create temporary files more efficiently than over NFS. If your cluster supports this, the Nextflow script can take advantage of it.
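In Nextflow terms, this is what the scratch directive controls. As a minimal sketch, assuming your compute nodes do expose local scratch space, you could enable it in a nextflow.config (in the RepeatMasker_Nextflow script, the thisScratch variable in each cluster block appears to play this role):

process {
    // Stage each task's work directory on node-local disk rather than
    // the shared filesystem. 'true' uses the node's $TMPDIR; a path
    // string such as '/local/scratch' selects a specific directory.
    scratch true
}

Ask your HPC support where (or whether) node-local scratch is mounted before turning this on.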

manighanipoor commented 1 month ago

Thanks for the comment