Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Timing out on minimap-nd tasks #203

Open wgallin opened 6 months ago

wgallin commented 6 months ago

My assembly job is failing because the time limit is exceeded during some of the minimap-nd jobs.

It appears that when parallel tasks are run, the time allocated to them is shorter than the time they take to complete.

An example log entry for a single job (it appears that 10 of the 100 submitted jobs have failed) is shown here:

Error message hostname

Input data

This is the relevant part of the slurm.out file:

[100999 INFO] 2024-03-30 02:52:07 NextDenovo start...
[100999 INFO] 2024-03-30 02:52:08 version:Unknown logfile:pid100999.log.info
[100999 WARNING] 2024-03-30 02:52:09 Re-write workdir
[100999 INFO] 2024-03-30 02:52:09 mkdir: /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly
[100999 INFO] 2024-03-30 02:52:10 mkdir: /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/01.raw_align
[100999 INFO] 2024-03-30 02:52:10 mkdir: /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/02.cns_align
[100999 INFO] 2024-03-30 02:52:10 mkdir: /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/03.ctg_graph
[100999 INFO] 2024-03-30 02:52:18 Total jobs: 1
[100999 INFO] 2024-03-30 02:52:18 Submitted jobID:[18223332] jobCmd:[/scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/01.raw_align/01.db_stat.sh.work/db_stat1/Trial02.sh] in the slurm_cycle.
[100999 INFO] 2024-03-30 02:54:20 db_stat done
[100999 INFO] 2024-03-30 02:54:20 updated options:
rerun:                        3
task:                         all
deltmp:                       1
rewrite:                      1
read_type:                    ont
job_type:                     slurm
input_type:                   raw
read_cutoff:                  1k
pa_correction:                5
seed_cutfiles:                5
parallel_jobs:                32
seed_depth:                   38.12
genome_size:                  300m
seed_cutoff:                  10000
job_prefix:                   Trial02
blocksize:                    983465750
ctg_cns_options:              -p 30
nextgraph_options:            -a 1
sort_options:                 -m 50g -t 30 -k 40
minimap2_options_map:         -x map-ont
minimap2_options_raw:         -t 8 -x ava-ont
input_fofn:                   /scratch/wgallin/NextDeNovo_Test01/input.fofn
correction_options:           -p 30 -max_lq_length 10000 -r ont -min_len_seed 5000
workdir:                      /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly
minimap2_options_cns:         -t 8 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000
raw_aligndir:                 /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/01.raw_align
cns_aligndir:                 /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/02.cns_align
ctg_graphdir:                 /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/03.ctg_graph
[100999 INFO] 2024-03-30 02:54:20 summary of input data:
file: /scratch/wgallin/NextDeNovo_Test01/Trial_02_Ppen_NextDenovo_Assembly/01.raw_align/input.reads.stat

[Read length stat]
Types    Count (#)    Length (bp)
N10      49686        39610
N20      138374       24804
N30      277076       15991
N40      488598       10686
N50      795459       7571
N60      1219406      5562
N70      1792624      4116
N80      2576448      2961
N90      3705002      1970

Types       Count (#)    Bases (bp)     Depth (X)
Raw         7575648      28638422273    95.46
Filtered    1971087      1286477110     4.29
Clean       5604561      27351945163    91.17

*Suggested seed_cutoff (genome size: 300.00Mb, expected seed depth: 45, real seed depth: 38.12): 10000 bp

Config file

[General]
job_type = slurm
job_prefix = Trial02
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 32
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = Trial_02_Ppen_NextDenovo_Assembly

[correct_option]
read_cutoff = 1k
genome_size = 300m # estimated genome size
sort_options = -m 50g -t 30
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1

Operating system

LSB Version: n/a
Distributor ID: Gentoo
Description: Gentoo Base System release 2.6
Release: 2.6
Codename: n/a

GCC

gcc version 9.3.0 (GCC)

Python

3.11

NextDenovo

2.5.2

moold commented 6 months ago

Two solutions:

  1. It seems that your system limits the running time of a job, so you can reduce blocksize and increase seed_cutfiles to reduce the size of each subfile and speed up each map task. The total running time may be longer, though.
  2. See here or here to adjust the submit command.
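
As a rough illustration of both suggestions, the sketch below writes a grid-mode config with a smaller blocksize, more seed_cutfiles, and a custom submit command. It is not a tested recipe. The file name run_grid.cfg, the value 500m, the value 10, and the 2-day --time are all illustrative. The `submit` line assumes that NextDenovo passes the template through to Paralleltask and fills in the {cpu}, {mem}, {out}, {err} and {script} placeholders for each sub-job; please check that against the Paralleltask cluster.cfg template for your installed version before relying on it.

```shell
# Write a grid-mode config sketch. The file name run_grid.cfg is hypothetical
# and the tuning values are illustrative, not recommendations.
cat > run_grid.cfg <<'EOF'
[General]
job_type = slurm
job_prefix = Trial02
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 32
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = Trial_02_Ppen_NextDenovo_Assembly
# Assumption: extra sbatch flags such as --time can be added to the submit
# template, and the {placeholders} are substituted per sub-job.
submit = sbatch --time=2-00:00:00 --cpus-per-task={cpu} --mem-per-cpu={mem} -o {out} -e {err} {script}

[correct_option]
read_cutoff = 1k
genome_size = 300m
# Smaller blocks and more seed files give more, shorter map sub-jobs.
# The failing run reported blocksize 983465750 and seed_cutfiles 5.
blocksize = 500m
seed_cutfiles = 10
sort_options = -m 50g -t 30
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
EOF
```

Whatever value is passed to --time must not exceed the partition's maximum, otherwise sbatch will reject or hold the sub-jobs.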
DaniPaulo commented 5 months ago

Hi @wgallin . I'm still trying to figure out how to run NextDenovo in a HPC environment using SLURM. Would you be able to share your script.slurm.sh with me?

wgallin commented 5 months ago

Hi,

So I did figure out why I was having a problem, and worked around it.

The basic problem was that when running in Grid mode, SLURM allows system administrators to set a default wall time for jobs that are submitted without an explicit wall time value.

In my case about 10% of jobs in one step ran over that time, so the whole job crashed because these jobs would not complete.

The solution that I used was to run the job in LOCAL mode on a single node with 32 cpus, 256G RAM and a wall time that turned out to be much longer than the job actually took (I requested 7 days but the job finished in less than 4 days).

If I had wanted to run the job in Grid mode I would have needed to be able to set the wall times for at least some of the individual sub-jobs, but I could not find a way to do that in the submission script.

So I guess the solution to my problem would have been to allow the submission script or the configuration file to feed a user-defined wall-time (and probably memory allocation) to the individual sub-jobs that the parent job spawns onto the grid.
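
For anyone trying to confirm the same diagnosis, here is a minimal sketch of how the timed-out sub-jobs and the partition's default wall time could be inspected. It assumes Slurm accounting (`sacct`) is enabled on the cluster; the helper name `timed_out_jobs` is made up for this example, and the start date in the usage comment should be replaced with your own.

```shell
# Print JobID, Elapsed and Timelimit for jobs whose State is TIMEOUT.
# Reads pipe-delimited `sacct --parsable2` output (with header) on stdin;
# expects the columns JobID|State|Elapsed|Timelimit in that order.
timed_out_jobs() {
  awk -F'|' 'NR > 1 && $2 == "TIMEOUT" { print $1, $3, $4 }'
}

# Usage on the cluster (these two commands need Slurm to be installed):
#   sacct -u "$USER" -S 2024-03-30 -X --parsable2 \
#         --format=JobID,State,Elapsed,Timelimit | timed_out_jobs
#   scontrol show partition | grep -Eo 'PartitionName=[^ ]+|DefaultTime=[^ ]+|MaxTime=[^ ]+'
```

If the Timelimit column matches the partition's DefaultTime, the sub-jobs were submitted without an explicit wall time, which is the situation described above.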

Warren Gallin


DaniPaulo commented 5 months ago

Hi @wgallin,

Thanks for your response. Let's see if I understand. So basically, you set up your script.slurm.sh to:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 32
#SBATCH --mem 256G
#SBATCH --time 7-00:00:00

# MODULES
module load nextdenovo

# MAIN
nextDenovo run.cfg

And your run.cfg to use local mode, one parallel job, and -t / -p set to 32:

[General]
job_type = local
parallel_jobs = 1

[correct_option]
sort_options = -t 32
minimap2_options_raw = -t 32
pa_correction = 3
correction_options = -p 32

[assemble_option]
minimap2_options_cns = -t 32

Could you please verify this? I appreciate your help, Dani.