Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

processes have been killed by the cgroup out-of-memory handler. #48

Closed (biowackysci closed this issue 4 years ago)

biowackysci commented 4 years ago

When I run the pipeline on a 206.442627 input fastq file, the progress gets stuck at step `/01.raw_align/02.raw_align.sh.work/`. The error reported was:

```
slurmstepd: error: Detected 1 oom-kill event(s) in step 1481924.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

My run.cfg file is:

```ini
[General]
job_type = slurm # here we use SLURM to manage jobs
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 22
input_type = raw
input_fofn = /group/pasture/Saila/NextDenovo/smartdenovo.input.fofn # input file
workdir = /group/pasture/Saila/NextDenovo

[correct_option]
read_cutoff = 1k
seed_cutoff = 20000 # the recommended minimum seed length
blocksize = 5g
pa_correction = 5
seed_cutfiles = 5
sort_options = -m 50g -t 30 -k 50
minimap2_options_raw = -x ava-ont -t 8
correction_options = -p 30
cluster_options = --cpus-per-task={cpu} --mem-per-cpu={vf}

[assemble_option]
random_round = 100
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1
```

I am not sure if I need to increase the memory, and if that's the case, could you please suggest by how much? My cluster uses SLURM.

Thanks S

moold commented 4 years ago

Hi, our tests show a typical minimap2-nd task usually consumes about 40g of memory, but you can use the `-I` parameter to reduce the maximum memory requirement. In addition, you can also adjust `-t` in `minimap2_options_raw` and `minimap2_options_cns` to adapt the maximum number of sub-jobs running on each node simultaneously.
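For example, a sketch of how those options could be adjusted in `run.cfg` (the `-I 2g` batch size and `-t 8` thread count here are illustrative values, not tested recommendations):

```ini
[correct_option]
# -I caps how many target bases minimap2 loads into its index per batch;
# a smaller value lowers peak memory at some cost in speed (2g is illustrative)
minimap2_options_raw = -x ava-ont -t 8 -I 2g

[assemble_option]
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17 -I 2g
```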

biowackysci commented 4 years ago

Hello again, I tried the modified options `minimap2_options_raw = -x ava-ont -t 32 -I 100` and `minimap2_options_cns = -x ava-ont -t 32 -k17 -w17`, but now the pipeline stops at `01.raw_align/02.raw_align.sh.work/raw_align000/` with this error:

```
slurmstepd: error: JOB 1734386 ON comp054 CANCELLED AT 2020-02-08T21:42:13 DUE TO TIME LIMIT
```

Is this something I can modify in the script, or should I just restart the job?

Also, is there an option to keep only reads above a certain length?

Thanks so much in advance for your help.

S

moold commented 4 years ago

It seems that your system limits the running time of a job, so you can reduce `blocksize` and increase `seed_cutfiles` to reduce the size of each subfile and speed up each mapping task. The total running time may be longer, though.
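A sketch of what that change could look like in `run.cfg` (the 3g/10 values are illustrative; they match what was tried later in this thread):

```ini
[correct_option]
blocksize = 3g      # smaller blocks -> shorter runtime per mapping sub-job
seed_cutfiles = 10  # more, smaller seed files -> more but shorter tasks
```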

biowackysci commented 4 years ago

Thanks, I will try this now and update soon with the outcome.

Regards, S

biowackysci commented 4 years ago

Hello again, I modified the config with a reduced `blocksize` of 3g and increased `seed_cutfiles` to 10, but it still stalls at:

```
[ERROR] 2020-02-21 06:48:13,170 /group/pasture/Saila/NextDenovo/01.raw_align/02.raw_align.sh.work/raw_align159/nextDenovo.sh.e
slurmstepd: error: JOB 1809950 ON comp035 CANCELLED AT 2020-02-21T07:00:20 DUE TO TIME LIMIT
```

Should I reduce the blocksize further? Can you please advise?

Thanks
S

moold commented 4 years ago

Yes, or contact your system administrator for help; your system limits the maximum running time of a job.

lifan18 commented 2 years ago

Dear Dr. Hu,

Thank you for your replies above. I have the same "out-of-memory" problem. I tried to reduce `-t` in `minimap2_options_raw` and `minimap2_options_cns` to `-t 8`, but it still fails with this error.

See my run.cfg parameters:

```ini
[correct_option]
read_cutoff = 1k
genome_size = 3.23g # estimated genome size
sort_options = -m 20g -t 8
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
```

See the error message:

```
minimap2-nd --step 1 --dual=yes -t 8 -x ava-pb /01.raw_align/input.seed.005.2bit /01.raw_align/input.part.004.2bit -o input.seed.005.2bit.163.ovl
slurmstepd-c03b06n04: error: Detected 1 oom-kill event(s) in step 593045.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

The SLURM cluster I use has enough memory (q_fat: 72 CPUs / 1.5T memory / 2 nodes), but the default normal queue has limited memory, and the subtasks submitted by paralleltask always use the default normal queue even when I submit the main job to q_fat. I therefore tried to use `submit` to specify the queue for paralleltask, for example `submit = sbatch -q q_fat`, but that seems to be the wrong way to specify the queue.

I also tried the approach described in https://nextdenovo.readthedocs.io/en/latest/FAQ.html#how-to-optimize-parallel-computing-parameters.

Could you tell me how to specify the queue correctly, or give some suggestions for this out-of-memory issue?

Thank you very much!

Regards,

LF

moold commented 2 years ago

Two solutions:

  1. use `submit = sbatch -q q_fat --cpus-per-task={cpu} --mem-per-cpu={mem} -o {out} -e {err} {script}`
  2. set `job_type = local` and submit the main job to q_fat
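A sketch of how the first option would sit in `run.cfg` (only the `submit` line comes from the suggestion above; the surrounding values are illustrative):

```ini
[General]
job_type = slurm
# custom submit template for paralleltask; {cpu}, {mem}, {out}, {err} and
# {script} are placeholders that paralleltask fills in at submission time
submit = sbatch -q q_fat --cpus-per-task={cpu} --mem-per-cpu={mem} -o {out} -e {err} {script}
```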

lifan18 commented 2 years ago

Thank you! I will update you once it is done.

lifan18 commented 2 years ago

Thank you! This issue was solved with `job_type = slurm` and `submit = sbatch -p q_fat --cpus-per-task=1 --mem-per-cpu=64g -o {out} -e {err} {script}` ;D