Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

What is the memory requirement for nodes when running a huge input fastq? #34

zengpeng2012 closed this issue 4 years ago

zengpeng2012 commented 4 years ago

Hi, Hu

When I run the pipeline on a 2.8 Tb input fasta file, progress gets stuck at step 02.cns_align/01.get_cns.sh.work with an out-of-memory error. How can I carry on with the job? Or do I need nodes with more memory? My cluster runs SLURM, and each node has 192 GB of memory and 36 CPUs.
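For reference, per-node memory (in MB) and CPU counts on a SLURM cluster can be confirmed with a standard `sinfo` query; this is a generic sketch, nothing NextDenovo-specific:

```
# List each node with its memory (MB) and CPU count
sinfo -N -o "%N %m %c"
```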

Below are the config file details:

```
[General]
job_type = slurm
job_prefix = Pp
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = no
rerun = 3
parallel_jobs = 50
input_type = raw
input_fofn = ./input.fofn
workdir = ./01_rundir
usetempdir = /tmp/test
nodelist = avanode.list.fofn
cluster_options = -p q_cn -J nextDenovo -o nextDenovo.out -N 1 -n 1 -c 19

[correct_option]
read_cutoff = 1k
seed_cutoff = 15k
blocksize = 1g
pa_correction = 50
seed_cutfiles = 50
sort_options = -m 20g -t 20 -k 50
minimap2_options_raw = -x ava-ont -t 30
correction_options = -p 15

[assemble_option]
random_round = 10
minimap2_options_cns = -x ava-ont -t 30 -k17 -w17
nextgraph_options = -a 1
```
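With this saved as a config file (assumed here to be named `run.cfg`; the filename is arbitrary), the pipeline is launched in the standard NextDenovo way:

```
nextDenovo run.cfg
```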

moold commented 4 years ago

Could you provide the error log file?

zengpeng2012 commented 4 years ago

error log:

```
$ cat Pp.sh.e
hostname
```

moold commented 4 years ago

Try using the parameter: `correction_options = -p 15 -dbuf`
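That is, a minimal sketch of the change against the `[correct_option]` section of the config above:

```
[correct_option]
correction_options = -p 15 -dbuf   # per the reply below, -dbuf reduces memory use at the cost of output speed
```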

zengpeng2012 commented 4 years ago

It works; memory usage is now under 3 GB, but the output is written very slowly.

moold commented 4 years ago

Use the usetempdir option, or remove the -dbuf option; either will speed it up.
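In config terms, the two alternatives, sketched against the config above (the `/tmp/test` path is the one already used there):

```
# Alternative 1: keep -dbuf, but point temporary files at fast node-local storage
[General]
usetempdir = /tmp/test

# Alternative 2: drop -dbuf and accept the higher memory footprint
[correct_option]
correction_options = -p 15
```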