Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Running out of disk space #212

Closed oskesr closed 2 weeks ago

oskesr commented 2 weeks ago

Describe the bug The SSD holding the working directory has 1.1 TB free. Over time it fills up until 0 bytes remain. The program does not report an error, but its status stops updating.
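Because the run stalls without an error once the disk is full, it can help to check free space on the working directory's filesystem independently of the assembler. The following is a minimal sketch using only the Python standard library; the function name `free_space_report` and the 50 GB threshold are illustrative assumptions, not part of NextDenovo.

```python
import shutil


def free_space_report(path, min_free_bytes):
    """Return (free_bytes, ok) for the filesystem that holds `path`.

    `ok` is True while at least `min_free_bytes` remain available.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free, usage.free >= min_free_bytes


# Hypothetical threshold: flag the run when less than 50 GB remains.
# Replace "." with the assembler's working directory.
free, ok = free_space_report(".", 50 * 10**9)
print(f"free: {free / 1e9:.1f} GB, above threshold: {ok}")
```

Calling this periodically (for example from a cron job or a loop with `time.sleep`) would surface the problem before the disk reaches 0 bytes.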

System Pop!_OS 22.04 LTS 128 threads, 512GB RAM

GCC gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

Python 3.11.5

NextDenovo 2.5.2

pid.log.info

[20841 INFO] 2024-06-11 14:17:52 NextDenovo start...
[20841 INFO] 2024-06-11 14:17:52 version:2.5.2 logfile:pid20841.log.info
[20841 WARNING] 2024-06-11 14:17:52 Re-write workdir
[20841 INFO] 2024-06-11 14:17:52 skip mkdir: /out_nextden_v2/01_rundir
[20841 INFO] 2024-06-11 14:17:52 skip mkdir: /out_nextden_v2/01_rundir/01.raw_align
[20841 INFO] 2024-06-11 14:17:52 skip mkdir: /out_nextden_v2/01_rundir/02.cns_align
[20841 INFO] 2024-06-11 14:17:52 skip mkdir: /out_nextden_v2/01_rundir/03.ctg_graph
[20841 INFO] 2024-06-11 14:17:52 skip step: db_stat
[20841 INFO] 2024-06-11 14:17:52 updated options:
rerun: 3
task: all
deltmp: 1
rewrite: 1
read_type: ont
job_type: local
input_type: raw
read_cutoff: 1k
genome_size: 1g
parallel_jobs: 6
seed_depth: 45.0
pa_correction: 3
seed_cutfiles: 3
seed_cutoff: 21711
blocksize: 12538658079
ctg_cns_options: -p 20
nextgraph_options: -a 1
job_prefix: nextDenovoDupAll
sort_options: -m 60g -t 20 -k 40
minimap2_options_map: -x map-ont
minimap2_options_raw: -x ava-ont -t 8 --minlen 1000
workdir: /out_nextden_v2/01_rundir
input_fofn: /out_nextden_v2/input.fofn
correction_options: -p 20 -max_lq_length 10000 -r ont -min_len_seed 10855
minimap2_options_cns: -t 8 -x ava-ont -k 17 -w 17 --minlen 2000 --maxhan1 5000
raw_aligndir: /out_nextden_v2/01_rundir/01.raw_align
cns_aligndir: /out_nextden_v2/01_rundir/02.cns_align
ctg_graphdir: /out_nextden_v2/01_rundir/03.ctg_graph
[20841 INFO] 2024-06-11 14:17:52 summary of input data:
file: /out_nextden_v2/01_rundir/01.raw_align/input.reads.stat
[Read length stat]
Types    Count (#)    Length (bp)
N10      129480       58658
N20      316017       44887
N30      557075       34866
N40      869037       26753
N50      1280672      19998
N60      1841548      14425
N70      2638736      9881
N80      3868666      5984
N90      6101275      3592

Types       Count (#)    Bases (bp)     Depth (X)
Raw         11969839     96270451854    96.27
Filtered    2218410      1119819538     1.12
Clean       9751429      95150632316    95.15

*Suggested seed_cutoff (genome size: 1000.00Mb, expected seed depth: 45, real seed depth: 45.00): 21711 bp
[20841 INFO] 2024-06-11 14:17:52 skip step: db_split
[20841 INFO] 2024-06-11 14:17:52 Total jobs: 18
[20841 INFO] 2024-06-11 14:17:52 Submitted jobID:[20842] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align01/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-11 14:17:53 Submitted jobID:[20848] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align02/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-11 14:17:53 Submitted jobID:[20857] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align03/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-11 14:17:54 Submitted jobID:[20866] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align04/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-11 14:17:54 Submitted jobID:[20875] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align05/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-11 14:17:55 Submitted jobID:[20884] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align06/nextDenovoDupAll.sh] in the local_cycle.
[20841 INFO] 2024-06-13 07:28:22 Submitted jobID:[825257] jobCmd:[/out_nextden_v2/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align07/nextDenovoDupAll.sh] in the local_cycle.
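The log shows 7 of 18 raw_align jobs submitted, with a long pause between the sixth and seventh submissions. A rough way to quantify such gaps is to pull the timestamps out of the `Submitted jobID` entries. This is only a sketch: the regular expression is written against the entry format visible in this log, the helper name `submissions` is made up for illustration, and the `jobCmd` paths in the sample string are abbreviated to `...`.

```python
import re
from datetime import datetime

# Pattern written against the "Submitted jobID" entries in the log above.
SUBMIT_RE = re.compile(
    r"\[\d+ INFO\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Submitted jobID:\[(\d+)\]"
)


def submissions(log_text):
    """Return [(timestamp, job_id), ...] for every 'Submitted jobID' entry."""
    return [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), int(job_id))
        for ts, job_id in SUBMIT_RE.findall(log_text)
    ]


# Sixth and seventh submissions from the log; jobCmd paths abbreviated.
log_text = (
    "[20841 INFO] 2024-06-11 14:17:55 Submitted jobID:[20884] "
    "jobCmd:[...] in the local_cycle. "
    "[20841 INFO] 2024-06-13 07:28:22 Submitted jobID:[825257] "
    "jobCmd:[...] in the local_cycle."
)

subs = submissions(log_text)
gap = subs[-1][0] - subs[-2][0]  # time between the last two submissions
print(len(subs), "submissions; last gap:", gap)
```

For these two entries the gap comes to a little over 41 hours. If a new job is only submitted when one of the 6 parallel slots frees up (an assumption about the scheduler, not something the log states), that gap approximates how long the first raw_align job took.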

Config file

[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovoDupAll
task = all # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 6 # number of tasks used to run in parallel
input_type = raw # raw, corrected
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 1g # estimated genome size
sort_options = -m 60g -t 20
minimap2_options_raw = -x ava-ont -t 8 --minlen 1000
pa_correction = 3 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 20

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
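The `pa_correction` comment in the config gives a memory rule of thumb: each correction task needs roughly TOTAL_INPUT_BASES/4 bytes. A quick worked calculation for this data set is sketched below. It assumes that TOTAL_INPUT_BASES corresponds to the "Raw" base count reported in the log (the config comment does not say whether raw or clean bases are meant), and the helper name is illustrative.

```python
# "Raw" bases from the input summary in the log (assumed to be TOTAL_INPUT_BASES).
total_input_bases = 96_270_451_854
pa_correction = 3  # from the [correct_option] section


def correction_memory_bytes(total_bases, tasks):
    """Apply the ~TOTAL_INPUT_BASES/4 bytes-per-task rule from the config comment."""
    per_task = total_bases / 4
    return per_task, per_task * tasks


per_task, total = correction_memory_bytes(total_input_bases, pa_correction)
print(f"{per_task / 1e9:.1f} GB per task, {total / 1e9:.1f} GB for {pa_correction} tasks")
```

Under that assumption the three correction tasks together would need on the order of 72 GB, well below the 512 GB of RAM on this machine, so memory alone would leave room to raise `pa_correction`.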

moold commented 2 weeks ago

There is no solution other than increasing the storage available to the working directory.

oskesr commented 2 weeks ago

Thanks for the clarification. Could you suggest how much storage might be needed? Is there a rule of thumb, for example that x input bases require 2x bytes of storage?

Additionally, would you recommend any parameter changes to use the threads more effectively?

moold commented 2 weeks ago
  1. Normally, 1 TB of storage is sufficient to assemble a 1 Gb genome, but the actual requirement depends on the genome's repetitive sequence content and the number of input bases, so it is hard to say.
  2. See here to optimize parallel computing parameters.
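As a rough illustration of the thread question, the settings in the posted config can be multiplied out per stage. This sketch assumes that each concurrently running job uses exactly the thread count given in its options (`-t` for minimap2, `-p` for correction); real utilization may differ, and the helper name `peak_threads` is hypothetical.

```python
def peak_threads(concurrent_jobs, threads_per_job):
    """Upper bound on threads in use if every job runs at its configured count."""
    return concurrent_jobs * threads_per_job


available = 128  # threads on the reporter's machine

# raw_align stage: parallel_jobs = 6, minimap2_options_raw = ... -t 8
raw_align = peak_threads(6, 8)

# correction stage: pa_correction = 3, correction_options = -p 20
correction = peak_threads(3, 20)

for stage, used in [("raw_align", raw_align), ("correction", correction)]:
    print(f"{stage}: {used} of {available} threads ({used / available:.0%})")
```

With these values the alignment stage would use at most 48 threads and the correction stage at most 60, so under the stated assumption more than half of the 128 threads stay idle in both stages.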