Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Assembly is running around a month and going strong - or is it stalled? #170

Open 000generic opened 1 year ago

000generic commented 1 year ago

Describe the bug

Unsure if assembly of an octopus (human-sized) genome with 43x seed depth is active or stalled after running almost a month with 500 GB RAM, 60 CPUs, and 2 TB disk.

Error message

There is no error, but a month ago I used NextDenovo on this machine to successfully assemble a sponge genome 1/10 the size overnight, whereas the current octopus genome is only 10x larger but has been running far longer.

The 8 running jobs, each using ~7 CPUs, are cycling between 3.8 and 5.6 GB RAM over hours, so it seems like it could be active: it is using a very steady 87% of all CPUs on the machine and 40% of memory. However, Glances and top indicate a stalled status of S while running the minimap2-nd step (see attached screenshots). Every once in a while one of the jobs will drop from 7 CPUs to 1 for minutes to maybe an hour, but then return to 7.

Previous jobs unrelated to NextDenovo sometimes have a status of S but finish no problem - so I wasn't sure how critical the status is - it is a very steady S.
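One way to tell a sleeping-but-working process from a stalled one is to compare its accumulated CPU time between two samples. On Linux, S is the interruptible-sleep state, and the state shown for a multithreaded process is typically that of its main thread, so worker threads can be busy while the process still shows S. The sketch below parses a /proc/<pid>/stat line and measures CPU seconds consumed over an interval; the function names and the 60-second default are illustrative assumptions, not part of NextDenovo.

```python
import os
import time
from pathlib import Path


def parse_proc_stat(line: str) -> dict:
    """Extract pid, state and total CPU ticks from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ')' instead of on whitespace alone.
    head, _, tail = line.rpartition(")")
    fields = tail.split()
    # After the command name: fields[0] is the state (stat field 3),
    # fields[11] is utime (field 14) and fields[12] is stime (field 15).
    return {
        "pid": int(head.split("(", 1)[0]),
        "state": fields[0],
        "cpu_ticks": int(fields[11]) + int(fields[12]),
    }


def cpu_seconds_used(pid: int, interval: float = 60.0) -> float:
    """Return CPU seconds the process consumed during `interval` seconds."""
    stat_path = Path(f"/proc/{pid}/stat")
    before = parse_proc_stat(stat_path.read_text())["cpu_ticks"]
    time.sleep(interval)
    after = parse_proc_stat(stat_path.read_text())["cpu_ticks"]
    # Clock ticks per second is system dependent; ask the OS for it.
    return (after - before) / os.sysconf("SC_CLK_TCK")
```

If cpu_seconds_used(pid) returns a value well above zero for a job's pid, the process is consuming CPU even though its displayed state is S.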

I previously restarted the job after 2 weeks, given it was more than 10x longer in run time than sponge at that point - but restart went almost all the way back to the beginning, as there is no output / update from the minimap2-nd step. And I did a fresh start with a few short (minute or less) initial restarts before the current month-long run - so fresh from the initial 2-week run.

The last pid log readout indicates 36 jobs for cns_align.sh - with the largest job number of the 8 jobs at the start being 59306 (see below). Within a day or so the largest job was 59311 (see screenshot) - suggesting nextDenovo is on the last round of jobs to reach the allotted 36 - but then things have simply stayed here for weeks.

Here are details on this:

[59245 INFO] 2023-02-24 12:04:29 skip step: db_split
[59245 INFO] 2023-02-24 12:04:29 skip step: raw_align
[59245 INFO] 2023-02-24 12:04:29 skip step: sort_align
[59245 INFO] 2023-02-24 12:04:29 skip step: seed_cns
[59245 INFO] 2023-02-24 12:04:29 seed_cns finished, and final corrected reads file:
[59245 INFO] 2023-02-24 12:04:29 /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
[59245 INFO] 2023-02-24 12:04:29 Total jobs: 36
[59245 INFO] 2023-02-24 12:04:29 Submitted jobID:[59246] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align01/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:29 Submitted jobID:[59252] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align02/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:30 Submitted jobID:[59261] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align03/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:30 Submitted jobID:[59270] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align04/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:31 Submitted jobID:[59279] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align05/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:31 Submitted jobID:[59288] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align06/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:32 Submitted jobID:[59297] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align07/nextDenovo.sh] in the local_cycle.
[59245 INFO] 2023-02-24 12:04:32 Submitted jobID:[59306] jobCmd:[/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align08/nextDenovo.sh] in the local_cycle.
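Progress through a step like this can be gauged by comparing the "Total jobs" figure in the log with the number of completion markers on disk. The sketch below is a minimal illustration, assuming the log line formats shown above and that each finished job leaves a nextDenovo.sh.done file in its own subdirectory (as the job script's final touch command suggests). The function names are hypothetical.

```python
import re
from pathlib import Path


def summarize_submissions(log_text: str) -> tuple:
    """Return (total_jobs, submitted_job_ids) parsed from the pid log text."""
    total_match = re.search(r"Total jobs: (\d+)", log_text)
    total = int(total_match.group(1)) if total_match else None
    job_ids = [int(j) for j in re.findall(r"Submitted jobID:\[(\d+)\]", log_text)]
    return total, job_ids


def count_done_markers(work_dir: str) -> int:
    """Count job subdirectories under work_dir that hold a .done marker."""
    # Assumed layout: <work_dir>/<job_name>/nextDenovo.sh.done
    return sum(1 for _ in Path(work_dir).glob("*/nextDenovo.sh.done"))
```

Calling count_done_markers on the 02.cns_align.sh.work directory and comparing the result with the total from summarize_submissions gives a rough finished-versus-total count for the step.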

RAM usage is 40% and CPU usage is 87%. The general setup is similar to what I did for sponge, but rescaled to the new genome size. I wonder if my calculations might have been off and it doesn't have the resources to output or finish at this point...?

Genome characteristics

Genome size is estimated at around 3 Gb, with high repeat content and likely high heterozygosity.

Input data

Total base count, sequencing depth, average/N50 read length...

rerun: 3
task: all
deltmp: 1
rewrite: 1
read_type: clr
job_type: local
input_type: raw
parallel_jobs: 8
read_cutoff: 15k
pa_correction: 7
seed_cutfiles: 7
seed_depth: 43.64
genome_size: 2.8g
seed_cutoff: 15001
blocksize: 11726373
job_prefix: nextDenovo
ctg_cns_options: -p 7
nextgraph_options: -a 1
sort_options: -m 70g -t 8 -k 38
minimap2_options_map: -x map-pb
minimap2_options_raw: -t 8 -x ava-pb
correction_options: -p 7 -max_lq_length 1000 -min_len_seed 7500
minimap2_options_cns: -t 7 -x ava-pb -k 17 -w 17 --minlen 1500 --maxhan1 5000
input_fofn: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/input.fofn
workdir: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly
raw_aligndir: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/01.raw_align
cns_aligndir: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align
ctg_graphdir: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/03.ctg_graph
[59245 INFO] 2023-02-24 12:04:29 summary of input data:
file: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/01.raw_align/input.reads.stat

[Read length stat]
Types   Count (#)   Length (bp)
N10     266015      39329
N20     608403      32855
N30     1007493     28710
N40     1458869     25618
N50     1961372     23141
N60     2515311     21068
N70     3122091     19277
N80     3783903     17704
N90     4503641     16293

Types      Count (#)   Bases (bp)      Depth (X)
Raw        28758338    245628751872    87.72
Filtered   23472855    123430622516    44.08
Clean      5285483     122198129356    43.64

*Suggested seed_cutoff (genome size: 2800.00Mb, expected seed depth: 45, real seed depth: 43.64): 15001 bp
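The depth column in the table above is total bases divided by the configured genome size (2.8g). The short sketch below reproduces the three reported values as a sanity check; the function and variable names are illustrative. In this table the Filtered and Clean rows sum to the Raw row, in both read count and bases.

```python
def depth(total_bases: int, genome_size_bp: float) -> float:
    """Sequencing depth = total bases / genome size."""
    return total_bases / genome_size_bp


GENOME_SIZE_BP = 2.8e9  # genome_size = 2.8g from the run configuration
BASES = {
    "Raw": 245628751872,
    "Filtered": 123430622516,
    "Clean": 122198129356,
}

for label, bases in BASES.items():
    print(f"{label}: {depth(bases, GENOME_SIZE_BP):.2f}x")
```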

Config file

Please paste the complete content of the Config file (run.cfg) to here.

[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovo
task = all # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 8 # number of tasks used to run in parallel
input_type = raw # raw, corrected
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = output/3-nextDenovo-assembly

[correct_option]
read_cutoff = 15k
genome_size = 2.8g # estimated genome size
sort_options = -m 70g -t 8
minimap2_options_raw = -t 8
pa_correction = 7 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 7

[assemble_option]
minimap2_options_cns = -t 7
nextgraph_options = -a 1

# see https://nextdenovo.readthedocs.io/en/latest/OPTION.html for a detailed introduction about all the parameters
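The pa_correction comment in the config gives a rule of thumb of about TOTAL_INPUT_BASES/4 bytes per correction task. The sketch below applies it to this run, assuming TOTAL_INPUT_BASES means the raw base count from the input summary (245628751872); that interpretation is an assumption, and the rule applies to the correction stage rather than to the cns_align step discussed in this thread.

```python
def correction_memory_bytes(total_input_bases: int, pa_correction: int) -> tuple:
    """Return (bytes per correction task, bytes for all parallel tasks)."""
    per_task = total_input_bases // 4  # rule of thumb from the config comment
    return per_task, per_task * pa_correction


# Assumption: raw bases from the input summary, pa_correction = 7 from run.cfg.
per_task, total = correction_memory_bytes(245_628_751_872, 7)
print(f"per task: {per_task / 1e9:.1f} GB, all tasks: {total / 1e9:.1f} GB")
```

Under that assumption the estimate is roughly 61 GB per task and about 430 GB for seven parallel tasks, which is close to the 500 GB available on this machine.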

Operating system

Which operating system and version are you using? You can use the command lsb_release -a to get it.

Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

GCC

What version of GCC are you using? You can use the command gcc -v to get it.

Salk :) gcc -v
Reading specs from /nadata/mnlsc/home/eedsinger/anaconda3/bin/../lib/gcc/x86_64-conda-linux-gnu/7.5.0/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/nadata/mnlsc/home/eedsinger/anaconda3/bin/../libexec/gcc/x86_64-conda-linux-gnu/7.5.0/lto-wrapper
Target: x86_64-conda-linux-gnu
Configured with: /home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/.build/x86_64-conda-linux-gnu/src/gcc/configure --build=x86_64-build_pc-linux-gnu --host=x86_64-build_pc-linux-gnu --target=x86_64-conda-linux-gnu --prefix=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/gcc_built --with-sysroot=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/gcc_built/x86_64-conda-linux-gnu/sysroot --enable-languages=c,c++,fortran,objc,obj-c++ --with-pkgversion='crosstool-NG 1.24.0.131_87df0e6_dirty' --enable-__cxa_atexit --disable-libmudflap --enable-libgomp --disable-libssp --enable-libquadmath --enable-libquadmath-support --enable-libsanitizer --enable-libmpx --with-gmp=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/.build/x86_64-conda-linux-gnu/buildtools --with-mpfr=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/.build/x86_64-conda-linux-gnu/buildtools --with-mpc=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/.build/x86_64-conda-linux-gnu/buildtools --with-isl=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/.build/x86_64-conda-linux-gnu/buildtools --enable-lto --enable-threads=posix --enable-target-optspace --enable-plugin --enable-gold --disable-nls --disable-multilib --with-local-prefix=/home/conda/feedstock_root/build_artifacts/ctng-compilers_1596267513165/work/gcc_built/x86_64-conda-linux-gnu/sysroot --enable-long-long --enable-default-pie
Thread model: posix
gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)

Python

What version of Python are you using? You can use the command python --version to get it.

Python 3.8.12

NextDenovo

What version of NextDenovo are you using? You can use the command nextDenovo -v to get it.

nextDenovo v2.5.0

[Attached screenshots: Screenshot 2023-03-20 at 9 14 21 PM, Screenshot 2023-03-20 at 9 12 32 PM]

Any suggestions would be greatly appreciated - NextDenovo did simply fantastic on sponge - just not sure what is going on now with octopus. Some sort of user error but I am just stuck as to what it might be at this point.

Thank you very much :) Eric

moold commented 1 year ago

Hi, could you paste the content of some files: /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align*/nextDenovo.sh.e to here?

000generic commented 1 year ago

Sure! Here is the last one (09 to 36 are like this one - only an sh script in the folder):

/scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align36/nextDenovo.sh

#!/bin/bash
set -xveo pipefail
hostname
cd /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align36
( time /nadata/mnlsc/home/eedsinger/software/nextdenovo/NextDenovo/bin/minimap2-nd -I 6G --step 2 -t 7 -x ava-pb -k 17 -w 17 --minlen 1500 --maxhan1 5000 /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/01.seed_cns.sh.work/seed_cns8/cns.fasta /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/01.seed_cns.sh.work/seed_cns8/cns.fasta -o cns.filt.dovt.ovl; )
touch /scratch2/eedsinger/projects/genomes/zanfona-5x-50x/octopus-sinensis/output/3-nextDenovo-assembly/02.cns_align/02.cns_align.sh.work/cns_align36/nextDenovo.sh.done

000generic commented 1 year ago

And here is the first one (01-08 are similar to this):

hostname

moold commented 1 year ago

Try increasing -k and -w in minimap2_options_cns, for example: minimap2_options_cns = -t 7 -k 31 -w 17

000generic commented 1 year ago

Ok! I'll kill things and restart fresh....

000generic commented 1 year ago

Actually - rather than starting totally fresh, I updated the run file and just deleted folders 2 and 3 to save a little time and see how things go with your update more quickly.

Given the setup, how long would you expect things to run? Just so I can know when it's going over.

moold commented 1 year ago

I don't know how long it will take, but you can try it first. There is no need to start fresh; just continue running from the breakpoint. You can also set -f 0.0004 in minimap2_options_cns to speed things up.

PS: check the value of mid_occ (for example, mid_occ = 4956) in the log files cns_align*/nextDenovo.sh.e; if it is less than 1000, I think it is acceptable.
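Checking mid_occ across many job logs can be scripted. The sketch below is a minimal illustration that assumes the value appears in the log text as "mid_occ = <number>", the form quoted above; the exact surrounding log format is not shown in this thread, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern, based only on the "mid_occ = 4956" form quoted in the thread.
MID_OCC_RE = re.compile(r"mid_occ\s*=\s*(\d+)")


def extract_mid_occ(log_text: str) -> list:
    """Return every mid_occ value found in one log's text."""
    return [int(value) for value in MID_OCC_RE.findall(log_text)]


def is_acceptable(values: list, threshold: int = 1000) -> bool:
    """True when at least one value was found and all are below the threshold."""
    return bool(values) and all(v < threshold for v in values)


def scan_logs(work_dir: str) -> dict:
    """Map each cns_align*/nextDenovo.sh.e log path to its mid_occ values."""
    return {
        str(path): extract_mid_occ(path.read_text(errors="replace"))
        for path in sorted(Path(work_dir).glob("cns_align*/nextDenovo.sh.e"))
    }
```

The 1000 threshold is the rule of thumb given in the comment above, not a documented limit.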

000generic commented 1 year ago

minimap2-nd started up - first 8 of 36 jobs again - mid_occ is now under 1000 at 427 - jobs are still running under status S but that might be ok.

hostname

000generic commented 1 year ago

Well - things haven't advanced past the first 8 jobs after several days now. It feels like it might be similar to before... Is there anything you might suggest I check or try?

moold commented 1 year ago

You can try increasing -k, -w, -f, --kn, and --wn, or setting --mode 0 or --mode 1, or --cn 1000 in minimap2_options_cns. Note that such parameter settings may produce inaccurate results; we have not tested them.

000generic commented 1 year ago

Good news - I left things running and one of the initial 8 jobs finished after 4 days and a second one after 5 - so two new jobs now running - and I'm hoping the next 6 will finish soon to advance through the remaining 26 jobs or so - currently they are at 850 CPU hours each. Seems like it will be 2-3 weeks to finish them all.

000generic commented 1 year ago

The remaining 6 jobs of the initial round finished at around 7 days / 1300 CPU hours per 7-CPU job, so the second of 4-5 rounds is now fully underway. I'm not sure if this is a normal timeframe for a ~human genome size and ~45x coverage with 60 CPUs and half a TB of RAM. I'm estimating 5 weeks for this stage of the pipeline, start to finish.
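The 5-week figure follows from simple round arithmetic: 36 jobs at 8 in parallel is 5 rounds, and at about 7 days per round that is 35 days. The sketch below makes the estimate explicit. It assumes jobs complete in lockstep rounds of similar duration, which is a simplification, since new jobs actually start as soon as any running job finishes.

```python
import math


def estimate_stage_days(total_jobs: int, parallel_jobs: int, days_per_round: float) -> tuple:
    """Return (number of rounds, estimated total days) for a batch of jobs."""
    rounds = math.ceil(total_jobs / parallel_jobs)
    return rounds, rounds * days_per_round


rounds, days = estimate_stage_days(total_jobs=36, parallel_jobs=8, days_per_round=7)
print(f"{rounds} rounds, about {days} days ({days / 7:.0f} weeks)")
```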

As long as it can finish, I'm very happy! If you have ideas for making it more efficient without going outside what is tested / known on your side for the parameters - I'd love to hear - but also I think you might have covered everything.

Thank you for your help on this!

moold commented 1 year ago

The running time is largely determined by genome complexity and input data size. For a ~human genome, it usually completes within 1-2 days. Obviously, the genome you assembled is highly repetitive (you can check this by k-mer spectrum using short reads), so you can try wtdbg2, which should be able to finish assembly very quickly.
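For readers unfamiliar with a k-mer spectrum: it is a histogram of how many distinct k-mers occur once, twice, three times, and so on in the reads. A heavy tail of high-multiplicity k-mers points to repeats. The toy sketch below shows the idea on plain strings; it is purely illustrative, since on real short-read data a dedicated k-mer counting tool would be used, and all names here are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k: int) -> Counter:
    """Map multiplicity -> number of distinct canonical k-mers with that count."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())
```

For example, kmer_spectrum(["ACGTACGT"], 4) reports two distinct canonical 4-mers seen twice and one seen once.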

000generic commented 1 year ago

Wow - so this is really running long already and still weeks to go.

I'm actually trying to improve on a wtdbg2 assembly - do you think NextDenovo is likely to offer improvement? It did great with the sponge data and was super fast. A different octopus with ONT reads took around 4-5 weeks a few months ago and it seemed reasonably good overall - very good considering the data going in I thought. I'll probably let it finish regardless out of curiosity at this point, as long as the machine isn't needed otherwise.

I can update how it goes! Thanks again.

moold commented 1 year ago

The assembly result is hard to predict, because the genome you are assembling is not typical and the default parameters may not be suitable. In any case, wait for this assembly task to finish first.