Closed RominaSSBatista closed 1 week ago
Could you paste the log content here?
Hi @moold, here is the .log file from this run. Thank you in advance for your time.
Best, Romina
Could you paste the content of the file /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/01.raw_align/01.db_split.sh.work/db_split1/nextDenovo.sh.e here?
#!/bin/bash
set -xveo pipefail
hostname
cd /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/01.raw_align/01.db_split.sh.work/db_split1
( time /home/users/rominab/softwares/NextDenovo/bin/seq_dump -f 10k -s 20000 -b 3g -n 12 -d /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/01.raw_align /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/input.fofn )
touch /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/01.raw_align/01.db_split.sh.work/db_split1/nextDenovo.sh.done
nextDenovo.sh.e, not nextDenovo.sh
I am afraid the nextDenovo.sh.e file was not generated. Below is the output of the tree command, run from the output directory:
$ tree
.
├── 01.raw_align
│   ├── 01.db_split.sh
│   └── 01.db_split.sh.work
│       └── db_split1
│           └── nextDenovo.sh
├── 02.cns_align
├── 03.ctg_graph
├── input.fofn
├── run2.cfg
└── run.cfg
5 directories, 5 files
Does that mean NextDenovo did not even run anything?
I would appreciate your thoughts on this issue.
Yes, have you tried clearing the directory and running again?
Also, you can set job_type = local and try again.
I just did that:
tree
.
├── 01.raw_align
│   ├── 01.db_split.sh
│   ├── 01.db_split.sh.work
│   │   └── db_split1
│   │       ├── nextDenovo.sh
│   │       ├── nextDenovo.sh.e
│   │       └── nextDenovo.sh.o
│   ├── input.part.001.2bit
│   ├── input.seed.001.2bit
│   ├── input.seed.002.2bit
│   ├── input.seed.003.2bit
│   ├── input.seed.004.2bit
│   ├── input.seed.005.2bit
│   ├── input.seed.006.2bit
│   ├── input.seed.007.2bit
│   ├── input.seed.008.2bit
│   ├── input.seed.009.2bit
│   ├── input.seed.010.2bit
│   ├── input.seed.011.2bit
│   └── input.seed.012.2bit
├── 02.cns_align
├── 03.ctg_graph
├── input.fofn
├── run2.cfg
└── run.cfg
I am also attaching the nextDenovo.sh.e file.
Dear @moold, my job seems to be working better now and is producing the expected outputs. I wonder how long this assembly will take to finish (as mentioned, I have 35x coverage after Guppy base calling, and the expected genome size is 2.7 Gb). I have up to 7 days on the queue I am currently using, and I allocated -n 24 --mem-per-cpu=5G. Are these resources reasonable? Best, Romina
After approximately 4 h my job got killed. Here is the log file it produced: pid85429.log.info.txt.
I am also attaching the .sh.e files mentioned at the end of the log:
sort_align02_nextDenovo.sh.e.txt
sort_align08_nextDenovo.sh.e.txt
sort_align12_nextDenovo.sh.e.txt
Running tree in my output directory now reports:
$ tree
217 directories, 1171 files
I would appreciate it if you could share your thoughts again.
Best, Romina Batista
The machine does not have enough RAM, so all subtasks were killed. You can run the failed tasks manually with a command such as:
sh sort_align02_nextDenovo.sh
After you have manually run all the failed tasks, you can resume the main task.
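A rough way to list which tasks still need a manual run is sketched below. It assumes that each task lives in its own directory with a script named nextDenovo.sh (as with job_prefix = nextDenovo in this thread) and that a finished task leaves a nextDenovo.sh.done marker next to it, as the `touch` at the end of the task script pasted earlier suggests. The function name `unfinished_tasks` is hypothetical and not part of NextDenovo.

```python
from pathlib import Path


def unfinished_tasks(workdir):
    """Return task scripts under workdir that have no '.done' marker beside them."""
    pending = []
    # rglob matches files named exactly 'nextDenovo.sh', not '.sh.e' or '.sh.o'.
    for script in sorted(Path(workdir).rglob("nextDenovo.sh")):
        # Assumption: the task script touches '<script>.done' as its last step,
        # so a missing marker is taken to mean the task did not finish.
        done_marker = script.with_name(script.name + ".done")
        if not done_marker.exists():
            pending.append(script)
    return pending
```

Calling `unfinished_tasks("/path/to/workdir")` and running `sh` on each returned script, from that script's own directory, would mirror the manual procedure described above.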
Hi @moold, thank you for your reply. I ran all the failed tasks interactively from my terminal on the HPC I am working on; it was very straightforward and fast. I am now wondering how I should continue the main task. I used the following sbatch script to submit my job:
#!/bin/bash
###########################################
############## Romina Batista #################
############ r.d.s.d.s.batista@salford.ac.uk ########
################### 2024 ###################
###########################################
### 3. Run NextDenovo for a de novo assembly
### https://nextdenovo.readthedocs.io/en/latest/QSTART.html
### De Novo Assembly Project from Nanopore P2Solo 35x cov
#############################################
# I used this GitHub issue to help
# tune some parameters
# https://github.com/Nextomics/NextDenovo/issues/170
# and I have been discussing the use with the developer here:
# https://github.com/Nextomics/NextDenovo/issues/210
##############################################
#SBATCH --partition=long-serial
#SBATCH --time=168:00:00
#SBATCH -n 24
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=NextDeNovo
#SBATCH --error=/gws/nopw/j04/rotcotm/rominab/projects/error/assemb_nextdenov_9jun24.err
#SBATCH --output=/gws/nopw/j04/rotcotm/rominab/projects/error/assemb_nextdenov_9jun24.out
source /home/users/rominab/VirEnvPy3/bin/activate
NextDeNovo_dir='/home/users/rominab/software/NextDenovo'
config_dir='/gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly'
$NextDeNovo_dir/nextDenovo $config_dir/run2.cfg
where the run2.cfg file looks like this:
[General]
job_type = local # I set up as local, as a recommendation from the developer
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun =
parallel_jobs = 12 # number of tasks used to run in parallel
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/input.fofn
workdir = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly
submit = sbatch -p long-serial --cpus-per-task=24 --mem-per-cpu=5g -o /gws/nopw/j04/rotcotm/rominab/projects/error/nextdenovo.out -e /gws/nopw/j04/rotcotm/rominab/projects/error/nextdenovo.out /gws/nopw/j04/rotcotm/rominab/projects/scripts/assembly_NextDeNovo_test.sh
[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2.7g #estimated genome for Titi-Monkeys
blocksize = 3g
seed_cutfiles = 10
sort_options = -m 20g -t 30 -k 40
minimap2_options_raw = -x ava-pb -t 80
pa_correction = 12 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 12
[assemble_option]
random_round = 50
minimap2_options_cns = -t 12 -k 31 -w 17
minimap2_options_map = -t 12
nextgraph_options = -a 1
My main question is: if I use the same sbatch file to resubmit, will NextDenovo be able to recognise where it stopped? Should I adjust any line in my run2.cfg file?
Maybe
task = ??? # 'all', 'correct', 'assemble'
rewrite = ??? # yes/no
Many thanks for all your attention and time with this issue.
Best, Romina
Thank you! It looks like NextDenovo crashed again in the next step: pid193698.log.info
Here is one of the .sh.e files mentioned at the end of the log file: nextDenovo.sh.e
Can I fix this as I did before, by manually running the tasks reported at the end of the log file?
Thank you, Romina
Yes
Dear @moold, as mentioned previously, I ran the first one manually and got this at the end of the stdout messages in my terminal:
[909 INFO] 2024-06-19 05:09:41 Start a cns worker in 909 from parent 31397
[945 INFO] 2024-06-19 05:09:45 Start a cns worker in 945 from parent 31397
[955 INFO] 2024-06-19 05:09:50 Start a cns worker in 955 from parent 31397
[962 INFO] 2024-06-19 05:09:53 Start a cns worker in 962 from parent 31397
[978 INFO] 2024-06-19 05:09:57 Start a cns worker in 978 from parent 31397
[985 INFO] 2024-06-19 05:09:58 Start a cns worker in 985 from parent 31397
[993 INFO] 2024-06-19 05:09:59 Start a cns worker in 993 from parent 31397
[1009 INFO] 2024-06-19 05:10:03 Start a cns worker in 1009 from parent 31397
[1025 INFO] 2024-06-19 05:10:07 Start a cns worker in 1025 from parent 31397
[1029 INFO] 2024-06-19 05:10:08 Start a cns worker in 1029 from parent 31397
[1050 INFO] 2024-06-19 05:10:11 Start a cns worker in 1050 from parent 31397
[31612 INFO] 2024-06-19 05:10:15 Start a cns worker in 31612 from parent 31397
[1060 INFO] 2024-06-19 05:10:15 Start a cns worker in 1060 from parent 31397
[1063 INFO] 2024-06-19 05:10:16 Start a cns worker in 1063 from parent 31397
[1070 INFO] 2024-06-19 05:10:16 Start a cns worker in 1070 from parent 31397
[1074 INFO] 2024-06-19 05:10:17 Start a cns worker in 1074 from parent 31397
[1083 INFO] 2024-06-19 05:10:20 Start a cns worker in 1083 from parent 31397
[1084 INFO] 2024-06-19 05:10:20 Start a cns worker in 1084 from parent 31397
[1092 INFO] 2024-06-19 05:10:24 Start a cns worker in 1092 from parent 31397
[1093 INFO] 2024-06-19 05:10:24 Start a cns worker in 1093 from parent 31397
[2109 INFO] 2024-06-19 05:12:57 Start a cns worker in 2109 from parent 31397
[2111 INFO] 2024-06-19 05:12:57 Start a cns worker in 2111 from parent 31397
nextDenovo.sh: line 5: 31397 Killed /home/users/rominab/VirEnvPy3/bin/python /home/users/rominab/softwares/NextDenovo/lib/nextcorrect.py -f /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/02.cns_align//01.seed_cns.input.idxs -i /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly/01.raw_align/03.sort_align.sh.work/sort_align01/input.seed.001.sorted.ovl -p 12 -min_len_seed 10000 -max_lq_length 1000 -r ont -o cns.fasta
The cns.fasta file is indeed not empty, so I am assuming the task finished successfully. Should I worry about this "Killed" message? Below are the files generated at this step for this specific task, with their sizes:
[rominab@sci2 seed_cns01]$ tree
.
├── cns.fasta
├── cns.fasta.idx
├── nextDenovo.sh
├── nextDenovo.sh.e
└── nextDenovo.sh.o
0 directories, 5 files
and:
[rominab@sci2 seed_cns01]$ du -sh *
775M cns.fasta
795K cns.fasta.idx
4.0K nextDenovo.sh
70K nextDenovo.sh.e
4.0K nextDenovo.sh.o
I would appreciate, again, if you share your thoughts. Thank you, Romina
Yes. If there is a new file "nextDenovo.sh.done", the task has completed. The machine does not have enough RAM, so you can try reducing correction_options = -p 12 to correction_options = -p 5. If this doesn't work, you will need a machine with more RAM.
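For a rough sense of scale, the rule of thumb in the run2.cfg comment (each correction task needs about TOTAL_INPUT_BASES/4 bytes) can be worked through with the numbers given in this thread. This is only a back-of-the-envelope sketch: it approximates the total input as genome size times coverage, ignores the effect of read_cutoff filtering, and assumes all pa_correction tasks hold their memory at the same time. The function name is hypothetical.

```python
def correction_memory_gb(genome_size_bp, coverage, parallel_tasks):
    """Estimate (per-task, total) memory in GB from the ~TOTAL_INPUT_BASES/4 rule."""
    total_input_bases = genome_size_bp * coverage  # approximation of the input size
    per_task_gb = total_input_bases / 4 / 1e9      # bytes -> GB (decimal)
    return per_task_gb, per_task_gb * parallel_tasks


# Values from this thread: 2.7 Gb genome, 35x coverage, pa_correction = 12.
per_task, total = correction_memory_gb(2.7e9, 35, 12)

# The original allocation was -n 24 --mem-per-cpu=5G, i.e. roughly 120 GB.
allocated_gb = 24 * 5

print(f"per task: {per_task:.1f} GB, all tasks: {total:.1f} GB, allocated: {allocated_gb} GB")
```

Under these assumptions the estimate is well above the original allocation, which is consistent with the tasks being killed for lack of memory.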
Dear @moold, I managed to run those tasks one by one using --mem=500G in my Slurm scripts (I requested high memory so that the HPC would not kill my job). I did not change correction_options = -p 12 to correction_options = -p 5. Adjusting the memory was enough to solve it.
The next step also got stuck, and again I finished all the tasks manually. The log file your software generates makes it easy to find the next step that needs fixing; I would strongly recommend adding this to your documentation. However, jobs submitted through SLURM did not behave as well as the same small tasks run interactively. Regardless, I finally managed to finish my assembly. Thank you very much for your support this week.
Just before closing this issue, I have a final question: is there any way to tweak the parameters to improve N50? I am adding my stats below; I am very worried by the poor draft genome I obtained from my Nanopore data with NextDenovo.
An N50 of ca. 0.15 Mb (reached with ca. 1K contigs) is far below the 2 Mb I recovered with wtdbg2, although wtdbg2 used ca. 20K contigs. wtdbg2 is known to produce a "short" N50, so I was surprised that NextDenovo produced an even lower one.
[34217 INFO] 2024-06-19 22:41:08 asm stat:
[34217 INFO] 2024-06-19 22:41:08
Type    Length (bp)    Count (#)
N10          244613          139
N20          207530          328
N30          182044          545
N40          165150          789
N50          149992         1058
N60          134362         1355
N70          118331         1689
N80          101887         2073
N90           84225         2528
Min.          24628            -
Max.         844975            -
Ave.         135499            -
Total     422080978         3115
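For readers unfamiliar with the table, the standard Nx definition can be sketched as follows: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches x% of the assembly. Reading the Count column as "number of contigs needed to reach that point" is an interpretation of the output above, not something confirmed in this thread, and `nx_stat` is a hypothetical helper name.

```python
def nx_stat(lengths, fraction):
    """Return (Nx length, contigs needed) for a fraction such as 0.5 for N50."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total past the threshold
        # defines the Nx value.
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be a non-empty collection of contig lengths")


# Toy example: total length 30, so N50 is reached at a running total of 15.
print(nx_stat([2, 4, 6, 8, 10], 0.5))
```

With real data, `lengths` would be the contig lengths read from the assembly FASTA.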
I would appreciate any thoughts you could share on how to improve N50 before running NextPolish, which is the next step I will take.
Best, Romina
As a follow-up to my previous comment, I will try:
nextgraph_options = -a 1 -q 10
and compare the results with my previous run, which was set as:
nextgraph_options = -a 1
following the FAQ.
Best, Romina
Don't set seed_cutoff; let NextDenovo calculate it automatically. I'm not sure whether it will improve the assembly N50, but you can try it.
Currently running both:
parameter1
[General]
job_type = local # I set up as local, as a recommendation from the developer
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun =
parallel_jobs = 12 # number of tasks used to run in parallel
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly_2/input.fofn
workdir = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly_2
submit = sbatch -p high-mem --cpus-per-task=24 --mem=256g -o /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/error/Assemb_nextgraph.out -e /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/error/Assemb_nextgraph.err /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/scripts/assembly_NextDeNovo_nextgraph.sh
[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2.7g #estimated genome for Titi-Monkeys
blocksize = 3g
seed_cutfiles = 10
sort_options = -m 20g -t 30 -k 40
minimap2_options_raw = -x ava-pb -t 80
pa_correction = 12 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 12
[assemble_option]
random_round = 50
minimap2_options_cns = -t 12 -k 31 -w 17
minimap2_options_map = -t 12
nextgraph_options = -a 1 -q 10
&
parameter2
[General]
job_type = local # I set up as local, as a recommendation from the developer
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun =
parallel_jobs = 12 # number of tasks used to run in parallel
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly_2/input.fofn
workdir = /gws/nopw/j04/rotcotm/rominab/projects/NextDeNovo_Assembly_2
submit = sbatch -p high-mem --cpus-per-task=24 --mem=256g -o /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/error/Assemb_nextgraph.out -e /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/error/Assemb_nextgraph.err /gws/nopw/j04/rotcotm/rominab/projects/6_Pgrovesi_Nanopore/scripts/assembly_NextDeNovo_nextgraph.sh
[correct_option]
read_cutoff = 10k
genome_size = 2.7g #estimated genome for Titi-Monkeys
blocksize = 3g
seed_cutfiles = 10
sort_options = -m 20g -t 30 -k 40
minimap2_options_raw = -x ava-pb -t 80
pa_correction = 12 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 12
[assemble_option]
random_round = 50
minimap2_options_cns = -t 12 -k 31 -w 17
minimap2_options_map = -t 12
nextgraph_options = -a 1 -q 10
I will let you know which performed better as soon as both finish.
All your comments are much appreciated @moold!
Romina
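The two configurations above are long and nearly identical, so a quick programmatic comparison can confirm what actually differs. The sketch below uses Python's standard configparser purely as a comparison tool (it is not claimed to be how NextDenovo itself parses the file) and embeds shortened excerpts of the two configs; `load_cfg` and `diff_cfg` are hypothetical helper names.

```python
import configparser

PARAMETER1 = """
[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2.7g #estimated genome for Titi-Monkeys
[assemble_option]
nextgraph_options = -a 1 -q 10
"""

PARAMETER2 = """
[correct_option]
read_cutoff = 10k
genome_size = 2.7g #estimated genome for Titi-Monkeys
[assemble_option]
nextgraph_options = -a 1 -q 10
"""


def load_cfg(text):
    """Parse cfg text, treating ' #...' as an inline comment."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None
    )
    parser.read_string(text)
    return parser


def diff_cfg(text_a, text_b):
    """Return {(section, key): (value_a, value_b)} for every differing option."""
    a, b = load_cfg(text_a), load_cfg(text_b)
    changes = {}
    for section in sorted(set(a.sections()) | set(b.sections())):
        opts_a = dict(a[section]) if a.has_section(section) else {}
        opts_b = dict(b[section]) if b.has_section(section) else {}
        for key in sorted(set(opts_a) | set(opts_b)):
            if opts_a.get(key) != opts_b.get(key):
                changes[(section, key)] = (opts_a.get(key), opts_b.get(key))
    return changes


print(diff_cfg(PARAMETER1, PARAMETER2))
```

On the full files, the same comparison should report seed_cutoff as the only difference, with parameter2 leaving it unset as suggested.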
Results - Stats for parameter1
Type    Length (bp)    Count (#)
N10          255304          320
N20          201338          796
N30          171770         1370
N40          151082         2032
N50          135789         2776
N60          122250         3602
N70          110702         4518
N80           98515         5533
N90           84163         6698
Min.          12547            -
Max.        1393551            -
Ave.         129917            -
Total    1063761647         8188
Results - Stats for parameter2
Type    Length (bp)    Count (#)
N10         1899231           96
N20         1302148          252
N30          973543          469
N40          732325          755
N50          546282         1136
N60          404893         1653
N70          285287         2365
N80          195137         3383
N90          125235         4923
Min.          16174            -
Max.        5437132            -
Ave.         311254            -
Total    2410043747         7743
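Besides N50, the Total row is worth comparing against the expected genome size of 2.7 Gb, since it shows how much of the genome each run recovered. The short calculation below uses only the totals and N50 values reported in this thread; the run labels are informal descriptions, not NextDenovo terminology.

```python
EXPECTED_GENOME_BP = 2.7e9  # expected genome size stated in this thread

# (total assembled bp, N50 in bp) copied from the three stats tables.
runs = {
    "first run (seed_cutoff = 20k, -a 1)": (422080978, 149992),
    "parameter1 (seed_cutoff = 20k, -a 1 -q 10)": (1063761647, 135789),
    "parameter2 (seed_cutoff unset, -a 1 -q 10)": (2410043747, 546282),
}


def fraction_recovered(total_bp, expected_bp=EXPECTED_GENOME_BP):
    """Fraction of the expected genome size covered by the assembly total."""
    return total_bp / expected_bp


for label, (total_bp, n50) in runs.items():
    print(f"{label}: {fraction_recovered(total_bp):.1%} of expected size, N50 = {n50} bp")
```

By this measure, parameter2 is the only run whose total comes close to the expected genome size, in line with its higher N50.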
:heavy_exclamation_mark: In summary, parameter2 performed best among all the trials I ran, but still not as well as I was expecting. Hopefully I can improve it by polishing.
:pushpin: Both assemblies finished in about 7h (very fast!) after setting --mem=256G.
Thank you very much @moold for your time and support, I will now close this issue.
Romina
Question or Expected behavior
I am trying to use NextDenovo to assemble a genome for a non-human primate (expected genome size = 2.7 Gb). The data were generated from frozen tissue (stored >10 years) on a P2Solo, giving 35x coverage. I am trying to assemble after running guppy6.3.8-gpu for base calling.
The job has been running for ca. 4 days. It generated the following:
It seems the job is not generating the expected output. I wonder if I need to change more parameters in my run.cfg (see below).
Additional context (Optional) After reading many issues from this repo I built the following run.cfg
I would appreciate any feedback.
Best regards, Romina Batista