Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Help with parameters for desktop run #150

Closed: bensprung closed this issue 1 year ago

bensprung commented 1 year ago

Hi, I can't figure out how to set the various parameters in a self-consistent way. I'm on a desktop with 6 cores and 32 GB RAM, working with a yeast genome of ~12 Mb at about 230x coverage with ONT reads, so about 2.8e09 bases. The FAQ says:

parallel_jobs = M/64 #here, 64 can optimize to 32~64
...

[correct_option]
pa_correction = M/(TOTAL_INPUT_BASES * 1.2/4)
sort_options = -m TOTAL_INPUT_BASES * 1.2/4g -t P/pa_correction
correction_options = -p P/pa_correction
minimap2_options_raw = -t P/parallel_jobs
...

[assemble_option]
minimap2_options_cns = -t P/parallel_jobs

Since parallel_jobs comes to 32/64, I assume I should set it to 1? pa_correction comes to 32e09/(2.8e09*1.2/4) = 38. But then P/pa_correction = 6/38 << 1, so I'm not sure how to proceed.
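To make the arithmetic concrete, here is a minimal sketch that evaluates the FAQ formulas quoted above for a given machine. The function name `suggest_settings`, the floor of 1 on every value, and the cap of `pa_correction` at the core count are assumptions made for illustration only; they are not documented NextDenovo behaviour.

```python
def suggest_settings(mem_gb, threads, total_input_bases, mem_per_job_gb=64):
    """Evaluate the FAQ formulas, keeping every result at 1 or more.

    mem_gb            -- M, total RAM in GB
    threads           -- P, total CPU threads
    total_input_bases -- TOTAL_INPUT_BASES, in bases
    The floor of 1 and the cap at `threads` are assumptions of this sketch.
    """
    # parallel_jobs = M/64, floored at 1
    parallel_jobs = max(1, int(mem_gb // mem_per_job_gb))

    # Memory per sort job in GB: TOTAL_INPUT_BASES * 1.2/4
    sort_mem_gb = total_input_bases * 1.2 / 4 / 1e9

    # pa_correction = M / (TOTAL_INPUT_BASES * 1.2/4), before any cap
    pa_correction_raw = mem_gb / sort_mem_gb

    # Assumed cap: never more correction jobs than available threads,
    # so that P/pa_correction cannot drop below 1
    pa_correction = max(1, min(int(pa_correction_raw), threads))

    return {
        "parallel_jobs": parallel_jobs,
        "sort_mem_gb": sort_mem_gb,
        "pa_correction_raw": pa_correction_raw,
        "pa_correction": pa_correction,
        # -t / -p values: P/pa_correction and P/parallel_jobs, floored at 1
        "threads_per_correction_job": max(1, threads // pa_correction),
        "threads_per_minimap2_job": max(1, threads // parallel_jobs),
    }


if __name__ == "__main__":
    # The desktop described in this thread: 32 GB RAM, 6 cores, 2.8e9 bases
    for key, value in suggest_settings(32, 6, 2.8e9).items():
        print(f"{key} = {value}")
```

For the machine in this thread, the uncapped `pa_correction` is about 38, as computed above; with the assumed cap it becomes 6, leaving 1 thread per correction job. Whether that is a sensible configuration for NextDenovo is a separate question that the formulas alone do not answer.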

I also got the following warning:

*Suggested seed_cutoff (genome size: 12.00Mb, expected seed depth: 45, real seed depth: 25.00): 8721 bp
*NOTE: The read/seed length is too short, and the assembly result is unexpected and please check the assembly quality carefully. Of course, it's better to sequencing more longer reads and try again.

I left read_cutoff = 1k and I set genome_size = 12m.

moold commented 1 year ago

  1. You cannot run NextDenovo on a computer with 32 GB of RAM; that is too little memory.
  2. The warning means that the input ONT reads are too short for NextDenovo.

So I suggest trying other assemblers.

bensprung commented 1 year ago

OK. How much RAM is the minimum for a 12 Mbp genome? (I do have more ONT reads; I only gave it a subset to try it out. I think I have up to 800x coverage, and definitely 400x.)

bensprung commented 1 year ago

FWIW I did get a reasonable-looking assembly out using these parameters:

[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovo
task = all # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 1 # number of tasks used to run in parallel
input_type = raw # raw, corrected
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = BGS1_uncorr_nextDenovo

[correct_option]
read_cutoff = 1k
genome_size = 12m # estimated genome size
sort_options = -m 8g -t 4 
minimap2_options_raw = -t 4 
correction_options = -p 1 

[assemble_option]
minimap2_options_cns = -t 4
nextgraph_options = -a 1

moold commented 1 year ago

  1. It depends on the input data size, maximum read length, genome size, etc., so it is hard to say.
  2. You can select the top 60X longest ONT reads for the assembly.

bensprung commented 1 year ago

Thanks. What do you mean by the top 60X longest? Select the longest reads sufficient to give 60X coverage?

moold commented 1 year ago

Yes
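For anyone reading later, "the longest reads sufficient to give 60X coverage" can be sketched as below. This is only an illustration of the selection rule confirmed above; the function name `select_longest_reads` and its arguments are hypothetical, and it works on a plain list of read lengths rather than on FASTQ files.

```python
def select_longest_reads(read_lengths, genome_size, target_depth=60):
    """Pick the longest reads until their total length reaches the target depth.

    read_lengths -- list of read lengths in bases
    genome_size  -- estimated genome size in bases
    target_depth -- desired coverage (60 means 60X)
    Returns (indices of the selected reads, total selected bases).
    """
    target_bases = genome_size * target_depth

    # Indices of reads ordered from longest to shortest
    order = sorted(range(len(read_lengths)),
                   key=lambda i: read_lengths[i],
                   reverse=True)

    selected = []
    total = 0
    for i in order:
        if total >= target_bases:
            break  # enough bases collected; all remaining reads are shorter
        selected.append(i)
        total += read_lengths[i]
    return selected, total


if __name__ == "__main__":
    # Toy example: 10-base "genome", so 60X corresponds to 600 bases
    lengths = [100, 500, 300, 200, 400]
    picked, bases = select_longest_reads(lengths, genome_size=10)
    print(picked, bases)
```

If the input holds less than the target depth in total, the sketch simply returns every read. For a 12 Mb genome, 60X corresponds to 7.2e8 bases.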

bensprung commented 1 year ago

Got it. Thank you.