Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Reduce misassembly #129

Closed ghost closed 2 years ago

ghost commented 2 years ago

Dear Dr. Hu,

I recently ran NextDenovo (v2.4.0) on ~140x PacBio data from a 700 Mb genome. When I checked the results against the reference genome, I found that the NextDenovo assembly contained a misassembly: parts of two different chromosomes had been joined into one contig. I would like to change the parameters and try again. Which parameters should I adjust, and from which stage should I restart after changing them?
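A check like the one described here can be sketched in Python. This is only an illustration and not part of NextDenovo: it assumes the contigs were aligned to the reference with a tool that writes PAF (for example minimap2), and the function name `find_chimeric_contigs` and the thresholds `min_aligned` and `min_mapq` are hypothetical choices.

```python
from collections import defaultdict

def find_chimeric_contigs(paf_lines, min_aligned=50_000, min_mapq=20):
    """Return {contig: {ref_seq: aligned_bp}} for contigs that align
    substantially to more than one reference sequence.

    min_aligned and min_mapq are hypothetical thresholds; tune them per genome.
    """
    aligned = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        # PAF columns: 1 query name, 3 query start, 4 query end,
        # 6 target name, 12 mapping quality
        contig, q_start, q_end = fields[0], int(fields[2]), int(fields[3])
        ref, mapq = fields[5], int(fields[11])
        if mapq < min_mapq:
            continue  # ignore ambiguous alignments, e.g. from repeats
        aligned[contig][ref] += q_end - q_start
    chimeric = {}
    for contig, per_ref in aligned.items():
        major = {ref: bp for ref, bp in per_ref.items() if bp >= min_aligned}
        if len(major) > 1:
            chimeric[contig] = major
    return chimeric

# Made-up example records: ctg1 is split between chr1 and chr2.
example = [
    "ctg1\t300000\t0\t150000\t+\tchr1\t1000000\t0\t150000\t149000\t150000\t60",
    "ctg1\t300000\t150000\t300000\t+\tchr2\t1000000\t500000\t650000\t149000\t150000\t60",
    "ctg2\t200000\t0\t200000\t-\tchr3\t1000000\t0\t200000\t199000\t200000\t60",
]
print(find_chimeric_contigs(example))
```

Overlapping alignments are counted twice in this sketch, and a contig flagged this way may also reflect a real rearrangement relative to the reference, so flagged contigs still need manual inspection.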

The parameters for the first run are as follows.

Thank you.


[General]
job_type = local
job_prefix = sample
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 7
input_type = raw
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = .

[correct_option]
read_cutoff = 4k
genome_size = 700m # estimated genome size
sort_options = -m 20g -t 12
minimap2_options_raw = -t 12
pa_correction = 7
correction_options = -p 12

[assemble_option]
minimap2_options_cns = -t 12
nextgraph_options = -a 1

moold commented 2 years ago

Hi, it is difficult to avoid misassembly completely, so I recommend splitting the assembly using the reference genome or other data. If you still want to tune some parameters, you can try increasing seed_cutoff (rerun the whole pipeline), changing -k and -w in minimap2_options_cns (rerun from 02.cns_align/02.cns_align.sh), or changing -I, -R, -S, -r, -M, -T, and -m in nextgraph_options (rerun from 03.ctg_graph).
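The restart points in the reply above can be summarised as a small lookup. This is a sketch, not a NextDenovo feature: it covers only the three options named in that reply, and the names `RESTART_POINTS` and `earliest_restart` are hypothetical.

```python
# Restart points as stated in the reply above, ordered from the
# earliest pipeline stage to the latest.
RESTART_POINTS = [
    ("seed_cutoff", "rerun the whole pipeline"),
    ("minimap2_options_cns", "rerun from 02.cns_align/02.cns_align.sh"),
    ("nextgraph_options", "rerun from 03.ctg_graph"),
]

def earliest_restart(changed_options):
    """Return the earliest restart point needed for the changed config keys."""
    changed = set(changed_options)
    for option, restart in RESTART_POINTS:
        if option in changed:
            return restart
    return None  # none of the options listed above were changed

print(earliest_restart(["nextgraph_options", "minimap2_options_cns"]))
```

When several options change at once, the earliest affected stage decides where to restart, which is why the list is kept in pipeline order.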

ghost commented 2 years ago

Thank you for your reply. When I aligned the corrected reads to the results of other assemblers, I found no major errors, so the correction step seems to be fine. I therefore think the problem is in 02.cns_align/02.cns_align.sh or later. I'll try tuning nextgraph_options first, since that will give me results quickly. Thank you very much.

ghost commented 2 years ago

Hi Dr. Hu, it has been a long time since I last contacted you.

Based on your previous suggestion, I examined the options for nextgraph_options. As a result, the set of parameters -a 1 -I 0.95 -R 0.30 -S 0.60 -r 0.30 -M 0.90 -T 0.50 -m 1.50 improved the results of the assembly. But to be honest, I don't understand these parameters very well. So I have two questions about this.

  1. I think I have made the settings a bit "stricter" than the default parameters. Is my understanding correct?
  2. Is the balance of the parameters appropriate? (i.e., are any of the values too large or too small?) I would like to know if there is anything I can improve in this regard.

Thank you.

moold commented 2 years ago

Hi, thanks for your feedback. It is difficult to say how to set these parameters to get the best result, because each genome has its own characteristics, and the default parameters are just a balance that lets most genomes get good results. -I (min test-to-best identity ratio) works as follows: a given node usually has many out-edges; if the maximum identity among them is S, then any edge with identity <= S * I is discarded. The other parameters have similar meanings. As for your question, I think -I may be too high, and the others are OK. But if you have evaluated the assembly results and found no major problems, there is no need to keep tuning parameters.
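The -I rule described above can be illustrated with a short Python sketch. This is not NextDenovo's actual code: the function name `filter_out_edges`, the edge representation, and the example identities are assumptions made for illustration, and the value 0.90 below is an arbitrary lower ratio, not the tool's default.

```python
def filter_out_edges(out_edges, min_ratio):
    """Drop out-edges whose identity is <= best identity * min_ratio.

    out_edges: list of (target_node, identity) pairs for a single node.
    min_ratio: the value passed as -I (test-to-best identity ratio).
    """
    if not out_edges:
        return []
    best = max(identity for _, identity in out_edges)  # S in the description
    cutoff = best * min_ratio                          # S * I
    return [(node, identity) for node, identity in out_edges if identity > cutoff]

# Made-up out-edges of one node.
edges = [("b", 0.98), ("c", 0.95), ("d", 0.90)]
print(filter_out_edges(edges, 0.95))  # cutoff 0.931: keeps b and c
print(filter_out_edges(edges, 0.90))  # cutoff 0.882: keeps b, c and d
```

A higher -I raises the cutoff relative to the best edge, so fewer alternative edges survive, which is consistent with the comment that -I 0.95 may be too high.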

ghost commented 2 years ago

Thank you for your kind reply!