Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Asking for the 'read_cutoff' and 'seed_cutoff' parameters #103

Open Hans-zhao831 opened 3 years ago

Hans-zhao831 commented 3 years ago

Hi Dr. Hu, thank you for developing such a powerful genome assembler. Over the past two weeks I have found that NextDenovo gives the best assemblies for the plant species I study, and I am very happy with the results. Since I don't have much experience with the software, I am still trying to improve the assembly by tuning parameters. I have run into a few questions along the way and would like to ask for your advice.

Before my questions, here is a quick overview of the project: PacBio data, ~110x raw coverage, a diploid plant, a genome size of 2 Gb, 0.7% heterozygosity, ~60% repeat content, and NextDenovo v2.4.0.

Summary of raw data

Category           Data
Base Num           228,333,067,000
Reads Num          18,120,500
>=2k Reads Num     90%
>=5k Reads Num     75%
>=7k Reads Num     66%
>=10k Reads Num    53%
>=13k Reads Num    42%
>=15k Reads Num    35%
Mean Length        12k
N50                17k
Median Length      11k
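
For readers who want to reproduce a summary like this from their own reads, here is a minimal Python sketch. It assumes the read lengths have already been extracted into a list of integers; the function names are illustrative and are not part of NextDenovo.

```python
from statistics import median


def read_length_summary(lengths):
    """Summarise a list of read lengths (in bases)."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: walking from the longest read downward, the length of the read
    # at which the running sum first reaches half of all bases.
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "reads": len(ordered),
        "bases": total,
        "mean": total / len(ordered),
        "median": median(ordered),
        "n50": n50,
    }


def fraction_at_least(lengths, cutoff):
    """Fraction of reads whose length is >= cutoff."""
    return sum(1 for n in lengths if n >= cutoff) / len(lengths)


def coverage(total_bases, genome_size):
    """Raw sequencing depth: total bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    # Values taken from the summary above: 228,333,067,000 bases, 2 Gb genome.
    print(round(coverage(228_333_067_000, 2_000_000_000), 1))
```

With the base count above and a 2 Gb genome, `coverage` gives about 114x, in line with the ~110x quoted in the overview.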

1. How can I determine the best values of read_cutoff and seed_cutoff, and the best combination of the two?

I produced four assemblies that differ only in seed_cutoff; all other parameters were identical (read_cutoff = 10k).

Run    seed_cutoff (seed_depth)    Contig N50 (Mb)    Contig Num    Contig size (Gb)
run1   19436 (50)                  13.71              421           1.95
run2   20000                       14.66              403           1.95
run3   20553 (45)                  14.29              391           1.95
run4   24645 (30)                  12.22              491           1.95

I also produced two assemblies that differ only in read_cutoff; all other parameters were identical (seed_cutoff = 20k).

Run    read_cutoff    Contig N50 (Mb)    Contig Num    Contig size (Gb)
run2   10k            14.66              403           1.95
run5   1k             10.48              816           2.00
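
As the number of runs grows, it can help to keep the results in a small structure and rank them programmatically. The sketch below records the five runs reported here and sorts them by contig N50 (higher first), breaking ties by contig count (lower first). That ranking rule is only an assumption for illustration; other criteria such as completeness or misassembly checks are not covered.

```python
# Results copied from the two tables above; N50 is in Mb.
RUNS = [
    {"run": "run1", "read_cutoff": 10000, "seed_cutoff": 19436, "n50_mb": 13.71, "contigs": 421},
    {"run": "run2", "read_cutoff": 10000, "seed_cutoff": 20000, "n50_mb": 14.66, "contigs": 403},
    {"run": "run3", "read_cutoff": 10000, "seed_cutoff": 20553, "n50_mb": 14.29, "contigs": 391},
    {"run": "run4", "read_cutoff": 10000, "seed_cutoff": 24645, "n50_mb": 12.22, "contigs": 491},
    {"run": "run5", "read_cutoff": 1000, "seed_cutoff": 20000, "n50_mb": 10.48, "contigs": 816},
]


def rank_runs(runs):
    """Sort runs by contig N50 (descending), then contig count (ascending)."""
    return sorted(runs, key=lambda r: (-r["n50_mb"], r["contigs"]))


if __name__ == "__main__":
    for r in rank_runs(RUNS):
        print(r["run"], r["read_cutoff"], r["seed_cutoff"], r["n50_mb"], r["contigs"])
```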

run2.cfg

[General]
job_type = sge 
job_prefix = nextDenovo 
task = all 
rewrite = no 
deltmp = yes 
rerun = 
parallel_jobs = 20 
input_type = raw 
input_fofn = run.fofn
read_type = clr
workdir = 01_rundir
cluster_options = auto

[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
blocksize = 3g   
pa_correction = 20
seed_cutfiles = 10 
sort_options = -m 20g -t 80 -k 40
minimap2_options_raw = -x ava-pb -t 80
correction_options = -p 80

[assemble_option]
random_round = 50
minimap2_options_cns = -x ava-pb -t 80 -k17 -w17
minimap2_options_map = -t 80
nextgraph_options = -a 1

These results confirm that seed_cutoff and read_cutoff have a large impact on final assembly quality. However, I am unsure how to find the best value for each, and the best combination of the two.

2. How can I quickly obtain the final result after changing a few parameters, without rerunning from beginning to end?

Currently, I have to re-run the software from beginning to end after each parameter change, which takes a long time. Is there a way to quickly get the final result by modifying only one or a few parameters?

I look forward to your suggestions, and please don't hesitate to let me know if you need additional information.

moold commented 3 years ago

Thanks for your feedback!

  1. You can use seq_stat to calculate seed_cutoff. Its -d option can usually be set to 30-45, so you need to try different values. I don't have a better suggestion; if I had a better value, I would set it as the default.
  2. If you change read_cutoff or seed_cutoff, you need to run from beginning to end. If you change only nextgraph_options, just run the main task again and NextDenovo will rerun only the assembly step.
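
To make the relationship between seed depth and seed_cutoff concrete, here is a minimal Python sketch. It is not seq_stat's actual implementation; it assumes the cutoff is the length at which reads taken from the longest downward first add up to seed_depth x genome_size bases, after discarding reads shorter than read_cutoff. The function and parameter names are hypothetical.

```python
def seed_cutoff_for_depth(lengths, genome_size, seed_depth, read_cutoff=0):
    """Largest length cutoff whose reads still provide the target seed depth.

    Assumption for illustration only: seeds are the longest reads, and the
    target is seed_depth * genome_size bases among reads >= read_cutoff.
    """
    kept = sorted((n for n in lengths if n >= read_cutoff), reverse=True)
    target = seed_depth * genome_size
    running = 0
    for length in kept:
        running += length
        if running >= target:
            return length
    # Not enough data to reach the target depth: every kept read is a seed.
    return read_cutoff


if __name__ == "__main__":
    toy_lengths = [10, 8, 6, 4, 2]
    # A higher depth needs more reads, so the cutoff drops.
    for depth in (1, 2):
        print(depth, seed_cutoff_for_depth(toy_lengths, genome_size=10, seed_depth=depth))
```

Under this assumption a larger seed depth always gives a smaller cutoff, which matches the runs reported in this thread (depth 50 gave 19436, depth 45 gave 20553, depth 30 gave 24645).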

Hans-zhao831 commented 3 years ago

Thanks for your reply.

Based on your experience, could you suggest a strategy for finding the optimal seed_cutoff and read_cutoff? For example:

  1. Can the -f option in seq_stat be treated as read_cutoff?
  2. Is the effect of these two values on the assembly linear or non-linear?
  3. Should we optimize each parameter individually, or do we need to test different combinations of these two parameters, possibly together with other parameters?

moold commented 3 years ago

  1. Yes.
  2. Not tested.
  3. Different combinations.
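
Since the advice is to test different combinations, and each change to read_cutoff or seed_cutoff requires a full rerun, one config file per combination is a natural setup. The following Python sketch generates such a grid from a template like run2.cfg. The helper names, the run naming scheme, and the choice to give each combination its own workdir are assumptions for illustration; nothing here is part of NextDenovo itself.

```python
import itertools
import re


def set_option(cfg_text, key, value):
    """Replace the value on the single `key = ...` line of a cfg text."""
    pattern = rf"^({re.escape(key)}[ \t]*=[ \t]*).*$"
    new_text, count = re.subn(
        pattern, lambda m: m.group(1) + str(value), cfg_text, flags=re.MULTILINE
    )
    if count != 1:
        raise ValueError(f"expected exactly one '{key}' line, found {count}")
    return new_text


def grid_configs(template, read_cutoffs, seed_cutoffs, workdir_prefix="01_rundir"):
    """Return {run_name: cfg_text} for every read_cutoff/seed_cutoff pair."""
    configs = {}
    for rc, sc in itertools.product(read_cutoffs, seed_cutoffs):
        name = f"rc{rc}_sc{sc}"
        cfg = set_option(template, "read_cutoff", rc)
        cfg = set_option(cfg, "seed_cutoff", sc)
        # Separate workdir per combination so runs do not overwrite each other.
        cfg = set_option(cfg, "workdir", f"{workdir_prefix}_{name}")
        configs[name] = cfg
    return configs


if __name__ == "__main__":
    template = (
        "[General]\n"
        "workdir = 01_rundir\n"
        "[correct_option]\n"
        "read_cutoff = 10k\n"
        "seed_cutoff = 20k\n"
    )
    for name, cfg in grid_configs(template, [1000, 10000], [19436, 20553]).items():
        with open(f"{name}.cfg", "w") as handle:
            handle.write(cfg)
```

Each generated file can then be launched as a separate run and the resulting contig N50, contig count, and assembly size compared across the grid.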