Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Decreased N50 with higher sequencing depth #104

Open nadegeguiglielmoni opened 3 years ago

nadegeguiglielmoni commented 3 years ago

Hello,

I have been running some tests with NextDenovo 2.2 on one genome for which I have high coverages of PacBio and Nanopore reads. For both datasets separately, I tried subsampling the reads to different sequencing depths (10X, 20X... 100X). I found that at 40-50X I would have the highest N50, but with higher sequencing depths the N50 decreased. As the species is diploid with variable levels of heterozygosity, including some regions with high heterozygosity, my hypothesis is that a higher sequencing depth gives more support to alternative haplotypes and leads to breaks in the assembly. Could you give me some insights?

moold commented 3 years ago

Hi, could you provide your config files? BTW, you should update to the latest version.

nadegeguiglielmoni commented 3 years ago

We have updated NextDenovo for future projects.

Here is the config file:

[General]
job_type = local
job_prefix = ND_ont
task = assemble # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 10
parallel_jobs = 10
input_type = raw
input_fofn = ./input.fofn
workdir = ./run

[assemble_option]
minimap2_options_raw = -x ava-ont -t 10
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1
seed_cutoff = HereSeedCutoff

moold commented 3 years ago

How about the seed_cutoff value for different depths?

nadegeguiglielmoni commented 3 years ago

We set it to 1001.

moold commented 3 years ago

OK, I think this may be the core of the problem. You can try calculating the seed_cutoff value using bin/seq_stat; see #103. Usually, assembly quality is affected by read length, not by depth.
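For readers unfamiliar with what a seed cutoff does: conceptually, it is the minimum read length such that the reads at or above that length sum to roughly a target seed depth times the genome size. The exact heuristic inside bin/seq_stat may differ; this is only a minimal sketch of the idea, with illustrative names.

```python
def seed_cutoff(read_lengths, genome_size, seed_depth=45):
    """Return the smallest read length such that reads at least that
    long (taken longest-first) cover ~seed_depth * genome_size bases.
    Sketch only; NextDenovo's bin/seq_stat may use a different rule."""
    target = seed_depth * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    # Not enough data to reach the target depth: keep every read.
    return min(read_lengths)

# Toy example: a 1 kb "genome" with 100 reads of 500 bp and 200 of 300 bp.
lengths = [500] * 100 + [300] * 200
print(seed_cutoff(lengths, genome_size=1000, seed_depth=45))  # → 500
```

This also shows why a fixed cutoff (e.g. 1001) behaves differently at different depths: as depth grows, a longer cutoff is needed to keep the seed depth constant.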

nadegeguiglielmoni commented 3 years ago

Ok thank you, I will try optimizing the seed cutoffs.

nadegeguiglielmoni commented 3 years ago

Hello,

We ran the assemblies again with better-adapted seed cutoffs. For the PacBio assemblies, there is little change. For the Nanopore assemblies, there is still a drop in N50 at 60X. The N50 is better for the assemblies at 80X and 100X, but the BUSCO score is drastically decreased compared to the previous assemblies.

moold commented 3 years ago

Thanks for your feedback. Assembly quality is not simply linear in the depth and length of the input data; it also depends on the characteristics of the genome. However, the BUSCO scores should be similar, so could you share more details (assembly options and BUSCO values) about how "the BUSCO score is drastically decreased compared to previous assemblies"?

nadegeguiglielmoni commented 3 years ago

Hello,

The parameters were the same as before, except for seed cutoff.

Here are the results I had before with Nanopore reads:

40X: N50 = 11.5-14.5 Mb, single BUSCOs = 312-388, duplicated BUSCOs = 12-24
50X: N50 = 11.0-13.8 Mb, single BUSCOs = 362-393, duplicated BUSCOs = 14-27
60X: N50 = 4.7-8.1 Mb, single BUSCOs = 668-685, duplicated BUSCOs = 79-98
80X: N50 = 4.1-10.1 Mb, single BUSCOs = 665-695, duplicated BUSCOs = 78-91
100X: N50 = 2.6-7.0 Mb, single BUSCOs = 663-683, duplicated BUSCOs = 80-105

And here are the results with an "improved" seed cutoff:

40X: N50 = 11.6-14.7 Mb, single BUSCOs = 319-392, duplicated BUSCOs = 12-24
50X: N50 = 10.8-14.8 Mb, single BUSCOs = 348-386, duplicated BUSCOs = 19-23
60X: N50 = 6.0-8.8 Mb, single BUSCOs = 674-694, duplicated BUSCOs = 72-87
80X: N50 = 10.0-13.7 Mb, single BUSCOs = 362-398, duplicated BUSCOs = 19-31
100X: N50 = 10.7-12.4 Mb, single BUSCOs = 404-420, duplicated BUSCOs = 25-43
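For reference, the N50 values compared above can be computed from contig lengths alone; a minimal sketch (not NextDenovo code):

```python
def n50(contig_lengths):
    """Smallest contig length such that contigs at least that long
    together cover >= 50% of the total assembly size."""
    half = sum(contig_lengths) / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length

# 10 + 8 = 18 >= 15 (half of 30), so N50 is 8.
print(n50([10, 8, 6, 4, 2]))  # → 8
```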

moold commented 3 years ago

Hi, could you provide the estimated genome size and assembly size? Do you randomly subsample reads, or just select the longest reads?
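The distinction behind this question can be sketched as follows; the function names are illustrative, not part of NextDenovo or any subsampling tool:

```python
import random

def subsample_random(reads, genome_size, depth, seed=0):
    """Randomly keep reads until ~depth * genome_size bases are kept."""
    target = depth * genome_size
    pool = list(reads)
    random.Random(seed).shuffle(pool)
    kept, total = [], 0
    for r in pool:
        if total >= target:
            break
        kept.append(r)
        total += len(r)
    return kept

def subsample_longest(reads, genome_size, depth):
    """Keep the longest reads until ~depth * genome_size bases are kept."""
    target = depth * genome_size
    kept, total = [], 0
    for r in sorted(reads, key=len, reverse=True):
        if total >= target:
            break
        kept.append(r)
        total += len(r)
    return kept
```

The two strategies matter here because random subsampling preserves the read-length distribution at every depth, while longest-first selection shifts it toward longer reads, which can change the assembly N50 independently of depth.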