Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Does NextDenovo apply to diploid genome assembly? #66

Closed: huangyixian123 closed this issue 3 years ago

huangyixian123 commented 4 years ago

I tried to assemble a diploid genome (7G diploid genome size, heterozygosity rate: 1.14%, repeat content: 80.4%, based on 17-mers) using NextDenovo, but in the end I only got a 3.3G genome (N50: 749 Kb) with 47.4% complete BUSCOs. Can NextDenovo be applied to diploid genomes? If not, my 3.3G genome is just a haploid assembly, but then why is the BUSCO score so low?

moold commented 4 years ago

What is your data type: CLR, HiFi, or Nanopore? How much data did you use for the assembly? Did you use NGS reads to polish the genome before evaluating the BUSCO score? BTW, please paste your config file here.

huangyixian123 commented 4 years ago

Thanks. My data are PacBio (400G) and Nanopore (100G) reads, and the NextDenovo assembly took about 1 month. The draft assembly was polished with Illumina data (1T) using NextPolish before evaluating the BUSCO score. The NextDenovo config file is:

[general]
job_type = lsf
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 110
input_type = raw
input_fofn = input.fofn
workdir = 01_rundir
cluster_options = -n {cpu}

[correct_option]
read_cutoff = 1k
seed_cutoff = 13539
blocksize = 2g
pa_correction = 20
seed_cutfiles = 20
sort_options = -m 20g -t 10 -k 40
minimap2_options_raw = -x ava-ont -t 8
correction_options = -p 20

[assemble_option]
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1

And the NextPolish config file is:

[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 100
multithread_jobs = 8
genome=01_rundir/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph00/nextgraph.assembly.contig.fasta
genome_size = auto
workdir = nextpolish1
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = sgs.fofn
sgs_options = -max_depth 1

moold commented 4 years ago
  1. I do not recommend assembling PacBio and ONT data together. It is better to assemble with PacBio or ONT data only, because they have different error patterns and need different alignment options.
  2. For a diploid genome with CLR or ONT reads, most common assemblers only output haploid contigs plus some alternative contigs in highly heterozygous regions. If you want to assemble the two haploid genomes, it is much better to use HiFi data.
  3. sgs_options = -max_depth 1 should be sgs_options = -max_depth 100. You can also map RNA-seq reads or short genomic reads to your assembly and calculate the mapping rate, to check whether the BUSCO score is right.
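
As a rough illustration of the mapping-rate check in point 3, here is a minimal Python sketch. It assumes the reads have already been aligned to the assembly and summarized with `samtools flagstat`, and that the summary text is available as a string. The function name `mapping_rate` and the sample numbers in `SAMPLE_FLAGSTAT` are hypothetical and only there for illustration. This is not part of NextDenovo or NextPolish.

```python
import re

# Hypothetical flagstat-style summary (made-up numbers, not real data).
SAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

def mapping_rate(flagstat_text):
    """Return mapped / total for QC-passed records in `samtools flagstat` text."""
    total = None
    mapped = None
    for line in flagstat_text.splitlines():
        # Each line looks like: "<passed> + <failed> <label ...>"
        m = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not m:
            continue
        count, label = int(m.group(1)), m.group(3)
        if label.startswith("in total"):
            total = count
        elif label.startswith("mapped ("):
            # Lines such as "primary mapped (" do not match this prefix.
            mapped = count
    if not total or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    return mapped / total

if __name__ == "__main__":
    print(f"mapping rate: {mapping_rate(SAMPLE_FLAGSTAT):.2%}")
```

Note that the "in total" count in flagstat output also includes secondary and supplementary alignments, so this gives only an approximate rate. If the mapping rate is high while the BUSCO score stays low, the BUSCO result is worth a second look. A low mapping rate would instead point to a problem with the assembly itself.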