Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Asking for parameter suggestions for NextDenovo #89


LeoCao-X commented 3 years ago

Question or expected behavior

Hello Dr. Hu, thank you for developing such a powerful genome assembler; it helps us assemble genomes efficiently. I have run NextDenovo on my organism's genome, but the results are not as good as I hoped. Could you give me some suggestions?

Before running NextDenovo, I used seq_stat to estimate seed_cutoff. My genome is about 700 Mb; after filtering out reads shorter than 1 kb, my rawdata.fasta.gz is about 47 GB, and the expected corrected depth is 45X. seq_stat reported a seed_cutoff of 26084. The parameters I used for NextDenovo are:

[General]
job_type = local
job_prefix = nextDenovo_20200915_V1
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 20
input_type = raw
input_fofn = ./input.fofn
workdir = ./20200915_V1_repeat

[correct_option]
read_cutoff = 2k
seed_cutoff = 26084
blocksize = 3g
pa_correction = 6
seed_cutfiles = 8
sort_options = -m 200g -t 10 -k 45
minimap2_options_raw = -x ava-ont -t 10
correction_options = -p 10

[assemble_option]
random_round = 20
minimap2_options_cns = -x ava-ont -t 6 -k17 -w17
nextgraph_options = -a 1
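For context on how seed_cutoff relates to depth: NextDenovo corrects only the longest "seed" reads, and the cutoff is chosen so that reads at or above it sum to roughly the target seed depth times the genome size. A minimal sketch of that selection logic (toy numbers; this is not the actual seq_stat implementation):

```python
def pick_seed_cutoff(read_lengths, genome_size, seed_depth):
    """Return the length cutoff such that reads >= cutoff
    total roughly seed_depth * genome_size bases."""
    target = seed_depth * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return min(read_lengths)  # not enough data: keep every read

# Toy example: 10 reads, "genome" of 100 bp, aiming for 3x seed depth
reads = [50, 120, 80, 200, 60, 150, 90, 40, 110, 70]
cutoff = pick_seed_cutoff(reads, genome_size=100, seed_depth=3)
```

With the toy numbers, the two longest reads (200 + 150 = 350 bp) already exceed the 300 bp target, so the cutoff lands at 150.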

Operating system: 4.14.65-gentoo

GCC: gcc version 7.3.0 (Gentoo 7.3.0-r3 p1.4)

Python: 3.8.3

NextDenovo: v2.3.1

Assembly results are

| Type | Length (bp) | Count (#) |
| ---- | ----------- | --------- |
| N10 | 15976171 | 3 |
| N20 | 13963409 | 7 |
| N30 | 11335082 | 11 |
| N40 | 8655004 | 17 |
| N50 | 7715527 | 23 |
| N60 | 5467147 | 31 |
| N70 | 4377565 | 42 |
| N80 | 3231799 | 56 |
| N90 | 1273580 | 83 |
| Min. | 27303 | - |
| Max. | 20434899 | - |
| Ave. | 2415019 | - |
| Total | 528889270 | 219 |
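For readers unfamiliar with the table: the N/L statistics are derived directly from the sorted contig lengths. Nxx is the length of the contig at which the running total first reaches xx% of the assembly size, and the Count column is how many contigs it takes to get there. A small sketch:

```python
def n_stat(lengths, fraction):
    """Return (Nxx length, Lxx count) for a list of contig lengths,
    where fraction is e.g. 0.5 for N50/L50."""
    target = fraction * sum(lengths)
    total = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return length, count

contigs = [50, 40, 30, 20, 10]  # toy assembly, total = 150
n50, l50 = n_stat(contigs, 0.5)  # 50 + 40 = 90 >= 75, so N50=40, L50=2
```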

After that, I used NextPolish to polish the assembly with NGS short reads and Nanopore long reads. The parameters are:

[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 12  
multithread_jobs = 10
genome = /data/nd.asm.fasta
genome_size = auto
workdir = ./
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 200 -minimap2 

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 2k -max_depth 100
lgs_minimap2_options = -x map-ont -t 10

Polish results are

| Type | Length (bp) | Count (#) |
| ---- | ----------- | --------- |
| N10 | 16193208 | 3 |
| N20 | 14104345 | 7 |
| N30 | 11490619 | 11 |
| N40 | 8743256 | 17 |
| N50 | 7793210 | 23 |
| N60 | 5532933 | 31 |
| N70 | 4452206 | 42 |
| N80 | 3247878 | 56 |
| N90 | 1278222 | 83 |
| Min. | 27510 | - |
| Max. | 20535269 | - |
| Ave. | 2440059 | - |
| Total | 534372990 | 219 |

Finally, I used BUSCO to evaluate the polished assembly.

Results:

C:80.7%[S:80.7%,D:0.0%],F:11.0%,M:8.3%,n:1367
1103 Complete BUSCOs (C)
1103 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
150 Fragmented BUSCOs (F)
114 Missing BUSCOs (M)
1367 Total BUSCO groups searched
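The summary line is just each count divided by n = 1367 and rounded to one decimal, which is a quick way to sanity-check a BUSCO report:

```python
n_total = 1367
counts = {"C": 1103, "S": 1103, "D": 0, "F": 150, "M": 114}

# Each class as a percentage of the total BUSCO groups searched,
# rounded to one decimal as in the summary line above
pct = {k: round(100 * v / n_total, 1) for k, v in counts.items()}
print(pct)  # {'C': 80.7, 'S': 80.7, 'D': 0.0, 'F': 11.0, 'M': 8.3}
```

Complete (C) is the sum of single-copy (S) and duplicated (D), and C + F + M covers all groups, so the percentages should always reconcile this way.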

Clearly, the BUSCO score is quite low. Could you give me some suggestions for running NextDenovo or NextPolish? My computing resources are about 300 GB of RAM and 80 cores. Are there any more details I should supply?

Thanks so much!

moold commented 3 years ago
  1. How many NGS reads / long reads can be mapped to the assembly?
  2. Have you tried any other assemblers? What are their assembly sizes?
  3. Could you try sgs_options = -max_depth 100 -bwa for NextPolish?
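For question 1, a common route is to align with bwa mem (short reads) or minimap2 (long reads) and read the mapped fraction off a samtools flagstat report. A sketch of pulling the number out of the report text (the flagstat output below is a made-up sample, not from this dataset):

```python
import re

# Hypothetical excerpt of a `samtools flagstat` report; the "mapped"
# line carries the count and percentage being asked about.
flagstat = """\
31000000 + 0 in total (QC-passed reads + QC-failed reads)
30070000 + 0 mapped (97.00% : N/A)
"""

m = re.search(r"(\d+) \+ \d+ mapped \(([\d.]+)%", flagstat)
mapped_reads, mapped_pct = int(m.group(1)), float(m.group(2))
```

A mapping rate well below ~95% for the short reads would point at missing or collapsed sequence in the assembly rather than a polishing problem.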
LeoCao-X commented 3 years ago
  1. I have not mapped the NGS/long reads to the assembly yet. I will try mapping the short reads to the assembly with bwa.
  2. I have tried wtdbg2 with default parameters; the results are below. All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

| Metric | Value |
| ------ | ----- |
| # contigs (>= 0 bp) | 19621 |
| # contigs (>= 1000 bp) | 19621 |
| # contigs (>= 5000 bp) | 19406 |
| # contigs (>= 10000 bp) | 13914 |
| # contigs (>= 25000 bp) | 8058 |
| # contigs (>= 50000 bp) | 3787 |
| Total length (>= 0 bp) | 841167373 |
| Total length (>= 1000 bp) | 841167373 |
| Total length (>= 5000 bp) | 840213028 |
| Total length (>= 10000 bp) | 801189177 |
| Total length (>= 25000 bp) | 706046327 |
| Total length (>= 50000 bp) | 554939331 |
| # contigs | 19621 |
| Largest contig | 6804176 |
| Total length | 841167373 |
| GC (%) | 34.41 |
| N50 | 88966 |
| N90 | 16854 |
| L50 | 1721 |
| L90 | 10524 |
| N's per 100 kbp | 0.00 |

I used NextPolish to refine this assembly as well, and ran BUSCO on the polished result.

BUSCO version is: 4.1.3

The lineage dataset is: insecta_odb10 (Creation date: 2019-11-20, number of species: 75, number of BUSCOs: 1367)

Results:

C:73.2%[S:73.2%,D:0.0%],F:12.4%,M:14.4%,n:1367
1001 Complete BUSCOs (C)
1001 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
169 Fragmented BUSCOs (F)
197 Missing BUSCOs (M)
1367 Total BUSCO groups searched

  3. I have tried running NextPolish with the bwa mode before; the BUSCO results are:

Results:

C:81.1%[S:81.1%,D:0.0%],F:10.3%,M:8.6%,n:1367
1108 Complete BUSCOs (C)
1108 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
141 Fragmented BUSCOs (F)
118 Missing BUSCOs (M)
1367 Total BUSCO groups searched

moold commented 3 years ago

If you are sure your genome size is about 800 Mb, try the following nextgraph_options:

  1. -a 0 -A
  2. -a 0 -n 45
  3. -a 0 -I 0.5
  4. -a 0 -q 5
  5. -a 0 -N 1
  6. -a 0 -u 1
  7. -a 0 -k
  8. -a 0 -I 0.1
  9. -a 0 -G

You can cd to the directory 03.ctg_graph/01.ctg_graph.sh.work/ctg_graph0 and rerun nextgraph manually; each variant should run very fast.

After that, choose the best option set, update nextgraph_options in the config file, and rerun the main task (nextDenovo run.cfg). nextDenovo will back up your first assembly result and only rerun the assembly step.

LeoCao-X commented 3 years ago

I am not sure about the genome size. I ran kmergenie on the NGS short reads; it suggests the best k is 111 and a genome size of about 730 Mb.
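For reference, k-mer based estimators like kmergenie work from the k-mer abundance histogram: genome size is roughly the total number of non-error k-mers divided by the coverage at the homozygous peak. A toy illustration of that relationship (all numbers below are made up, not from this dataset):

```python
# Toy k-mer abundance histogram: depth -> number of distinct k-mers,
# with low-depth error k-mers and a coverage peak at depth 40.
hist = {1: 5_000_000, 2: 1_000_000, 39: 900_000, 40: 1_200_000, 41: 950_000}

# Ignore the low-depth error tail (cutoff of 5 is an arbitrary choice here)
peak_depth = max((d for d in hist if d > 5), key=lambda d: hist[d])
total_kmers = sum(d * c for d, c in hist.items() if d > 5)
genome_size = total_kmers // peak_depth
```

If two independent estimates (kmergenie vs. the wtdbg2 assembly span) disagree by ~100 Mb, heterozygosity or repeats are usually involved, which is why moold asks to confirm the size before tuning nextgraph_options.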