Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

nextdenovo+wtdbg2 #68

Closed: fengyuanli304 closed this issue 3 years ago

fengyuanli304 commented 4 years ago

Hi, I tried to assemble the genome using NextDenovo, and the results are listed below.

Type    Length (bp)   Count (#)
N10     3673073       98
N20     2464102       258
N30     1741880       492
N40     1280863       815
N50     980011        1242
N60     792456        1787
N70     646695        2452
N80     509421        3282
N90     377582        4365

Min.    53312         -
Max.    10006971      -
Ave.    780287        -
Total   4774575821    6119

[WARNING] 2020-04-16 02:09:44 Unfinished assembly, this is a limited version, currently only supports assembly for genome size < 3500000000 bp, please ask for help.
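For readers unfamiliar with the Nx rows in the table above, the following is a minimal Python sketch of how such values are conventionally computed from a list of contig lengths. The function name `nx_stats` and the toy lengths are illustrative assumptions; this is not NextDenovo's own code.

```python
def nx_stats(lengths, fraction):
    """Return (Nx length, contig count) for a fraction, e.g. 0.5 for N50."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running sum past the target
        # defines the Nx length; its rank is the matching count.
        if running >= target:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy data (hypothetical), total length 300.
toy_contigs = [100, 80, 60, 40, 20]
n50, n50_count = nx_stats(toy_contigs, 0.5)
n90, n90_count = nx_stats(toy_contigs, 0.9)
print(n50, n50_count)
print(n90, n90_count)
```

Read this way, the N50 row above says that the 1242 longest contigs, each at least 980011 bp, together cover half of the assembled bases.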

The genome size is about 2.9 Gb. The genome features high levels of repetitive sequence (60.05%) and heterozygosity (1.3%).

So I tried to assemble the genome using NextDenovo (correct) + wtdbg2, but I have no experience with them. First, I ran wtdbg2 on the raw reads:

wtdbg2 -x sq -g 2.9g -A -S 4 --node-drop 0.20 --node-len 2304 --node-max 200 -s 0.05 -e 3 -i ./merge.fastq.gz -t 20 -fo ./presix --rescue-low-cov-edges --no-read-length-sort --aln-dovetail 9216

These parameters gave the following stats:

Estimated: TOT 3897418752, CNT 18137, AVG 214888, MAX 11144960, N50 1269248, L50 788, N90 87296, L90 5730, Min 5632

Then I ran wtdbg2 on the corrected reads:

wtdbg2 -x ccs -g 2.9g -A -S 4 --node-drop 0.20 --node-len 2304 --node-max 200 -s 0.05 -t 10 -i ./merge_correct.fasta -fo ./preten --rescue-low-cov-edges --no-read-length-sort --aln-dovetail 9216

These parameters gave the following stats:

Estimated: TOT 4840935424, CNT 65790, AVG 73582, MAX 2623488, N50 115712, L50 9736, N90 36352, L90 42063, Min 4864

So is wtdbg2 alone much better than NextDenovo (correct) + wtdbg2? Which parameters should I adjust to get a better assembly? Thank you very much.
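To compare the two wtdbg2 runs above side by side, the "Estimated" stat lines can be parsed into dictionaries. The sketch below is a small helper for that purpose only; `parse_stats` is a hypothetical name, and the regular expression simply assumes the "KEY value" layout seen in the two lines quoted in this post.

```python
import re


def parse_stats(line):
    """Turn 'TOT 123, CNT 45, ...' into {'TOT': 123, 'CNT': 45, ...}."""
    return {key: int(value) for key, value in re.findall(r"(\w+) (\d+)", line)}


raw = parse_stats(
    "TOT 3897418752, CNT 18137, AVG 214888, MAX 11144960, "
    "N50 1269248, L50 788, N90 87296, L90 5730, Min 5632"
)
corrected = parse_stats(
    "TOT 4840935424, CNT 65790, AVG 73582, MAX 2623488, "
    "N50 115712, L50 9736, N90 36352, L90 42063, Min 4864"
)

# Report how the corrected-read run compares with the raw-read run.
for key in ("TOT", "CNT", "N50"):
    print(key, raw[key], corrected[key], round(raw[key] / corrected[key], 2))
```

On these numbers the raw-read run has roughly an 11-fold higher N50 and about a quarter as many contigs, which is the gap the question is about.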

moold commented 4 years ago
  1. If you want to assemble with NextDenovo (correct) + wtdbg2, please see #52 and use the -b option.
  2. For a highly heterozygous genome, you can try the next version, which I will release in a few weeks; it usually produces much better results than the previous version.
  3. No software fits all genomes unless you know how to adjust the parameters, so it is always good to try a few more assemblers.

BTW, do not use -x ccs for error-corrected reads.

fengyuanli304 commented 4 years ago

Thanks for the reply. I will try to use correction_options = -b.

fengyuanli304 commented 4 years ago

Hi, I have merged all files in 02.cns_align/01.get_cns.sh.work/get_cns/cns.fasta. However, the corrected file (70G) is smaller than the previous one (600G). Is this normal? Data were corrected with the following parameters:

Suggested length cutoff of reads (genome size: 2900000000, expected seed depth: 30) to be corrected: 45747 bp

[General]
job_type = local
job_prefix = nextDenovocorrect
task = correct # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 4
input_type = raw
input_fofn = ./input.fofn
workdir = ./

[correct_option]
read_cutoff = 2k
seed_cutoff = 45k
blocksize = 3g
pa_correction = 4
seed_cutfiles = 4
sort_options = -m 25g -t 5 -k 30
minimap2_options_raw = -x ava-pb -t 5
correction_options = -b

[assemble_option]
random_round = 20
minimap2_options_cns = -x ava-pb -t 5 -k17 -w17
nextgraph_options = -a 1

Best wishes
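The "suggested length cutoff" quoted in this post depends on the genome size and the expected seed depth. One common way to derive such a cutoff, sketched below, is to take the longest reads until they add up to genome size × seed depth and use the shortest selected read as the threshold. This is an assumption about the general approach, not a reproduction of NextDenovo's code; `seed_cutoff` and the toy numbers are hypothetical.

```python
def seed_cutoff(read_lengths, genome_size, seed_depth):
    """Length of the shortest read needed to reach genome_size * seed_depth."""
    target = genome_size * seed_depth
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None  # not enough data to reach the requested depth


# Toy data (hypothetical): six reads, genome size 10.
reads = [50, 40, 30, 20, 10, 10]
print(seed_cutoff(reads, genome_size=10, seed_depth=9))
print(seed_cutoff(reads, genome_size=10, seed_depth=12))
print(seed_cutoff(reads, genome_size=10, seed_depth=20))
```

Under this model a higher seed depth pulls in shorter reads, so the cutoff drops, which is consistent with the configs in this thread moving from seed_cutoff = 45k to 36k.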

moold commented 4 years ago

NO, you should delete file 01.seed_cns.sh.done and seed_cns*/nextDenovocorrect.sh.done and rerun. BTW, seed depth: 30 is not enough, it is better to set it to 40-45.
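The cleanup step described above can be scripted. The sketch below only lists the marker files by default (dry run) and deletes them when asked. Two things are assumptions: the directory layout `02.cns_align/01.seed_cns.sh.work/seed_cns*/`, which is taken from a log line later in this thread, and the idea that the per-task marker is named after `job_prefix`. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_done_markers(workdir, job_prefix):
    """Collect the seed_cns .done markers named in the reply above."""
    cns_dir = Path(workdir) / "02.cns_align"
    markers = list(cns_dir.glob("01.seed_cns.sh.done"))
    markers += cns_dir.glob(f"01.seed_cns.sh.work/seed_cns*/{job_prefix}.sh.done")
    return sorted(markers)


def remove_markers(markers, dry_run=True):
    """Print each marker; delete it only when dry_run is False."""
    for marker in markers:
        print(("would remove " if dry_run else "removing ") + str(marker))
        if not dry_run:
            marker.unlink()


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "02.cns_align" / "01.seed_cns.sh.work" / "seed_cns0"
    task_dir.mkdir(parents=True)
    (task_dir / "nextDenovocorrect.sh.done").touch()
    (Path(tmp) / "02.cns_align" / "01.seed_cns.sh.done").touch()
    found = find_done_markers(tmp, "nextDenovocorrect")
    remove_markers(found)  # dry run: nothing is deleted
    demo_names = [path.name for path in found]
```

Keeping the dry run as the default makes it easy to check the matched files before anything is removed, since file names differ between NextDenovo versions.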

fengyuanli304 commented 4 years ago

Hi, thank you for your reply. I deleted 02.cns_align/01.get_cns.sh.done and 02.cns_align/01.get_cns.sh.work/get_cns* and reran the whole pipeline. Next, I will try a seed depth of 45.

fengyuanli304 commented 4 years ago

Hi, I am sorry to bother you again. I tried seed depth = 45. However, the assembly from corrected reads is worse than the assembly from raw reads. The genome was assembled with the following command:

wtdbg2 -x sq -g 2.9g -A -S 4 --node-drop 0.20 --node-len 2304 --node-max 200 -s 0.5 -i ./merge_correct1k.fasta -t 20 -fo ./prethi --rescue-low-cov-edges --no-read-length-sort --aln-dovetail 9216

Could you give some suggestions on this genome assembly? Thanks

moold commented 4 years ago

Hi, you can try to re-assemble with the latest version of NextDenovo. Paste your config file here before you start the run, so I can help check it for errors. You can also try assembling with NextDenovo + smartdenovo.

fengyuanli304 commented 4 years ago

Hi, thank you for your reply. I will try to assemble with NextDenovo2.3.0. My config file:

[General]
job_type = local
job_prefix = nextDenovo2.3.0
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 4
input_type = raw
input_fofn = ./input.fofn
workdir = ./

[correct_option]
read_cutoff = 1k
seed_cutoff = 36k
blocksize = 3g
pa_correction = 4
seed_cutfiles = 4
sort_options = -m 25g -t 5 -k 30
minimap2_options_raw = -x ava-pb -t 5
correction_options = -p 15

[assemble_option]
random_round = 20
minimap2_options_cns = -x ava-pb -t 5 -k17 -w17
nextgraph_options = -a 1

Best wishes.

moold commented 4 years ago

How many raw data/computer resources do you have?

fengyuanli304 commented 4 years ago

Memory: 512G, storage space: 64T
My raw data (gz): 69549900 KB

moold commented 4 years ago

Please provide the CPU count and the raw base count, not the file size.

fengyuanli304 commented 4 years ago

Thank you. 104 cpu and 244,238,273,438 bp (total bases)
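The base count matters because it gives the sequencing depth directly, which the file size does not. A short worked calculation with the figures from this thread (2.9 Gb genome, seed depth 45 as suggested earlier):

```python
total_bases = 244_238_273_438   # raw base count reported above
genome_size = 2_900_000_000     # about 2.9 Gb, as stated in the first post

depth = total_bases / genome_size
print(f"sequencing depth: {depth:.1f}x")

# Bases needed as seeds at a seed depth of 45, and their share of the data.
seed_depth = 45
seed_bases = seed_depth * genome_size
print(f"seed bases: {seed_bases:,} ({seed_bases / total_bases:.0%} of raw data)")
```

This gives roughly 84x coverage, so a seed depth of 45 uses a little over half of the raw bases as seeds.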

moold commented 4 years ago

[General]
job_type = local
job_prefix = nextDenovo2.3.0
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 8
input_type = raw
input_fofn = ./input.fofn
workdir = ./

[correct_option]
read_cutoff = 1k
seed_cutoff = 36k
blocksize = 3g
pa_correction = 4
seed_cutfiles = 4
sort_options = -m 80g -t 25 -k 45
minimap2_options_raw = -x ava-pb -t 12
correction_options = -p 20

[assemble_option]
minimap2_options_cns = -x ava-pb -t 12 -k17 -w17
nextgraph_options = -a 1

fengyuanli304 commented 4 years ago

Thank you very much. I will do it as you advised.
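One way to sanity-check a config like the one above against the machine is to multiply `parallel_jobs` by the minimap2 thread count and compare the product with the available CPUs. The sketch below does that with Python's standard `configparser`. Treating `parallel_jobs × -t` as the peak thread demand is a simplifying assumption, and `threads_of` is a hypothetical helper.

```python
import configparser
import shlex

# Excerpt of the suggested config; inline "# ..." comments are kept on purpose.
CONFIG = """
[General]
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
parallel_jobs = 8

[correct_option]
sort_options = -m 80g -t 25 -k 45
minimap2_options_raw = -x ava-pb -t 12
"""


def threads_of(options):
    """Return the integer that follows '-t' in an option string."""
    tokens = shlex.split(options)
    return int(tokens[tokens.index("-t") + 1])


parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(CONFIG)

jobs = parser.getint("General", "parallel_jobs")
mm2_threads = threads_of(parser.get("correct_option", "minimap2_options_raw"))
peak_threads = jobs * mm2_threads
available_cpus = 104  # CPU count reported earlier in the thread

print(peak_threads, peak_threads <= available_cpus)
```

With 8 parallel jobs at 12 threads each, the alignment step would ask for 96 threads, which fits within the 104 CPUs reported.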

fengyuanli304 commented 3 years ago

Hi, the program has been running for two months. Could I move the run to another node (memory: 1T, 240 CPUs) and restart it? P.S. I tried nextDenovo v2.1-beta.0 with the following config:

[General]
job_type = local
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 4
input_type = raw
input_fofn = ./input.fofn
workdir = /home/lfy/software/NextDenovo/toxeus

[correct_option]
read_cutoff = 10k
seed_cutoff = 53636
blocksize = 1g
pa_correction = 4
seed_cutfiles = 4
sort_options = -m 25g -t 5 -k 50
minimap2_options_raw = -x ava-pb -t 5
correction_options = -p 5

[assemble_option]
random_round = 10
minimap2_options_cns = -x ava-pb -t 5 -k17 -w17
nextgraph_options = -a 1

The genome size was estimated to be 2.91 Gb, but the total assembly length is 4.77 Gb. High heterozygosity may be the cause.
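The size mismatch mentioned above can be put in numbers. The sketch below uses the total length from the first post (4,774,575,821 bp) and the 2.91 Gb estimate; `size_ratio` is a hypothetical helper name.

```python
def size_ratio(assembled_bp, estimated_bp):
    """How many times larger the assembly is than the genome size estimate."""
    return assembled_bp / estimated_bp


assembled_bp = 4_774_575_821   # "Total" from the first post
estimated_bp = 2_910_000_000   # 2.91 Gb estimate quoted above

ratio = size_ratio(assembled_bp, estimated_bp)
excess_bp = assembled_bp - estimated_bp
print(f"{ratio:.2f}x the estimate, {excess_bp / 1e9:.2f} Gb extra")
```

An assembly about 1.6 times the estimated size is often read as a sign that divergent haplotypes were assembled as separate contigs, which matches the heterozygosity explanation offered in the post.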

moold commented 3 years ago

Yes, you should change -t in the minimap2 options and parallel_jobs to maximize CPU usage.

LeoCao-X commented 3 years ago

Hi, Dr. Hu, I am trying to use NextDenovo to generate corrected reads with the method mentioned in #52. I have run NextDenovo without the -b option successfully before, but I could not find the 02.cns_align/01.get_cns.sh.done and 02.cns_align/01.get_cns.sh.work/get_cns*/nextDenovo.sh.done files.
I ran NextDenovo with the following parameters:

job_type = local
job_prefix = nextDenovo_20200915_V1
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 20
input_type = raw
input_fofn = ./input.fofn
workdir = ./20200915_V1

[correct_option]
read_cutoff = 2k
seed_cutoff = 26084
blocksize = 3g
pa_correction = 8
seed_cutfiles = 8
sort_options = -m 200g -t 10 -k 45
minimap2_options_raw = -x ava-ont -t 10
correction_options = -p 10

[assemble_option]
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1

The genome size is about 600 Mb and the sequencing data size is 47G. What other information should I provide?

LeoCao-X commented 3 years ago

> NO, you should delete file 01.seed_cns.sh.done and seed_cns*/nextDenovocorrect.sh.done and rerun. BTW, seed depth: 30 is not enough, it is better to set it to 40-45.

In #52, you suggested deleting 02.cns_align/01.get_cns.sh.done and 02.cns_align/01.get_cns.sh.work/get_cns*/nextDenovo.sh.done, but in this issue you suggest deleting different files. Is that because of an update to NextDenovo? I want to make sure I am on the right track. Thanks a lot.

moold commented 3 years ago

Yes

LeoCao-X commented 3 years ago

> Yes

Thanks for your reply. I have tried the suggested method, but I hit a new error. Can you help me figure out how to solve it?

[INFO] 2020-09-20 00:07:43,011 seed_cns finished, and final corrected reads file:
[INFO] 2020-09-20 00:07:43,011 /data/rawdata/genome_assembly/./20200915_V1_repeat/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
[INFO] 2020-09-20 00:07:43,323 analysis tasks done
[INFO] 2020-09-20 00:07:43,377 skip step: cns_align
[INFO] 2020-09-20 00:07:43,430 analysis tasks done
[INFO] 2020-09-20 00:07:43,431 skip step: ctg_graph
[INFO] 2020-09-20 00:07:44,308 analysis tasks done
[INFO] 2020-09-20 00:07:44,309 skip step: ctg_align
[INFO] 2020-09-20 00:07:44,740 analysis tasks done
[INFO] 2020-09-20 00:07:44,741 skip step: ctg_cns
[ERROR] 2020-09-20 00:07:44,742 Failed to find output file pattern for task: /data/rawdata/genome_assembly/20200915_V1_repeat/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns6/nextDenovo_20200918_V1_correct.sh
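When a long run produces thousands of such lines, it can help to filter the log by level. The following sketch assumes only the `[LEVEL] date time message` layout visible in the log pasted above; `messages_at` is a hypothetical helper and the sample is three lines copied from that log.

```python
import re

# [LEVEL] YYYY-MM-DD HH:MM:SS(,mmm) message  (milliseconds are optional)
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\] "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:,\d{3})?) "
    r"(?P<message>.*)$"
)


def messages_at(log_text, level):
    """Return the messages of all log lines with the given level."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == level:
            found.append(match.group("message"))
    return found


SAMPLE = """\
[INFO] 2020-09-20 00:07:44,740 analysis tasks done
[INFO] 2020-09-20 00:07:44,741 skip step: ctg_cns
[ERROR] 2020-09-20 00:07:44,742 Failed to find output file pattern for task: /data/rawdata/genome_assembly/20200915_V1_repeat/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns6/nextDenovo_20200918_V1_correct.sh
"""

for message in messages_at(SAMPLE, "ERROR"):
    print(message)
```

In the log above, every step after seed_cns is reported as skipped before the single ERROR line, which is the part worth quoting when asking for help.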

moold commented 3 years ago

You can try to use the old version, not the latest version. The new version has changed a lot.

fengyuanli304 commented 3 years ago

Hi, I ran nextdenovo2.3.0 for a few days and hit the following error:

[ERROR] 2020-10-13 07:12:59,063 ctg_align failed: please check the following logs:
[ERROR] 2020-10-13 07:12:59,075 /lustre2/nextdenovonewresult/03.ctg_graph/02.ctg_align.sh.work/ctg_align5/nextDenovo2.3.0.sh.e

The error log file (nextDenovo2.3.0.sh.e):

hostname

Genome characteristics: genome size 2.9G, heterozygous rate 1.30%, repeat content 60.05%

Input data: total base count 244,238,273,438 bp, sequencing depth 84, average/N50 read length 37,906

Operating system: CentOS Linux release 7.6.1810

GCC: 4.8.5 20150623 (Red Hat 4.8.5-36)

Python: python2.7.18

NextDenovo: nextdenovo2.3.0

Can you help me figure out how to solve it? Thank you.

moold commented 3 years ago

If you want to ask for help with questions not related to this topic, please open a new issue.

fengyuanli304 commented 3 years ago

Hi, I re-assembled the genome with NextDenovo2.3.0, but the BUSCO score is low. Could you give me some suggestions? Parameters:

[General]
job_type = local
job_prefix = nextDenovo2.3.0
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 8
input_type = raw
input_fofn = ./input.fofn
workdir = ./

[correct_option]
read_cutoff = 1k
seed_cutoff = 36k
blocksize = 3g
pa_correction = 4
seed_cutfiles = 4
sort_options = -m 80g -t 25 -k 45
minimap2_options_raw = -x ava-pb -t 12
correction_options = -p 20

[assemble_option]
minimap2_options_cns = -x ava-pb -t 12 -k17 -w17
nextgraph_options = -a 1

Results:

Type    Length (bp)   Count (#)
N10     9524294       28
N20     7278582       71
N30     5400301       128
N40     4361970       200
N50     3434731       290
N60     2824318       403
N70     2238539       540
N80     1649146       722
N90     962337        995

Min.    32875         -
Max.    18687322      -
Ave.    1966545       -
Total   3502417316    1781

Busco INFO:

|Results from dataset arachnida_odb10        
--------------------------------------------------
|C:78.2%[S:53.9%,D:24.3%],F:0.8%,M:21.0%,n:2934   
|2293   Complete BUSCOs (C)                      
|1581   Complete and single-copy BUSCOs (S)       
|712    Complete and duplicated BUSCOs (D)       
|24     Fragmented BUSCOs (F)
|617    Missing BUSCOs (M)                       
|2934   Total BUSCO groups searched            
--------------------------------------------------

Next, I polished the assembly with nextPolish. Parameters:

[General]
job_type = local
job_prefix = nextPolish
task = 555121212
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./nd.asm.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 80 -bwa

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 10k -max_read_len 150k -max_depth 90
lgs_minimap2_options = -x map-pb

Results:

Type    Length (bp)   Count (#)
N10     9511552       28
N20     7272313       71
N30     5396198       128
N40     4360921       200
N50     3429793       290
N60     2821796       403
N70     2237830       540
N80     1648586       722
N90     960793        995

Min.    32664         -
Max.    18658276      -
Ave.    1963950       -
Total   3497795421    1781

Busco INFO:

|Results from dataset arachnida_odb10            
--------------------------------------------------
|C:79.2%[S:45.9%,D:33.3%],F:0.3%,M:20.5%,n:2934  
|2324   Complete BUSCOs (C)                      
|1348   Complete and single-copy BUSCOs (S)      
|976    Complete and duplicated BUSCOs (D)        
|10     Fragmented BUSCOs (F)
|600    Missing BUSCOs (M)                       
|2934   Total BUSCO groups searched             
--------------------------------------------------

Thank you.
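To see what polishing changed, the two BUSCO summary lines above can be parsed and compared category by category. The sketch below is a small helper for that comparison; `parse_busco` is a hypothetical name, and the regular expression assumes the one-line summary format shown in the two tables.

```python
import re

SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Extract the C/S/D/F/M percentages and n from a BUSCO summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


before = parse_busco("C:78.2%[S:53.9%,D:24.3%],F:0.8%,M:21.0%,n:2934")
after = parse_busco("C:79.2%[S:45.9%,D:33.3%],F:0.3%,M:20.5%,n:2934")

# Change in percentage points after polishing.
for key in "CSDFM":
    print(key, round(after[key] - before[key], 1))

# Cross-check one percentage against the raw counts in the table.
print(round(100 * 976 / 2934, 1))
```

Completeness rises by only one point after polishing, while about eight points move from single-copy to duplicated. Together with an assembly total of about 3.5 Gb against a 2.9 Gb estimate, the high duplicated fraction is consistent with the heterozygosity explanation raised earlier in the thread.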