aquaskyline / SOAPdenovo2

Next generation sequencing reads de novo assembler.
GNU General Public License v3.0

Segmentation fault with scaff step when use different MEGAHIT's output #27

Closed YiweiNiu closed 7 years ago

YiweiNiu commented 7 years ago

Hi, I used MEGAHIT to assemble reads into contigs, and then used SOAPdenovo-fusion, SOAPdenovo-127mer map, and SOAPdenovo-127mer scaff to scaffold the contigs. But when I removed one library in the MEGAHIT step, I got a segmentation fault in the scaff step.

Case 1: the commands I used with all libraries:

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 38 --no-mercy -1 270B_R_1P.fastq,500B_R_1P.fastq,800B_R_1P.fastq,3k_1_R_1P.fastq,5k-1_R_1P.fastq,5k-2_R_1P.fastq,10k_R_1P.fastq -2 270B_R_2P.fastq,500B_R_2P.fastq,800B_R_2P.fastq,3k_1_R_2P.fastq,5k-1_R_2P.fastq,5k-2_R_2P.fastq,10k_R_2P.fastq

/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 63 -g k63 -c ../megahit_out/final.contigs.fa
/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k63
/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k63

Case 2: the commands I used after removing one library (270B):

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 500B_R_1P.fastq,800B_R_1P.fastq,3k_1_R_1P.fastq,5k-1_R_1P.fastq,5k-2_R_1P.fastq,10k_R_1P.fastq -2 500B_R_2P.fastq,800B_R_2P.fastq,3k_1_R_2P.fastq,5k-1_R_2P.fastq,5k-2_R_2P.fastq,10k_R_2P.fastq -o megahit_out.no270

/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 63 -g k63_1 -c ../megahit_out.no270/final.contigs.fa
/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k63_1
/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k63_1

The config file I used was the same in both cases:

#maximal read length
max_rd_len=151
[LIB]
avg_ins=500
reverse_seq=0
asm_flags=2
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=500B_R1.fastq
q2=500B_R2.fastq
[LIB]
#average insert size
avg_ins=800
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=2
#in which order the reads are used while scaffolding
rank=3
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=800B_R1.fastq
q2=800B_R2.fastq
[LIB]
avg_ins=3000
reverse_seq=1
asm_flags=2
rank=3
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=4
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=3k_1_R1.fastq
q2=3k_1_R2.fastq
[LIB]
avg_ins=5000
reverse_seq=1
asm_flags=2
rank=4
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=5k-1_R1.fastq
q2=5k-1_R2.fastq
[LIB]
avg_ins=5000
reverse_seq=1
asm_flags=2
rank=4
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=5k-2_R1.fastq
q2=5k-2_R2.fastq
[LIB]
avg_ins=10000
reverse_seq=1
asm_flags=2
rank=5
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=10k_R1.fastq
q2=10k_R2.fastq

The message I got when I removed one library:

All contigs loaded.
Mapping & Scaffolding module.
[main]Data prepare mode selected .

Version 2.04: released on July 13th, 2012
Compile Mar 19 2017 10:58:55

********************
Map
********************

Parameters: map -s config -p 40 -g k63_1 

Kmer size: 63.
Contig length cutoff: 65.

782765 contig(s), maximum sequence length 61533, minimum sequence length 200, maximum name length 10.
Time spent on parsing contigs file: 1s.
40 thread(s) initialized.
Time spent on hashing contigs: 43s.
546166365 node(s) allocated, 630731530 kmer(s) in contigs, 630731530 kmer(s) processed.
Time spent on graph construction: 44s.

Time spent on aligning long reads: 0s.

In file: config, max seq len 151, max name len 256
40 thread(s) initialized.
1565513 edge(s) in the graph.
Import reads from file:
 500B_R1.fastq
Import reads from file:
 500B_R2.fastq
Current insert size is 500, map_len is 32.
--- 100000000th reads.
Import reads from file:
 800B_R1.fastq
Import reads from file:
 800B_R2.fastq
Current insert size is 800, map_len is 32.
--- 200000000th reads.
Import reads from file:
 3k_1_R1.fastq
Import reads from file:
 3k_1_R2.fastq
Current insert size is 3000, map_len is 35.
--- 300000000th reads.
Import reads from file:
 5k-1_R1.fastq
Import reads from file:
 5k-1_R2.fastq
Current insert size is 5000, map_len is 35.
--- 400000000th reads.
--- 500000000th reads.
Import reads from file:
 5k-2_R1.fastq
Import reads from file:
 5k-2_R2.fastq
Current insert size is 5000, map_len is 35.
--- 600000000th reads.
Import reads from file:
 10k_R1.fastq
Import reads from file:
 10k_R2.fastq
Current insert size is 10000, map_len is 35.
--- 700000000th reads.

Total reads         776836032
Reads in gaps       183984010
Ratio               23.7%
Reads on contigs    420553855
Ratio               54.1%
6 pe insert size, the largest boundary is 776836032.

LIB(s) information:
 [LIB] 0, avg_ins 500, reverse 0.
 [LIB] 1, avg_ins 800, reverse 0.
 [LIB] 2, avg_ins 3000, reverse 1.
 [LIB] 3, avg_ins 5000, reverse 1.
 [LIB] 4, avg_ins 5000, reverse 1.
 [LIB] 5, avg_ins 10000, reverse 1.
Time spent on aligning reads: 6541s.

Overall time spent on alignment: 109m.

Version 2.04: released on July 13th, 2012
Compile Mar 19 2017 10:58:55

********************
Scaff
********************

Parameters: scaff -p 40 -g k63_1 

k63_1.Arc: no such file or empty file!

There are 6 grad(s), 776836032 read(s), max read len 151.
Kmer size: 63
There are 1565513 edge(s) in edge file.
Mask contigs with coverage lower than 0.3 or higher than 6.0, and strict length 0.
Average contig coverage is 3, 0 contig(s) masked.
Mask contigs shorter than 65, 0 contig(s) masked.
0 arc(s) loaded, average weight is 0.
/opt/gridview//pbs/dispatcher/mom_priv/jobs/23301.admin.SC: line 19: 72517 Segmentation fault (core dumped) /home/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k63_1

I also tried running with and without '-F', and still got the error.

Best regards, Yiwei Niu

aquaskyline commented 7 years ago

MEGAHIT and SOAPdenovo2 are two separate pieces of software; changing MEGAHIT's parameters won't affect its compatibility with SOAPdenovo2. I can't be sure from the limited information you provided, but the problem might be files being incorrectly shared between the two experiments you carried out. I suggest rerunning the experiment with slightly different parameters, such as p32 and k61. Also, using mate-pair libraries (insert size >1k) to generate contigs is not recommended; you might want to use only the 270, 500 and 800 libraries in MEGAHIT.
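The scaff log in the report above prints `k63_1.Arc: no such file or empty file!`, so one quick way to rule out missing or mixed-up intermediate files is to check the files behind an output prefix before running `scaff`. The following is a minimal sketch, not part of SOAPdenovo2: the helper name `empty_or_missing` is hypothetical, and the only suffix taken from this thread is `.Arc`; any other suffix you pass in is your own assumption about what your run should have produced.

```python
from pathlib import Path


def empty_or_missing(prefix, suffixes):
    """Return the suffixes whose file `prefix + suffix` is missing or has size 0.

    `prefix` is the value given to -g (for example "k63_1").
    `suffixes` is a caller-supplied list; this sketch does not know which
    files a given SOAPdenovo2 step is supposed to write.
    """
    problems = []
    for suffix in suffixes:
        path = Path(prefix + suffix)
        # A path that is not a regular file, or a zero-byte file, is reported.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(suffix)
    return problems


# Example call (".Arc" is the file named in the log above):
#   for suffix in empty_or_missing("k63_1", [".Arc"]):
#       print(f"k63_1{suffix}: missing or empty")
```

Running the two experiments in separate working directories with distinct `-g` prefixes, and checking each prefix this way, would make accidental file sharing between them easier to spot.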

ohmiya commented 7 years ago

I ran into a segmentation fault in the scaffolding step, too. I recompiled SOAPdenovo2 with the -g option and ran it again. Any help is appreciated.

My command is here:

/home/ohmiya/tools/SOAPdenovo2/SOAPdenovo2-master/SOAPdenovo-127mer all -K 127 -s /home/ohmiya/soap.config -R -o /home/ohmiya/soap_out

My config file is here:

max_rd_len=300
[LIB]
avg_ins=500
reverse_seq=0
asm_flag=3
rank=1
q1=/sshare1/home/ohmiya/Forward.fastq
q2=/sshare1/home/ohmiya/Reverse.fastq
q=/sshare1/home/ohmiya/all.fastq

Stderr message in the scaffolding step is here:


Scaff


Parameters: scaff -g /home/ohmiya/soap_out

Files for scaffold construction are OK.

There are 1 grad(s), 4838366 read(s), max read len 300.
Kmer size: 127
There are 101094 edge(s) in edge file.
Mask contigs with coverage lower than 0.9 or higher than 18.0, and strict length 0.
Average contig coverage is 9, 14518 contig(s) masked.
Mask contigs shorter than 129, 9924 contig(s) masked.
42662 arc(s) loaded, average weight is 6.
50547 contig(s) loaded.
Done loading updated edges.
Time spent on loading updated edges: 0s.

File /home/ohmiya/data/soap_out/ensemble_soap.links exists, skip creating the links...
Time spent on loading paired-end reads information: 0s.


Start to construct scaffolds.


For insert size: 500
 Total PE links 54278
 PE links to masked contigs 43026
 On same scaffold PE links 0
Cutoff of PE links to make a reliable connection: 3
 Active connections 22496
 Weak connections 15344
 Weak ratio 68.2%
390 circles removed.
Start to remove transitive connection.
Total contigs 101094
Masked contigs 26002
Remained contigs 75092
None-outgoing-connection contigs 71799 (95.614716%)
Single-outgoing-connection contigs 3124
Multi-outgoing-connection contigs 4
Cycle 1
 Two-outgoing-connection contigs 165
 Potential transitive connections 1
 Transitive connections 1
 Transitive ratio 0.6%
Cycle 2
 Two-outgoing-connection contigs 164
 Potential transitive connections 0
 Transitive connections 0
 Transitive ratio 0.0%
Start to linearize sub-graph.
 Picked sub-graphs 135
 Connection-conflict 0
 Significant overlapping 116
 Eligible 0
 Bubble structures 1
Mask repeats:
 Puzzles 118
 Masked contigs 114
Start to remove transitive connection.
Total contigs 101094
Masked contigs 26232
Remained contigs 74862
None-outgoing-connection contigs 71898 (96.040710%)
Single-outgoing-connection contigs 2960
Multi-outgoing-connection contigs 0
Cycle 1
 Two-outgoing-connection contigs 4
 Potential transitive connections 0
 Transitive connections 0
 Transitive ratio 0.0%
Start to linearize sub-graph.
 Picked sub-graphs 1
 Connection-conflict 0
 Significant overlapping 1
 Eligible 0
 Bubble structures 0
Non-strict linearization.
Start to linearize sub-graph.
 Picked sub-graphs 1
 Connection-conflict 0
 Significant overlapping 0
 Eligible 0
 Bubble structures 0
Start to mask puzzles.
 Masked contigs 3
 Remained puzzles 0
Segmentation error

Content of the core file is here:

Core was generated by `/home/ohmiya/tools/SOAPdenovo2/SOAPdenovo2-master/SOAPdenovo-127mer all -K 127'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000044a1d8 in validConnect (ctg=101124, preCNT=0x0) at orderContig.c:890
890 if ( !cn_temp->deleted && !cn_temp->mask )
warning: File "/usr/local/lib64/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/home/ohmiya/go/src/runtime/runtime-gdb.py".
To enable execution of this file add
 add-auto-load-safe-path /usr/local/lib64/libstdc++.so.6.0.20-gdb.py
line to your configuration file "/home/ohmiya/.gdbinit".
To completely disable this security protection add
 set auto-load safe-path /
line to your configuration file "/home/ohmiya/.gdbinit".
For more information about this security protection see the "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
 info "(gdb)Auto-loading safe path"
(gdb) where
#0 0x000000000044a1d8 in validConnect (ctg=101124, preCNT=0x0) at orderContig.c:890
#1 0x000000000044a43c in getNextContig (ctg=101125, preCNT=0x2965d58, exception=0x7fff5142a237 "") at orderContig.c:969
#2 0x000000000044e9a2 in freezing () at orderContig.c:3136
#3 0x000000000044cc13 in ordering (deWeak=1 '\001', downS=0 '\000', nonlinear=1 '\001', infile=0x74ac80 <graphfile> "/sshare1/home/ohmiya/data/project/P249_1707_Metagenome.TLR3KO_DNA_virus/d249_04_ensemble.SAVacC_kmax_cdhit.mix.KO_pel_9w.qcl/soap_out/ensemble_soap") at orderContig.c:2390
#4 0x000000000045908d in Links2Scaf (infile=0x74ac80 <graphfile> "/sshare1/home/ohmiya/data/project/P249_1707_Metagenome.TLR3KO_DNA_virus/d249_04_ensemble.SAVacC_kmax_cdhit.mix.KO_pel_9w.qcl/soap_out/ensemble_soap") at orderContig.c:5903
#5 0x00000000004805d8 in call_scaffold (argc=3, argv=0x7fff5142a7f0) at scaffold.c:83
#6 0x0000000000444e13 in pipeline (argc=8, argv=0x7fff5142aa60) at main.c:542
#7 0x000000000044393e in main (argc=8, argv=0x7fff5142aa60) at main.c:96

aquaskyline commented 7 years ago

@ohmiya It seems that you only have one paired-end library, with a 500 bp insert size. With only that, scaffolding will not increase the contiguity of your assembly.

ohmiya commented 7 years ago

In addition to the paired-end reads, I have a single-end library; in my config file it is q=/sshare1/home/ohmiya/all.fastq. Why doesn't scaffolding work with only a paired-end library, even though the assembly is performed before it?

aquaskyline commented 7 years ago

Scaffolding requires mate-pairs with insert size ≥1kbp. Paired-end reads can be used for scaffolding, but the improvement won't be significant.
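To see at a glance whether a config contains any library that meets this threshold, something like the sketch below could be used. It is only an illustration: `parse_soap_config` and `mate_pair_libs` are hypothetical helper names, the parser handles just the `[LIB]` / `key=value` / `#` comment layout seen in the configs posted in this thread, and the 1000 bp cutoff simply encodes the "≥1kbp" figure above.

```python
def parse_soap_config(text):
    """Parse SOAPdenovo-style config text into a list of per-library dicts.

    Only lines after a [LIB] header are collected; global settings such as
    max_rd_len are ignored. If a key repeats within a library, the last
    value wins, which is enough for reading avg_ins.
    """
    libs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "[LIB]":
            current = {}
            libs.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return libs


def mate_pair_libs(libs, min_insert=1000):
    """Return the libraries whose avg_ins is at least min_insert (in bp)."""
    return [lib for lib in libs if int(lib.get("avg_ins", 0)) >= min_insert]


# The single-library config posted earlier in this thread:
example = """max_rd_len=300
[LIB]
avg_ins=500
reverse_seq=0
asm_flag=3
rank=1
"""
print(len(mate_pair_libs(parse_soap_config(example))))  # prints 0
```

A result of 0 means the config has no library with an insert size of 1 kbp or more, which is the situation described for the 500 bp paired-end library above.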