aquaskyline / SOAPdenovo2

Next generation sequencing reads de novo assembler.
GNU General Public License v3.0

Config file #59

Closed SIlviaHinojosa closed 5 years ago

SIlviaHinojosa commented 5 years ago

Hi

I am trying to use multiple Illumina read sets to assemble a cactus genome, but I think I'm not writing the config file properly, since it seems that SOAP is only using the first pair of reads. Can you help me?

    #maximal read length
    max_rd_len=150
    [LIB]
    #average insert size of the library
    avg_ins=350
    #if sequences are forward-reverse or reverse-forward
    reverse_seq=0
    #in which part(s) the reads are used (only contigs, only scaffolds, both contigs and scaffolds, only gap closure)
    asm_flags=3
    #cut the reads to the given length
    rd_len_cutoff=100
    #in which order the reads are used while scaffolding
    rank=1
    #cutoff of pair number for a reliable connection (at least 3 for short insert size)
    pair_num_cutoff=3
    #minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
    map_len=32
    #Pair1
    q1=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_03_R1.fastq.gz
    q2=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_03_R2.fastq.gz
    #Pair2
    q1=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_04_R1.fastq.gz
    q2=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_04_R2.fastq.gz
    #Pair3
    q1=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_05_R1.fastq.gz
    q2=/users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_05_R2.fastq.gz
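One way to double-check a config like this before starting a long run is to parse it and list which read files sit under each [LIB] section. The following is only a minimal sketch in Python, not part of SOAPdenovo2: the function name `parse_soap_config`, the `FILE_KEYS` set, and the sample paths are assumptions made for illustration, and it assumes that lines starting with `#` are comments and that every q1/q2 line belongs to the most recent [LIB] line.

```python
# Hypothetical helper (not shipped with SOAPdenovo2): summarise a
# SOAPdenovo-style config to see which read files belong to which [LIB].

# Keys assumed to name read files in the config.
FILE_KEYS = {"q1", "q2", "f1", "f2", "q", "f", "p", "b"}


def parse_soap_config(text):
    """Return (global_opts, libs); each lib is a dict with a 'files' list."""
    global_opts, libs = {}, []
    current = global_opts
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # A [LIB] line opens a new library section.
        if line == "[LIB]":
            current = {"files": []}
            libs.append(current)
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line, ignore it
        key, value = key.strip(), value.strip()
        if key in FILE_KEYS:
            if current is global_opts:
                raise ValueError(f"{key} appears before any [LIB] line")
            current["files"].append((key, value))
        else:
            current[key] = value
    return global_opts, libs


# Made-up sample with two read pairs under a single [LIB]; paths are placeholders.
SAMPLE = """\
#maximal read length
max_rd_len=150
[LIB]
avg_ins=350
reverse_seq=0
asm_flags=3
rank=1
q1=/data/sample_03_R1.fastq.gz
q2=/data/sample_03_R2.fastq.gz
q1=/data/sample_04_R1.fastq.gz
q2=/data/sample_04_R2.fastq.gz
"""

global_opts, libs = parse_soap_config(SAMPLE)
print("global:", global_opts)
for i, lib in enumerate(libs, 1):
    print(f"lib {i}: avg_ins={lib.get('avg_ins')}, {len(lib['files'])} read file(s)")
    for key, path in lib["files"]:
        print(f"  {key} -> {path}")
```

If the number of libraries or files printed does not match what you expect, the config layout is the first thing to look at.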

cchd0001 commented 5 years ago

Hi, I copied your config file and tested it with my own data; it finished without error. Could you please upload the first 20 lines of each of your input files so that I can debug further? Thanks.

SIlviaHinojosa commented 5 years ago

Hi

I ran it again and it has been running for 22 days now, but it is still just reading all the reads... Here are the 20 lines you asked for:

Pregraph


Parameters: pregraph -s /users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/config_file_2 -K 63 -p 24 -R -o /users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/graph_prefix_2

In /users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/config_file_2, 15 lib(s), maximum read length 150, maximum name length 256.

24 thread(s) initialized.
Import reads from file: /users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_03_R1.fastq.gz
Import reads from file: /users-d1/shinojosa/Mammillaria_Illumina/Mammilaria/Mammillaria_03_R2.fastq.gz
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
--- 400000000th reads.
--- 500000000th reads.
--- 600000000th reads.
--- 700000000th reads.
--- 800000000th reads.
--- 900000000th reads.
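To see from a log like this which files pregraph has opened so far and how far the read counter has got, the output can be scanned for the two line patterns shown above. This is a rough Python sketch, assuming the log keeps exactly the "Import reads from file:" and "--- Nth reads." wording seen here; `summarize_pregraph_log` and the sample paths are made-up names for illustration.

```python
import re

# Patterns taken from the log lines quoted in this thread.
IMPORT_MARK = re.compile(r"Import reads from file:\s*(\S+)")
READ_MARK = re.compile(r"--- (\d+)th reads\.")


def summarize_pregraph_log(text):
    """Return (files opened so far, highest read counter printed so far)."""
    files = IMPORT_MARK.findall(text)
    counts = [int(n) for n in READ_MARK.findall(text)]
    return files, (max(counts) if counts else 0)


# Made-up log excerpt; the paths are placeholders.
SAMPLE_LOG = """\
24 thread(s) initialized.
Import reads from file: /data/sample_03_R1.fastq.gz
Import reads from file: /data/sample_03_R2.fastq.gz
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
"""

files, latest = summarize_pregraph_log(SAMPLE_LOG)
print(f"{len(files)} file(s) opened, latest counter: {latest}")
for path in files:
    print("  ", path)
```

Because the patterns are searched across the whole text, this also works if the log has been pasted as one long line.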

cchd0001 commented 5 years ago

I have also been dealing with this issue recently. Here is what I found. Reads that are loaded later take far longer to process because:

  1. too many hash re-allocs happen;
  2. too many hash key conflicts have to be resolved.

As a result, compared with the reads loaded at the start, the same number of reads loaded later needs much more time to process. For a huge genome that needs hundreds of gigabytes of memory, try the -a parameter to reduce the pregraph time cost. When you use -a, always choose a size a little bigger than is actually needed.
The biggest value -a can take is 700 (GB); I hope that is big enough for your genome. Best wishes.
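For illustration, here is how the pregraph call from the log above might look with -a added. This is only a sketch in Python that builds the command line without running it: the helper `pregraph_command`, the binary name `SOAPdenovo-63mer`, the short `config_file` and `graph_prefix` paths, and the value 300 are assumptions for the example, and the 700 GB ceiling is simply the figure quoted in this thread.

```python
import shlex


def pregraph_command(binary, config, prefix, kmer=63, threads=24, init_mem_gb=None):
    """Build a pregraph command line; -a is added only when init_mem_gb is given."""
    cmd = [binary, "pregraph", "-s", config, "-K", str(kmer),
           "-p", str(threads), "-R", "-o", prefix]
    if init_mem_gb is not None:
        # 700 (GB) is the maximum mentioned in this thread.
        if not 0 < init_mem_gb <= 700:
            raise ValueError("init_mem_gb must be between 1 and 700")
        cmd += ["-a", str(init_mem_gb)]
    return cmd


# Assumed binary name and placeholder paths; 300 is an arbitrary example value.
cmd = pregraph_command("SOAPdenovo-63mer", "config_file", "graph_prefix",
                       init_mem_gb=300)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` as-is if you want to launch the run from a script. Pick the -a value from your own memory estimate rather than from this example.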