adigenova / wengan

An accurate and ultra-fast hybrid genome assembler
GNU Affero General Public License v3.0

Does it run on a computer cluster, and how can I continue running unfinished tasks? #39

Closed Hans-zhao831 closed 3 years ago

Hans-zhao831 commented 3 years ago

Hello, I am trying to assemble a plant genome using different genome assembly software. I am very interested in Wengan, but I have a few questions.

  1. Does Wengan support computer clusters (e.g. SGE), and can it continue running unfinished tasks?
  2. Does Wengan support external assembler or alignment pipelines, with Wengan then used only to integrate the data and produce the final assembled genome? I guess this would be faster than the default pipeline, especially if resuming unfinished tasks is not supported.
  3. A range of read coverage (for both short and long reads) is recommended. Is this the best range? If the actual data exceeds this range, will the assembly results be worse?
adigenova commented 3 years ago

Hi Hans,

  1. Does Wengan support computer clusters (e.g. SGE), and can it continue running unfinished tasks? Wengan is designed to run on a single machine. It can continue unfinished tasks because Wengan generates a makefile (*.mk) to control its execution. You can use your cluster scheduler (e.g. SGE) to submit Wengan jobs, but each job will be executed on a single machine (a minimal submission sketch follows this list).
  2. Do you support some assembler or alignment pipeline in Wengan? The current version of Wengan supports three different short-read assemblers (Minia3, ABySS, and DiscovarDenovo). The other components of the pipeline were designed specifically for Wengan and include a tool for error-correcting short-read contigs (intervalmiss), a tool for aligning short and long reads (fastmin-sg), and liger, the final module that implements the SSG graph.
  3. We recommend 50X and 30X coverage for short and long reads, respectively. Increasing the short-read coverage beyond 50X is not very useful and may even produce worse short-read assemblies. Additionally, more short-read coverage increases the computational resources needed to complete the assembly. For long reads, we have done assemblies with 90X coverage, and the results are similar to or better than those using only 30X. Thus, you can increase the long-read coverage if you have the reads.
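As a rough illustration of point 1, a single-node SGE submission could look like the sketch below. The resource requests and file names are placeholders, and the wengan.pl options follow the usage shown in the README; double-check them against the Wengan version you have installed.

#!/bin/bash
# Sketch of an SGE job that runs Wengan on a single node (all threads on one machine).
# The parallel environment name (smp) and memory request vary by cluster; adjust them.
#$ -N wengan_asm
#$ -cwd
#$ -pe smp 32
#$ -l h_vmem=10G

# Option names as in the Wengan README; read paths and genome size are placeholders.
wengan.pl -x ontraw -a M \
          -s short_R1.fastq.gz,short_R2.fastq.gz \
          -l ont_reads.fastq.gz \
          -p asm_wengan -t 32 -g 3000

# If the job is interrupted, resubmitting the same script should resume the run:
# Wengan drives its steps through the generated asm_wengan.mk, so targets that
# already finished are skipped by make.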

Best,

Alex

zihhuafang commented 3 years ago

Hi Alex (@adigenova),

On the topic of coverage, we have ~30X of short reads and ~40X of ONT reads (N50 ~30 kb) for a genome of roughly human size. Is it better to run in M mode or D mode?

I was trying the D mode but got the error message below. I am not sure what the problem is.

export MALLOC_PER_THREAD=1
/wengan/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="Illumina/203_tursiops_unclass_Clean_R_1.fastq.gz,Illumina/203_tursiops_unclass_Clean_R_2.fastq.gz" OUT_DIR=/tmp/asm_wenganDD NUM_THREADS=32 2> asm_wenganD.Disco_denovo.err > asm_wenganD.Disco_denovo.log
asm_wenganD.mk:4: recipe for target 'asm_wenganD.contigs-disco.fa' failed
make: *** [asm_wenganD.contigs-disco.fa] Error 1

In asm_wenganD.Disco_denovo.log

1: 60 bases , 31 quals
2: 60 bases , 31 quals
See inconsistent base/quality lengths in Illumina/203_tursiops_unclass_Clean_R_1.fastq.gz or Illumina/203_tursiops_unclass_Clean_R_2.fastq.gz

Not sure what this means. We did the standard QC on our short reads.

Would appreciate your advice! Thanks, Zih-Hua

adigenova commented 3 years ago

Hi Zih-Hua,

Is it better to run in M mode or D mode? Wengan achieves better results with the D mode, but the D mode requires more memory than the other ones. For a 3 Gb genome at 60X short-read coverage, the D mode needs about 600 Gb of RAM; at lower coverage (~30X), it would require about 300 Gb.

Regarding the error message, DiscovarDenovo (Disco for short) is complaining that some short reads in your dataset have inconsistent base and quality lengths (probably a corrupt fastq file). My recommendation is to give the raw short reads as input to Wengan, because Disco error-corrects the short-read data using sophisticated algorithms that are more convenient than just trimming reads based on single-read qualities. Additionally, reads shorter than 60 bp are not supported by Disco and will also stop its execution. You can check that your reads are longer than 60 bp using fastp, for instance; a couple of quick checks are sketched below.
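Two quick sanity checks you could run on the files from your log (a sketch, assuming zcat, awk, and a recent fastp are available; report names are placeholders):

# Spot-check for FASTQ records whose sequence and quality strings differ in length
# (the kind of record Disco is complaining about, e.g. 60 bases vs 31 quals).
zcat Illumina/203_tursiops_unclass_Clean_R_1.fastq.gz | \
  awk 'NR%4==2 {s=length($0)} NR%4==0 && length($0)!=s {print "record", NR/4, ": seq", s, "bp vs qual", length($0)}'

# QC-only fastp pass: with no -o/-O, no FASTQ output is written, but the report
# shows how many reads would fail a 60 bp minimum-length filter.
fastp -i Illumina/203_tursiops_unclass_Clean_R_1.fastq.gz \
      -I Illumina/203_tursiops_unclass_Clean_R_2.fastq.gz \
      --length_required 60 \
      --json 203_qc.json --html 203_qc.html

Best,

Alex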

zihhuafang commented 3 years ago

Dear Alex,

Thanks for the reply. Just one quick question about trimming the reads: our reads were generated on a NovaSeq, so the reads carry poly-G tails. I guess I would still need to trim those before giving the reads to Wengan?

Thanks. Zih-Hua

adigenova commented 3 years ago

Yes, you can trim that tail, but make sure that all the reads remain longer than 60 bp after trimming; a minimal fastp sketch is below.
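For example, something along these lines with fastp (a sketch; input and output names are placeholders for your raw read files):

# Trim poly-G tails (a NovaSeq two-colour chemistry artifact) and drop reads that
# end up shorter than 60 bp, so the input still meets Disco's minimum read length.
fastp -i raw_R1.fastq.gz -I raw_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
      --trim_poly_g --length_required 60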

Best,

Alex