adigenova / wengan

An accurate and ultra-fast hybrid genome assembler
GNU Affero General Public License v3.0
genome-assembler hybrid illumina nanopore pacbio

Wengan

An accurate and ultra-fast genome assembler

Version: 0.2 (18/05/2020)

SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600GB of RAM)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling raw PacBio reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling raw PacBio reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling raw PacBio reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information. Wengan identifies misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.
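To make the idea of a synthetic mate-pair library concrete, here is a minimal sketch that cuts pairs of short sequences at a fixed distance out of a single long read. It is only an illustration: the function name `synthetic_pairs`, its arguments, and the output layout are assumptions made for this example, it ignores mate orientation, and it is not how Wengan's own components extract or map the pairs.

```shell
# Hypothetical helper (not part of Wengan): emit synthetic mate pairs from one
# long read. Prints "start<TAB>mate1<TAB>mate2" for each pair.
#   $1 = read sequence, $2 = insert size, $3 = mate length, $4 = step
synthetic_pairs() {
  awk -v seq="$1" -v ins="$2" -v len="$3" -v step="$4" 'BEGIN {
    n = length(seq)
    # Slide a window of width ins along the read; the two mates are the
    # first and the last len bases of each window.
    for (i = 1; i + ins - 1 <= n; i += step) {
      m1 = substr(seq, i, len)
      m2 = substr(seq, i + ins - len, len)
      printf "%d\t%s\t%s\n", i, m1, m2
    }
  }'
}
```

Calling the function once per insert size (for example 500, 1000 and 2000, values chosen here purely for illustration) on the same read would give a small "spectrum" of libraries in the sense used above.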

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.
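To check whether a library reaches the recommended coverage, depth can be estimated as total sequenced bases divided by genome size. The sketch below is a hypothetical helper, not part of Wengan; it assumes standard 4-line FASTQ records and takes the genome size in megabases, which is presumably the unit behind the `-g 3000` used for human data in the synopsis.

```shell
# Hypothetical helper (not part of Wengan): estimate sequencing depth.
#   $1 = genome size in Mb; remaining args = FASTQ files (plain or gzipped)
estimate_coverage() {
  genome_mb="$1"; shift
  for f in "$@"; do
    case "$f" in
      *.gz) gzip -dc "$f" ;;
      *)    cat "$f" ;;
    esac
  done | awk -v g="$genome_mb" '
    NR % 4 == 2 { bases += length($0) }   # line 2 of each record is the sequence
    END { printf "%.1f\n", bases / (g * 1e6) }'
}
```

For a paired-end library, pass both files, for example `estimate_coverage 3000 lib1.fwd.fastq.gz lib1.rev.fastq.gz`. The same function works for a long-read file.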

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. This is the lowest-memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. This is the most memory-intensive mode of Wengan: assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
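The wall-clock figures quoted for the modes follow from dividing CPU hours by the number of CPUs. The helper below is a hypothetical illustration that assumes perfect parallel scaling, so real runs will likely take somewhat longer.

```shell
# Ideal wall-clock hours = CPU hours / number of CPUs (assumes perfect scaling).
#   $1 = CPU hours, $2 = number of CPUs
wall_hours() {
  awk -v cpu_h="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", cpu_h / cpus }'
}
```

`wall_hours 900 20` gives 45.0 hours, in line with the ~2 days quoted for WenganA and WenganD, and `wall_hours 210 20` gives 10.5 hours for the WenganM upper bound.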

Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.

ontlon

preset for raw ultra-long-reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 ~[8kb-60kb].

pacccs (experimental)

preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 ~[15kb]. This type of data is not fully supported yet.
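Since the presets are described in terms of read N50, it can help to measure the N50 of a long-read file before choosing one. The following is a minimal sketch of a hypothetical helper (not part of Wengan) that assumes 4-line FASTQ records: it sorts read lengths from longest to shortest and reports the length at which half of the total bases are covered.

```shell
# Hypothetical helper (not part of Wengan): read-length N50 of a FASTQ file.
#   $1 = FASTQ file (plain or gzipped)
read_n50() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac |
    awk 'NR % 4 == 2 { print length($0) }' |
    sort -rn |
    awk '{ len[NR] = $1; total += $1 }
         END {
           # Walk from the longest read down until half of all bases are covered.
           for (i = 1; i <= NR; i++) {
             acc += len[i]
             if (acc >= total / 2) { print len[i]; exit }
           }
         }'
}
```

For example, `read_n50 ont.fastq.gz` prints the N50 in bases, which can then be compared with the ranges listed for each preset.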

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

| Genome | Long reads | Short reads | Wengan Mode | NG50 (Mb) | CPU (h) | RAM (GB) | Fasta file |
|---|---|---|---|---|---|---|---|
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganA | 25.99 | 725 | 45 | asm |
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganM | 17.23 | 203 | 53 | asm |
| NA12878 | ONT 35X (rel5) | 2x250bp 60X (ENA:rs1, rs2) | WenganD | 35.31 | 589 | 622 | asm |
| HG00073 | PAC 90X (ENA:rl1) | 2x250bp 63X (ENA:rs1, rs2) | WenganD | 32.35 | 936 | 644 | asm |
| NA24385 | ONT 60X (GIAB:rl1) | 2x250bp 70X (GIAB:rs1) | WenganD | 50.59 | 963 | 651 | asm |
| CHM13 | ONT 50X (T2T:rel3) | 2x250bp 66X (ENA:rs1, rs2) | WenganD | 69.72 | 1198 | 646 | asm |

The assemblies generated using Wengan (v0.2) can be downloaded from Zenodo. All the assemblies were run as described in the Wengan manuscript. NG50 was computed using a genome size of 3.08Gb.
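NG50 differs from N50 in that the halfway threshold is taken from a fixed genome size rather than from the total assembly length. The sketch below shows one way to compute it from a FASTA file; the function name `ng50` and the "NA" output for assemblies covering less than half the genome are choices made for this example, not the procedure used in the manuscript.

```shell
# Hypothetical helper (not part of Wengan): NG50 of an assembly.
#   $1 = FASTA file, $2 = genome size in bp
ng50() {
  # Print one length per sequence (multi-line records are summed).
  awk '/^>/ { if (n) print n; n = 0; next }
       { n += length($0) }
       END { if (n) print n }' "$1" |
    sort -rn |
    # Accumulate lengths, longest first, until half the genome size is reached.
    awk -v g="$2" '{ acc += $1
                     if (acc >= g / 2) { print $1; found = 1; exit } }
                   END { if (!found) print "NA" }'
}
```

With the genome size stated above, a call would look like `ng50 assembly.fasta 3080000000`, where `assembly.fasta` is a placeholder file name. The result is in bases, so divide by 1,000,000 to compare with the Mb values in the table.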

Wengan components

Getting the latest source code

Instructions

It is recommended to download and use the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases

Containers

To facilitate the execution of Wengan, we provide Docker/Singularity containers. Wengan images are hosted on Docker Hub and can be downloaded with the command:

docker pull adigenova/wengan:v0.2

Alternatively, using singularity:

export TMPDIR=/tmp
singularity pull docker://adigenova/wengan:v0.2

Run WenganM using singularity

#using singularity
CONTAINER=/path_to_container/wengan_v0.2.sif

#location of wengan in the container
WENGAN=/wengan/wengan-v0.2-bin-Linux/wengan.pl

#run WenganM with singularity exec
singularity exec $CONTAINER perl ${WENGAN} \
 -x pacraw \
 -a M \
 -s short.R1.fastq.gz,short.R2.fastq.gz \
 -l pacbio.clr.fastq.gz \
 -p asm_wengan -t 20 -g 3000

Building Wengan from source

To build Wengan from source, first fetch the code and its components with the following command:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

Each Wengan component has its own build instructions. After compilation, copy the binaries to wengan-dir/bin.

Requirements

A C++ compiler and cmake 3.2+. Compilation was tested with gcc version GCC/7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX).

Specific component source code versions used to build Wengan v0.2

  1. abyss commit d4b4b5d
  2. discovarexp-51885 commit f827bab
  3. minia commit 017d23e
  4. fastmin-sg commit 861b061
  5. intervalmiss commit 11be8b42
  6. liger commit 63a044b0
  7. seqtk commit 2efd0c8

Limitations

1. Genomes larger than 4Gb are not supported yet.

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous group of south-central Chile. Wengan means "Making the path".

Citation

Di Genova, A., Buena-Atienza, E., Ossowski, S. and Sagot, M-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nature Biotechnology (2020).