GNU General Public License v3.0

LightAssembler

LightAssembler is a lightweight-resources assembly algorithm for high-throughput sequencing reads. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of g-spaced sequenced kmers, and the other holds the kmers classified as likely correct by a simple statistical test. LightAssembler also contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to competing tools. More details about LightAssembler can be found in: El-Metwally, S., Zakaria, M. and Hamza, T.; [LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw470). Bioinformatics 2016; 32 (21): 3215-3223. doi: 10.1093/bioinformatics/btw470.
Copyright (C) 2015-2016, and GNU GPL, by Sara El-Metwally, Magdi Zakaria and Taher Hamza.
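
To make the idea of sampling g-spaced kmers into a Bloom filter concrete, here is a minimal Python sketch. It is not the LightAssembler implementation: the `BloomFilter` class is a plain (not cache-oblivious) toy, the names `BloomFilter` and `sample_kmers` are hypothetical, and "g-spaced" is assumed to mean kmers whose start positions are g bases apart. Reverse complements and the statistical test for trusted kmers are left out.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration; sizes and hashing are assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def sample_kmers(read, k, g):
    """Yield kmers whose start positions are g apart (assumed meaning of g-spaced)."""
    for start in range(0, len(read) - k + 1, g):
        yield read[start:start + k]


bloom_a = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in sample_kmers("ACGTACGTAC", k=4, g=3):
    bloom_a.add(kmer)
```

Larger values of g put fewer kmers into the first filter, which is the memory saving that sampling is meant to provide.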

System requirements

A 64-bit machine with the g++ compiler (or gcc in general) and the pthreads and zlib libraries.

Installation

  1. Clone the GitHub repo, e.g. with git clone https://github.com/SaraEl-Metwally/LightAssembler.git
  2. Run make in the repo directory for k <= 31 or make k=kmersize for k > 31, e.g. make k=49.

Quick usage guide

./LightAssembler -k [kmer size] -g [gap size] -e [error rate] -G [genome size] -t [threads] -o [output prefix] [input files] --verbose
* [-k] kmer size                [default: 31]
* [-g] gap size                 [default: 25X:3 35X:4 75X:8 140X:15 280X:25]
* [-e] error rate               [default: 0.01]
* [-G] genome size              [default: 0]
* [-t] number of threads        [default: 1]
* [-o] output prefix file name  [default: LightAssembler]
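
The default for -g reads as a list of coverage:gap pairs, and Example 2 below (coverage 35, starting gap 4) is consistent with that reading. The sketch below encodes the pairs as a lookup table. How LightAssembler treats coverages between the listed values is not stated here, so picking the nearest listed coverage is purely an assumption, and `DEFAULT_GAPS` and `default_gap` are hypothetical names.

```python
# Coverage (X) -> gap size, copied from the -g default listed above.
DEFAULT_GAPS = {25: 3, 35: 4, 75: 8, 140: 15, 280: 25}


def default_gap(coverage):
    """Return the gap of the nearest listed coverage (assumed rule, not confirmed)."""
    nearest = min(DEFAULT_GAPS, key=lambda listed: abs(listed - coverage))
    return DEFAULT_GAPS[nearest]


print(default_gap(35))  # prints 4
```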

Notes

Read files

LightAssembler accepts multiple input files of sequencing reads in fasta/fastq format. It can also read gzip-compressed input files (fasta.gz/fastq.gz) directly.

Outputs

The output of LightAssembler is the set of assembled contigs in fasta format, in the file:

[output prefix].contigs.fasta
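
As a quick way to inspect the contigs file, the following Python sketch computes the number of contigs, the maximum contig length and the total assembly size from a fasta file. It is an independent helper, not part of LightAssembler; the function name `contig_stats` is hypothetical, and the file is assumed to be plain (uncompressed) fasta with `>` header lines.

```python
def contig_stats(fasta_path):
    """Summarize a fasta file: contig count, longest contig, total bases."""
    lengths = []
    current = None  # length of the contig being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return {
        "contigs": len(lengths),
        "max_length": max(lengths, default=0),
        "assembly_size": sum(lengths),
    }
```

For the run in Example 1 this would be called as `contig_stats("ecoli_contigs.contigs.fasta")`.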

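LightAssembler also reports the following on the screen (see Example 3): the elapsed time of each step, the number of contigs, the maximum contig length, the assembly size and the genome coverage.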

With the --verbose option, LightAssembler reports additional details for each step, such as the number of kmers, the false positive rate of the Bloom filters, the number of branching kmers in the dataset, the average read length and the average sequencing coverage.
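
To estimate the average read length and sequencing coverage of a dataset before choosing -g, a sketch like the one below can be used. It assumes four-line fastq records, takes coverage to be total sequenced bases divided by genome size (the usual definition, which may differ from how LightAssembler computes its reported value), and uses the hypothetical names `open_reads` and `fastq_read_stats`.

```python
import gzip


def open_reads(path):
    """Open a plain or gzip-compressed read file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fastq_read_stats(paths, genome_size):
    """Return (average read length, coverage) over four-line fastq files."""
    total_reads = 0
    total_bases = 0
    for path in paths:
        with open_reads(path) as handle:
            for line_number, line in enumerate(handle):
                # In a four-line record the sequence is the second line.
                if line_number % 4 == 1:
                    total_reads += 1
                    total_bases += len(line.strip())
    average_length = total_bases / total_reads if total_reads else 0.0
    coverage = total_bases / genome_size
    return average_length, coverage
```

For the files in the examples below, the call would be `fastq_read_stats(["ecoli_reads_1.fq", "ecoli_reads_2.fq"], 4686137)`.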

Example 1

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Uniform kmers sampling. 

--- h(0):m(0):s(5) elapsed time.
--- total number of kmers in BloomA = 7791111
--- BloomA false positive rate = 0.00193375
--- average read length = 101
--- average sequencing coverage = 35
--- probability of an incorrect kmer appears in the sample : 0.0249524

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(24) elapsed time.
--- total number of kmers in BloomB = 4548112
--- BloomB false positive rate = 7.7715e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(5) elapsed time.
--- number of branching kmers = 54644

--- Graph traversal. 

--- h(0):m(0):s(16) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time. 
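
The genome coverage figure in this output is consistent with the assembly size divided by the genome size passed with -G. This is an observation from the numbers above rather than a documented formula, and `genome_coverage_percent` is a hypothetical helper name.

```python
def genome_coverage_percent(assembly_size, genome_size):
    """Assembly size as a percentage of the expected genome size."""
    return 100.0 * assembly_size / genome_size


# Values taken from Example 1: assembly size 4473869, -G 4686137.
print(f"{genome_coverage_percent(4473869, 4686137):.4f}%")  # prints 95.4703%
```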

Example 2 (without -g)

./LightAssembler -k 31 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Parameters extrapolation.

--- h(0):m(0):s(1) elapsed time.
--- start with gap size g = 4
--- average read length = 101
--- average sequencing coverage = 35

--- Uniform kmers sampling. 

--- h(0):m(0):s(8) elapsed time.
--- total number of kmers in BloomA = 27604568
--- BloomA false positive rate = 0.0375047
--- probability of an incorrect kmer appears in the sample : 0.118144

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(9) elapsed time.
--- total number of kmers in BloomB = 4655530
--- BloomB false positive rate = 9.1219e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(2) elapsed time.
--- number of branching kmers = 57242

--- Graph traversal. 

--- h(0):m(0):s(22) elapsed time.
--- number of contigs     = 747
--- maximum contig length = 127975
--- assembly size         = 4474072
--- genome coverage       = 95.4746%

--- The assembly session is finished. 

--- h(0):m(0):s(42) elapsed time.

Example 3 (without --verbose)

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq

--- Uniform kmers sampling. 

--- h(0):m(0):s(2) elapsed time.

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(11) elapsed time.

--- Branching-kmers computation. 

--- h(0):m(0):s(1) elapsed time.

--- Graph traversal. 

--- h(0):m(0):s(17) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time.