DISCO (Distributed Co-assembly of Overlap graphs) is a multithreaded, multiprocess, distributed-memory overlap-layout-consensus (OLC) metagenome assembler. The detailed user manual, which explains how to use the assembler to achieve the best results, is available at http://disco.omicsbio.org/user-manual. This is a quick start guide intended mainly for developers and testers. Users with limited experience in genome assembly are advised to use the user manual.
The runDisco... scripts can be used to run the assembler. There are two basic versions of the assembler: one for running on a single machine and another for running with MPI on a cluster. Both versions require pre-processing of the raw Illumina reads. We provide two scripts to perform data pre-processing. The details are provided in the Preprocessing of the Illumina data section below. If your data is already pre-processed, please continue to the Quickly Running DISCO section.
#!/bin/bash
# Pre-processing and assembly of separated paired end reads
runAssembly.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX}
# Pre-processing and assembly of interleaved paired end reads
runAssembly.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX}
#!/bin/bash
# Pre-processing and assembly of separated paired end reads
runECC.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX}
# Pre-processing and assembly of interleaved paired end reads
runECC.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX}
There are two versions of the assembler: one for running on a single machine and one for running with MPI on a cluster.
Single Machine Version: The assembler is invoked through the run script ./runDisco.sh. Make sure the machine has more RAM than the size on disk of the uncompressed reads. The quick start commands shown below can be used in a batch job submission script or typed directly on the command line.
#!/bin/bash
# Separated paired end reads
runDisco.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX}
# Interleaved paired end reads
runDisco.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX}
Use ./runDisco.sh -h for help information.
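The RAM requirement above can be checked before launching a run. The following is a minimal sketch, not part of DISCO: total_read_bytes and enough_ram are hypothetical helper names, and the commented usage assumes a Linux machine where /proc/meminfo is available.

```shell
#!/bin/bash
# Sketch: compare the uncompressed size of the reads with the RAM on this machine.
# total_read_bytes and enough_ram are hypothetical helpers, not DISCO commands.

# Print the total uncompressed size, in bytes, of the given FASTQ files.
total_read_bytes() {
    local total=0 n f
    for f in "$@"; do
        case "$f" in
            *.gz) n=$(gzip -dc "$f" | wc -c) ;;  # decompress to count the real size
            *)    n=$(wc -c < "$f") ;;
        esac
        total=$((total + n))
    done
    echo "$total"
}

# Succeed only if the RAM (arg 2, bytes) is strictly larger than the reads (arg 1, bytes).
enough_ram() {
    [ "$2" -gt "$1" ]
}

# Usage (Linux only, assumes /proc/meminfo reports MemTotal in kB):
# ram_bytes=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))
# enough_ram "$(total_read_bytes readA_1.fastq readA_2.fastq)" "$ram_bytes" \
#     || echo "Reads may not fit in RAM on this machine" >&2
```

Counting the decompressed bytes of .gz inputs takes a full pass over each file, but it measures the uncompressed size that the requirement refers to.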
MPI Version: This version of the assembler should be used if you are going to run the assembler with MPI support on a cluster. The run script to invoke the assembler depends on the cluster management and job scheduling system.
The three run scripts are runDisco-MPI.sh, runDisco-MPI-SLRUM.sh, and runDisco-MPI-ALPS.sh. For the basic MPI version, make sure the RAM on the nodes is larger than the size on disk of the reads. If you have a large dataset, use the Remote Memory Access (RMA) version instead. The RMA version of the assembler distributes about 70% of the memory usage equally across all the MPI nodes. The quick start commands are:
#!/bin/bash
### MPI Version
### Separated paired end reads
runDisco-MPI.sh -d ${output_dir} -in1 ${read_1_fastq} -in2 ${read_2_fastq} -o ${OP_PREFIX}
### MPI Remote Memory Access (RMA) Version
### Separated paired end reads
runDisco-MPI.sh -d ${output_dir} -in1 ${read_1_fastq} -in2 ${read_2_fastq} -o ${OP_PREFIX} -rma
Use runDisco-MPI.sh -h for help information.
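The choice between the basic MPI version and the RMA version can be scripted as a rule of thumb. The sketch below assumes the size of the reads and the RAM of a node are already known in bytes. mpi_mode_flag is a hypothetical helper, not part of DISCO, and its threshold simply restates the guidance above (use RMA when the reads do not fit in a node's RAM).

```shell
#!/bin/bash
# Sketch: print "-rma" when the reads are not smaller than one node's RAM.
# mpi_mode_flag is a hypothetical helper, not a DISCO command.
mpi_mode_flag() {
    local read_bytes=$1 node_ram_bytes=$2
    if [ "$node_ram_bytes" -le "$read_bytes" ]; then
        printf '%s\n' "-rma"   # reads do not fit: suggest the RMA version
    fi
}

# Usage (placeholders as in the quick start commands):
# runDisco-MPI.sh -d "${output_dir}" -in1 reads_1.fastq -in2 reads_2.fastq \
#     -o "${OP_PREFIX}" $(mpi_mode_flag "$read_bytes" "$node_ram_bytes")
```

The command substitution is left unquoted on purpose so that an empty result adds no argument to the command line.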
The raw Illumina sequences need to be preprocessed before assembly with Disco. Disco provides wrapper scripts to perform preprocessing with BBTools. Please see the user manual for more details: http://disco.omicsbio.org/user-manual. We package BBTools inside our release for ease of use. The BBTools scripts shown below are available in the bbmap directory.
Since Disco works best with error-free reads, preprocessing plays an important role in determining the quality of the assembly results. The three basic pre-processing steps are trimming, filtering, and error correction.
We have tested Brian Bushnell's suite of tools, BBTools, extensively on Illumina data and have obtained good results. Suppose the Illumina read data set is called $reads; the steps we recommend are as follows:
#!sh
# Use bbduk.sh to quality and length trim the Illumina reads and remove adapter sequences
# 1. ftm=5, right-trim read length to a multiple of 5
# 2. k=23, kmer length used for finding contaminants
# 3. ktrim=r, trim reads to remove bases matching reference kmers to the right
# 4. mink=7, look for shorter kmers at read tips down to 7 bp
# 5. hdist=1, maximum Hamming distance allowed for reference kmers
# 6. tbo, trim adapters based on where paired reads overlap
# 7. tpe, when kmer right-trimming, trim both reads to the minimum length of either
# 8. qtrim=r, trim read right ends to remove bases with low quality
# 9. trimq=15, regions with average quality below 15 will be trimmed
# 10. minlength=70, reads shorter than 70 bp after trimming will be discarded (optional, not set below)
# 11. ref=$adapters, adapters shipped with BBTools
# 12. -Xmx8g, use 8 GB of memory (optional, not set below)
# 13. 1>trim.o 2>&1, redirect stderr to stdout and save both to the file trim.o (optional, not set below)
adapters=bbmap/resources/adapters.fa
artifacts=bbmap/resources/sequencing_artifacts.fa.gz
phiX_adapters=bbmap/resources/phix174_ill.ref.fa.gz
bbduk.sh in=$reads out=trim.fq.gz ktrim=r k=23 mink=7 hdist=1 tpe tbo ref=${adapters} ftm=5 qtrim=r trimq=15
bbduk.sh in=trim.fq.gz out=filter.fq.gz k=23 hdist=1 ref=${artifacts},${phiX_adapters}
Tadpole is a memory-efficient error correction tool from the BBTools package that runs within a reasonable time. We also use the bbmerge tool from the same package to error correct overlapping paired end reads. We suggest using the following commands for error correction.
#!bash
# 1. ecco mode of bbmerge for correction of overlapping paired end reads without merging
# 2. ecc, use tadpole in error correction mode
bbmerge.sh in=filter.fq.gz out=ecc.fq.gz ecco mix adapters=default
tadpole.sh in=ecc.fq.gz out=tecc.fq.gz ecc ordered prefilter=1
# If the above runs out of memory, try
tadpole.sh in=ecc.fq.gz out=tecc.fq.gz ecc ordered prefilter=2
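The manual fallback above can be automated. The following is a minimal sketch under stated assumptions: ecc_with_fallback is a hypothetical wrapper, not part of DISCO or BBTools, and the TADPOLE variable is an illustrative override for the script path. It retries on any non-zero exit status, not only on out-of-memory failures.

```shell
#!/bin/bash
# Sketch: run tadpole with prefilter=1 and, if that fails, retry with prefilter=2.
# ecc_with_fallback and TADPOLE are hypothetical names used for illustration.
ecc_with_fallback() {
    local input=$1 output=$2
    local tadpole=${TADPOLE:-tadpole.sh}   # path to the tadpole script
    "$tadpole" in="$input" out="$output" ecc ordered prefilter=1 ||
        "$tadpole" in="$input" out="$output" ecc ordered prefilter=2
}

# Usage:
# ecc_with_fallback ecc.fq.gz tecc.fq.gz
```

Because any failure triggers the retry, it is worth checking the tadpole log to confirm that memory was the actual cause.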
The Disco assembler is invoked through the run script ./runDisco.sh. The basic quick start commands with default parameters are as follows. The default parameters are based on empirical tests on real metagenomic datasets.
#!/bin/bash
# Separated paired end reads
runDisco.sh -d ${output_directory} -in1 ${read_1_fastq} -in2 ${read_2_fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX}
# Interleaved paired end reads
runDisco.sh -d ${output_directory} -inP ${read_P_fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX}
# Single end reads
runDisco.sh -d ${output_directory} -inS ${read_fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX}
For all the options of Disco, use ./runDisco.sh -h
If a run is terminated because it exceeds the wall-clock time limit, the assembler can be restarted with the same command.
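Restarting with the same command can be wrapped in a small retry loop. The sketch below is illustrative only: retry_same_command is a hypothetical helper, not part of DISCO. On most clusters the scheduler kills the whole batch job at the time limit, in which case resubmitting the same job script has the same effect.

```shell
#!/bin/bash
# Sketch: rerun the same command until it succeeds or max_tries is reached.
# retry_same_command is a hypothetical helper, not a DISCO command.
retry_same_command() {
    local max_tries=$1
    shift
    local attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0          # the command finished successfully
        fi
        echo "attempt $attempt failed; restarting with the same command" >&2
        attempt=$((attempt + 1))
    done
    return 1                  # every attempt failed
}

# Usage (placeholders as in the quick start commands):
# retry_same_command 3 runDisco.sh -d "${output_directory}" -inP reads.fastq \
#     -n "${num_threads}" -o "${OP_PREFIX}"
```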
The assembler can be run on a distributed machine using the three distributed assembly scripts.
Usage:
runDisco.sh [OPTION]......