biocore / deblur

Deblur is a greedy deconvolution algorithm based on known read error profiles.
BSD 3-Clause "New" or "Revised" License
92 stars 41 forks source link

Deblur

Build Status Coverage Status

Deblur is a greedy deconvolution algorithm for amplicon sequencing based on Illumina Miseq/Hiseq error profiles.

Install

and activate it using:

source activate deblurenv

(note you will need to activate this environment every time you want to use Deblur)

Install Deblur dependencies and Deblur itself:

conda install -c bioconda -c biocore VSEARCH>=2.7.0 MAFFT>=7.394 SortMeRNA=2.0 biom-format deblur

N.B. Some dependencies are version restricted at the moment but for different reasons. SortMeRNA 2.1 has a different output format which Deblur is not compatible with yet. A review of the changelog did not reveal any remarkable notes (e.g., bugs) about the reasons for the differences. In testing, the differences affected <0.1% of the sOTUs. As a precaution, we are advising the use of these specific versions for consistency with the manuscript.

Example usage

We recommend using Deblur via the QIIME2 plugin q2-deblur. Examples of its use can be found within the plugin itself. However, Deblur itself does not depend on QIIME2.

If you are running Deblur directly, we recommend focusing on the workflow subcommand. Detailed help can be obtained with:

deblur workflow --help

As a simple example, let's specify an input FASTA file, an output path and a sequence trim length of 150. This command will trim all sequences in all_samples.fna to 150nt in length; any read that is shorter will be omitted. This execution mode assumes that all_samples.fna is demultiplexed such that the sequence IDs are compatible with QIIME 1.9.1. On completion, a new directory output will be created with multiple output files (see the Input and Output Files section for more detail).

deblur workflow --seqs-fp all_samples.fna --output-dir output -t 150

If starting from a barcode and read file, you can first use the qiime split_libraries_fastq.py command (we recommend using -q 19 to remove low quality reads):

split_libraries_fastq.py -i XXX_R1_001.fastq -m map.txt -o split -b XXXX_I1_001.fastq -q 19

The resulting split/seqs.fna file can be used as the input to the deblur workflow.

Input and Output Files

The input to Deblur workflow is a directory of FASTA or FASTQ files (1 per sample) or a single demultiplexed FASTA or FASTQ file. These files can be gzip'd. The output directory will contain three BIOM tables in which the observation IDs are the Deblurred sequences. The outputs are contingent on the reference databases used and a more focused discussion on them is in the subsequent README section titled "Positive and Negative Filtering." The output files are as follows:

Important options

Deblur cannot associate sequences with different lengths. As such, trimming reads is a required first step in the Deblur pipeline. The sequence trim length is specified by the -t NNN flag, where NNN denotes the length all sequences will be trimmed to. All reads shorter than this length will be discarded. If the input data are known to have a common length, it is possible to disable trimming by specifying a trim value of -1.

Deblur can operate in parallel. The number of threads can be specified by the -O NNN flag (default it 1). Running more threads than available cores is not advised.

Positive and Negative Filtering

Deblur uses two types of filtering on the sequences:

The FASTA files for both of these filtering steps can be supplied via the --neg-ref-fp and --pos-ref-fp options. By default, the negative database is composed of PhiX and adapter sequence and the positive database of known 16S sequences.

Deblur uses negative mode filtering to remove known artifact (i.e. PhiX and Adapter sequences) prior to denoising. The output of Deblur contains three files: all.biom, which includes all sOTUs, reference-hit.biom, which contains the output of positive filtering of the sOTUs (default only sOTUs similar to 16S sequences), and reference-non-hit.biom, which contains only sOTUs failing the positive filtering (default only non-16S sOTUs).

Minimal Reads Filtering

Deblur runs on each sample independently. However, sometimes there is also additional information based on the total number of times an sOTU is observed in all samples (e.g. an sOTU which is observed only in one sample at low read count is more likely to be pcr/read error as opposed to an sOTU present in many samples). The --min-reads option allows to use this information by removing sOTUs with a total read count (across all samples) lower than the given threshold. The default value is set to 10, and should be useful for most cases. However, if such filtering is not wanted (e.g. if all samples in an experiment are expected not to contain the same bacteria, so no additional information is gained by combining the information from multiple samples), --min-reads can be set to 0 to skip this final filtering step.

Troubleshooting

RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

You can solve it by the following commands:

cd ~/.matplotlib
echo "backend: TkAgg" >> ~/.matplotlib/matplotlibrc

Code Development Note

Some of the code in the package deblur has been derived from QIIME. The contributors to these specific QIIME modules have granted permission for this porting to take place and put under the BSD license.