aomlomics / tourmaline

Amplicon sequence processing workflow using QIIME 2 and Snakemake
BSD 3-Clause "New" or "Revised" License

Tourmaline

Tourmaline is an amplicon sequence processing workflow for Illumina sequence data that uses QIIME 2 and the software packages it wraps. Tourmaline manages commands, inputs, and outputs using the Snakemake workflow management system.

The current version of Tourmaline supports qiime2-2023.5. To use previous versions of QIIME 2, check out previous Tourmaline versions under Releases.

Why should I use Tourmaline?

Tourmaline has several features that enhance usability and interoperability:

What QIIME 2 options does Tourmaline support?

If you have used QIIME 2 before, you might be wondering which QIIME 2 commands Tourmaline uses and supports. All commands are specified as rules in Snakefile, and typical workflows without and with sequence filtering are shown as directed acyclic graphs in the folder dags. The main analysis features and options supported by Tourmaline and specified by the Snakefile are as follows:

How do I cite Tourmaline?

Please cite our paper in GigaScience:

How do I get started?

If this is your first time using Tourmaline or Snakemake, you may want to browse through the Wiki for a detailed walkthrough. If you want to get started right away, check out the Quick Start below and follow along with the video tutorial on YouTube.

Contact us

Quick Start

Tourmaline provides Snakemake rules for DADA2 (single-end and paired-end) and Deblur (single-end). For each type of processing, there are four steps:

  1. the denoise rule imports FASTQ data and runs denoising, generating a feature table and representative sequences;
  2. the taxonomy rule assigns taxonomy to representative sequences;
  3. the diversity rule does representative sequence curation, core diversity analyses, and alpha and beta group significance; and
  4. the report rule generates an HTML report of the outputs plus metadata, inputs, and parameters. Running the report rule directly runs the entire workflow.

Steps 2–4 have unfiltered and filtered modes. In filtered mode, the taxonomy step removes undesired taxonomic groups or individual sequences from the representative sequences and feature table. The diversity and report rules are the same in both modes, except that the output goes into separate subdirectories.

Install

Before you download the Tourmaline commands and directory structure from GitHub, you first need to install QIIME 2, Snakemake, and the other dependencies of Tourmaline. Two options are provided: a native installation on a Mac or Linux system and a Docker image/container. If you have an Apple Silicon chip (M1, M2 Macs), the instructions to install QIIME 2 vary slightly.

Option 1: Native installation

To run Tourmaline natively on a Mac (Intel) or Linux system, start with a Conda installation of Snakemake.

conda create -c conda-forge -c bioconda -n snakemake snakemake-minimal

Then install QIIME 2 with conda (for Linux, change "osx" to "linux"):

wget https://data.qiime2.org/distro/core/qiime2-2023.5-py38-osx-conda.yml
conda env create -n qiime2-2023.5 --file qiime2-2023.5-py38-osx-conda.yml

Activate the qiime2-2023.5 environment and install the other Conda- or pip-installable dependencies:

conda activate qiime2-2023.5
conda install -c conda-forge -c bioconda biopython muscle clustalo tabulate
conda install -c conda-forge deicode
pip install empress
qiime dev refresh-cache
conda install -c bioconda bioconductor-msa bioconductor-odseq
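
After the installs above, a quick check that the expected programs are on your PATH can catch an incomplete environment early. The following is a minimal sketch, not part of Tourmaline: the function name `missing_tools` is hypothetical, and the executable names `qiime`, `muscle`, and `clustalo` are assumed from the package names.

```shell
# Report which of the given commands are not on PATH.
# Prints one missing command name per line; prints nothing if all are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executable names below are assumptions based on the packages installed above.
missing=$(missing_tools qiime muscle clustalo)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All checked tools were found."
fi
```

Run it with the qiime2-2023.5 environment active, since that is where these packages were installed.
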

Apple Silicon Macs

Follow these instructions for Macs with M1/M2 chips.

First, set your Terminal application to run in Rosetta mode.

wget https://data.qiime2.org/distro/core/qiime2-2023.5-py38-osx-conda.yml
CONDA_SUBDIR=osx-64 conda env create -n qiime2-2023.5 --file qiime2-2023.5-py38-osx-conda.yml
conda activate qiime2-2023.5
conda config --env --set subdir osx-64

Then continue with the other Conda- or pip-installable dependencies listed under Option 1.

Option 2: Docker container

To run Tourmaline inside a Docker container:

  1. Install Docker Desktop (Mac, Windows, or Linux) from Docker.com.
  2. Open the Docker app.
  3. Increase the memory to 8 GB or more (Preferences -> Resources -> Advanced -> Memory).
  4. Download the Docker image from DockerHub (command below).
  5. Run the Docker image (command below).

docker pull aomlomics/tourmaline
docker run -v $HOME:/data -it aomlomics/tourmaline

If installing on a Mac with an Apple Silicon chip (M1/M2), run the Docker image with the --platform linux/amd64 option. The image will take a few minutes to load the first time it is run.

docker run --platform linux/amd64 -v $HOME:/data -it aomlomics/tourmaline

The -v (volume) flag above allows you to mount a local file system volume (in this case your home directory) to read/write from your container. Note that symbolic links in a mounted volume will not work.
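
The choice between the two docker run commands above can be made from the machine architecture. The sketch below only prints the command rather than running it. The function name `docker_run_cmd` is hypothetical, and it assumes that `uname -m` reports `arm64` on Apple Silicon Macs.

```shell
# Print (not execute) a docker run command suited to the given CPU architecture.
# arm64 is treated as Apple Silicon and gets the --platform linux/amd64 option.
docker_run_cmd() {
  arch=$1        # e.g. the output of `uname -m`
  mount_dir=$2   # local directory to mount at /data inside the container
  case "$arch" in
    arm64) platform="--platform linux/amd64 " ;;
    *)     platform="" ;;
  esac
  printf 'docker run %s-v %s:/data -it aomlomics/tourmaline\n' "$platform" "$mount_dir"
}

# Show the command for this machine, mounting the home directory as above.
docker_run_cmd "$(uname -m)" "$HOME"
```

Printing the command first makes it easy to review the mounted directory before starting the container.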

Use mounted volumes to:

See the Install page for more details on installing and running Docker.

Setup

If this is your first time running Tourmaline, you'll need to set up your directory. Simplified instructions are below, but see the Wiki's Setup page for complete instructions.

Start by cloning the Tourmaline directory and files:

git clone https://github.com/aomlomics/tourmaline.git

If using the Docker container, it's recommended you run the above command from inside /data.

Setup for the test data

The test data (16 samples of paired-end 16S rRNA data with 1000 sequences per sample) comes with your cloned copy of Tourmaline. It's fast to run and will verify that you can run the workflow.

Download the reference database sequence and taxonomy files into 01-imported and link them as refseqs.qza and reftax.qza (QIIME 2 archives):

cd tourmaline/01-imported
wget https://data.qiime2.org/2023.5/common/silva-138-99-seqs-515-806.qza
wget https://data.qiime2.org/2023.5/common/silva-138-99-tax-515-806.qza
ln -s silva-138-99-seqs-515-806.qza refseqs.qza
ln -s silva-138-99-tax-515-806.qza reftax.qza
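
Because the workflow reads the reference files through symbolic links, it can be worth confirming that both links resolve before running anything. This is a minimal sketch with a hypothetical function name, `check_refs`; it only tests that the files exist and does not validate their contents.

```shell
# Return 0 if both reference archives exist in the given directory.
# [ -e ] follows symbolic links, so a dangling link counts as missing.
check_refs() {
  dir=$1
  status=0
  for name in refseqs.qza reftax.qza; do
    if [ ! -e "$dir/$name" ]; then
      echo "missing or broken link: $dir/$name" >&2
      status=1
    fi
  done
  return "$status"
}

# Run from the 01-imported directory, as in the commands above.
check_refs . || echo "Reference files are not in place yet."
```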

Edit the FASTQ manifests manifest_se.csv and manifest_pe.csv in 00-data so that the file paths match the location of your tourmaline directory. In the commands below, replace /path/to with that location. Skip this step if you are using the Docker container and cloned tourmaline into /data:

cd ../00-data
sed 's|/data/tourmaline|/path/to/tourmaline|' manifest_pe.csv > temp && mv temp manifest_pe.csv
grep -v "reverse" manifest_pe.csv > manifest_se.csv
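
After rewriting the paths, you may want to confirm that every file listed in a manifest actually exists. The sketch below is one way to do that. The function name `missing_manifest_paths` is hypothetical, and it assumes a comma-separated manifest whose header row contains a column named absolute-filepath.

```shell
# List manifest paths that do not exist on disk, one per line.
# Assumes a comma-separated manifest with an "absolute-filepath" header column.
missing_manifest_paths() {
  manifest=$1
  awk -F, '
    # Header row: find the index of the absolute-filepath column.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "absolute-filepath") col = i; next }
    # Data rows: print the path if the column was found and is not empty.
    col && $col != "" { print $col }
  ' "$manifest" | while IFS= read -r path; do
    [ -e "$path" ] || printf '%s\n' "$path"
  done
}

# Run from the 00-data directory; no output means every listed file was found.
if [ -f manifest_pe.csv ]; then
  missing_manifest_paths manifest_pe.csv
fi
```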

Go to Run Snakemake.

Setup for your data

Before setting up to run your own data, please note:

Now edit, replace, or store the required input files as described here:

  1. Edit or replace the metadata file 00-data/metadata.tsv. The first column header should be "sample_name", with sample names matching the FASTQ manifest(s), and additional columns containing any relevant metadata for your samples. You can use a spreadsheet editor like Microsoft Excel or LibreOffice, but make sure to export the output in tab-delimited text format.
  2. Prepare FASTQ data:
    • Option 1: Edit or replace the FASTQ manifests 00-data/manifest_pe.csv (paired-end) and/or 00-data/manifest_se.csv (single-end). Ensure that (1) file paths in the column "absolute-filepath" point to your .fastq.gz files (they can be anywhere on your computer) and (2) sample names match the metadata file. You can use a text editor such as Sublime Text, nano, vim, etc.
    • Option 2: Store your pre-imported FASTQ .qza files as 01-imported/fastq_pe.qza (paired-end) and/or 01-imported/fastq_se.qza (single-end).
  3. Prepare reference database:
    • Option 1: Store the reference FASTA and taxonomy files as 00-data/refseqs.fna and 00-data/reftax.tsv.
    • Option 2: Store the pre-imported reference FASTA and taxonomy .qza files as 01-imported/refseqs.qza and 01-imported/reftax.qza.
  4. Edit the configuration file config.yaml to set DADA2 and/or Deblur parameters (sequence truncation/trimming, sample pooling, chimera removal, etc.), rarefaction depth, taxonomic classification method, and other parameters. This YAML (YAML Ain't Markup Language) file is a regular text file that can be edited in Sublime Text, nano, vim, etc.
  5. Go to Run Snakemake.
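
Steps 1 and 2 require the metadata header and the sample names to line up with the manifest, which is easy to get wrong by hand. The following sketch checks both. The function name `unmatched_samples` is hypothetical, and it assumes a tab-delimited metadata file plus a comma-separated manifest whose first column holds the sample names.

```shell
# Print sample names that appear in the manifest but not in the metadata file.
# Returns 2 if the first metadata column header is not "sample_name".
unmatched_samples() {
  metadata=$1
  manifest=$2
  header=$(head -n 1 "$metadata" | cut -f 1)
  if [ "$header" != "sample_name" ]; then
    echo "first metadata column is '$header', expected 'sample_name'" >&2
    return 2
  fi
  # First file (tab-delimited metadata): remember every sample name.
  # Second file (comma-separated manifest): report each unknown name once.
  awk '
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }
    FNR > 1 && !($1 in seen) && !reported[$1]++ { print $1 }
  ' FS='\t' "$metadata" FS=',' "$manifest"
}

# Run from the tourmaline directory; no output means all names matched.
if [ -f 00-data/metadata.tsv ] && [ -f 00-data/manifest_pe.csv ]; then
  unmatched_samples 00-data/metadata.tsv 00-data/manifest_pe.csv
fi
```

The check runs in one direction only: samples present in the metadata but absent from the manifest are not reported.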

Run Snakemake

Tourmaline is now run within the snakemake conda environment, not the qiime2-2023.5 environment.

conda activate snakemake

Shown here is the DADA2 paired-end workflow. See the Wiki's Run page for complete instructions on all steps, denoising methods, and filtering modes.

Note that any of the commands below can be run with various options, including --printshellcmds to see the shell commands being executed and --dryrun to display which rules would be run but not execute them. To generate a graph of the rules that will be run from any Snakemake command, see the section "Directed acyclic graph (DAG)" on the Run page. Always include the --use-conda option.

From the tourmaline directory (which you may rename), run Snakemake with the denoise rule as the target, changing the number of cores to match your machine:

snakemake --use-conda dada2_pe_denoise --cores 4

Pausing after the denoise step allows you to make changes before proceeding:

Unfiltered mode

Continue the workflow without filtering (for now). If you are satisfied with your parameters and files, run the taxonomy rule (for unfiltered data):

snakemake --use-conda dada2_pe_taxonomy_unfiltered --cores 4

Next, run the diversity rule (for unfiltered data):

snakemake --use-conda dada2_pe_diversity_unfiltered --cores 4

Finally, run the report rule (for unfiltered data):

snakemake --use-conda dada2_pe_report_unfiltered --cores 4

Filtered mode

After viewing the unfiltered results (the taxonomy summary and taxa barplot, the representative sequence summary plot and table, or the list of unassigned and potential outlier representative sequences), you may wish to filter (remove) certain taxonomic groups or representative sequences. If so, first check the following parameters and/or files:

You are now ready to filter the representative sequences and feature table, generate new summaries, and generate a new taxonomy bar plot by running the taxonomy rule (for filtered data):

snakemake --use-conda dada2_pe_taxonomy_filtered --cores 4

Next, run the diversity rule (for filtered data):

snakemake --use-conda dada2_pe_diversity_filtered --cores 4

Finally, run the report rule (for filtered data):

snakemake --use-conda dada2_pe_report_filtered --cores 1
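
The step-by-step commands above follow one naming pattern, so they can be wrapped in a small helper. This is a sketch under stated assumptions rather than part of Tourmaline: the function name `run_dada2_pe` and the `SNAKEMAKE` variable are hypothetical, and by default the helper only prints the commands. Unlike the manual sequence, it does not pause after the denoise step.

```shell
# Hypothetical helper: run the four DADA2 paired-end steps for one mode, in order.
# SNAKEMAKE defaults to "echo snakemake", so this sketch only prints the commands;
# set SNAKEMAKE=snakemake to execute them. Stops at the first step that fails.
run_dada2_pe() {
  mode=$1      # unfiltered or filtered
  cores=$2     # number of cores to pass to each step
  shift 2      # remaining arguments (e.g. --dryrun) are passed through unchanged
  for target in dada2_pe_denoise \
                "dada2_pe_taxonomy_$mode" \
                "dada2_pe_diversity_$mode" \
                "dada2_pe_report_$mode"; do
    ${SNAKEMAKE:-echo snakemake} --use-conda "$target" --cores "$cores" "$@" || return 1
  done
}

# Print the unfiltered sequence with the --dryrun option added to each step.
run_dada2_pe unfiltered 4 --dryrun
```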

View output

View report and output files

Open your HTML report (e.g., 03-reports/report_dada2-pe_unfiltered.html) in Chrome or Firefox. To view the linked files:

Downloaded files can be deleted after viewing because they are already stored in your Tourmaline directory.

More tips

Troubleshooting

Power tips

Alternatives

Some alternative pipelines for amplicon sequence analysis include the following: