http://huttenhower.sph.harvard.edu/biobakery_workflows

bioBakery Workflows

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes. Workflows are run directly from the command line and tasks can be imported to create your own custom workflows. The workflows and tasks are built with AnADAMA2 which allows for parallel task execution locally and in a grid compute environment.

For additional information, see the bioBakery workflows tutorial.

Table of contents


Getting Started


Requirements

  1. AnADAMA2 (installed automatically)
  2. Python v2.7+
  3. See individual workflows and tasks for additional software requirements.

Installation

Install software

bioBakery workflows can be installed with Conda, Docker, or pip.

To install with Conda:

$ conda install -c biobakery biobakery_workflows

To install and run with Docker:

$ docker run -it biobakery/workflows bash

To install with pip:

$ pip install biobakery_workflows

Install databases

Install automatically

Once the software and dependencies are installed, the databases can be installed automatically.

Run the following command to install the databases required for a workflow:

$ biobakery_workflows_databases --install $WORKFLOW

Install manually

Alternatively, the databases can be installed manually and then referenced with environment variables. The shotgun data processing workflows require KneadData (human genome, human transcriptome, and SILVA), HUMAnN (utility mapping, nucleotide, and protein), and StrainPhlAn (reference and marker) databases, while the 16s data processing workflow requires the GreenGenes fasta, taxonomy, and usearch-formatted files.

When manually installing the databases, the following environment variables need to be set.
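As a sketch, a manual install might be referenced like this. The variable names below follow bioBakery conventions but are assumptions here, and the paths are placeholders; confirm both against the workflow documentation before relying on them:

```shell
# Sketch only: point the workflows at manually installed databases.
# Variable names are assumed from bioBakery conventions; paths are placeholders.
export KNEADDATA_DB_HUMAN_GENOME=/opt/databases/kneaddata/human_genome
export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/opt/databases/kneaddata/human_transcriptome
export KNEADDATA_DB_RIBOSOMAL_RNA=/opt/databases/kneaddata/silva
export STRAINPHLAN_DB_REFERENCE=/opt/databases/strainphlan/reference
export STRAINPHLAN_DB_MARKERS=/opt/databases/strainphlan/markers
```

Adding these lines to your shell profile makes the databases visible in new sessions.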

How to Run

Basic Usage

All workflows follow the general command format:

$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT

For a list of all available workflows, run:

$ biobakery_workflows --help

For specific options for a workflow, run:

$ biobakery_workflows $WORKFLOW --help

Data Processing Workflows

The basic command to run a data processing workflow, replacing $WORKFLOW with the workflow name, is:

$ biobakery_workflows $WORKFLOW --input $INPUT_DIR --output $DATA_OUTPUT_DIR

This command will run the workflow on the files in the input folder (replace $INPUT_DIR with the path to the folder containing fastq files) and write results to the output folder (replace $DATA_OUTPUT_DIR with the path to the folder for output files).

Visualization Workflow

A single visualization workflow exists that can be used for any data processing workflow. The basic command to run a visualization workflow, replacing $WORKFLOW_VIS with the visualization workflow name, is:

$ biobakery_workflows $WORKFLOW_VIS --input $DATA_OUTPUT_DIR --project-name $PROJECT --output $OUTPUT_DIR

The input folder in this command (replace $DATA_OUTPUT_DIR with the path to the folder) is a subset of the output folder from the data processing workflow; run the workflow with the option --help to determine which files are required and which are optional. The output folder (replace $OUTPUT_DIR with the path to the output folder) will contain the files generated by the visualization workflow. Replace $PROJECT with the project name so the report can include it.

Parallelization Options

When running any workflow, you can add command line options to make use of existing computing resources, for example to run tasks in parallel locally or on a grid.

For additional workflow options, see the AnADAMA2 user manual.
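For example, the AnADAMA2 options --local-jobs and --grid-jobs control local and grid parallelism; confirm the exact option names for your installed version with the workflow's --help output. The sketch below only assembles the command strings rather than running them:

```shell
# Sketch: parallelization flags appended to a workflow command.
# --local-jobs / --grid-jobs are AnADAMA2 options; verify with --help.
WORKFLOW=wmgx
LOCAL_CMD="biobakery_workflows $WORKFLOW --input input --output output --local-jobs 4"
GRID_CMD="biobakery_workflows $WORKFLOW --input input --output output --grid-jobs 8"
echo "$LOCAL_CMD"    # run locally with up to 4 parallel tasks
echo "$GRID_CMD"     # submit up to 8 tasks to the grid scheduler
```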


Data Processing Workflows


bioBakery workflows includes a collection of workflows for shotgun sequences and 16s data processing. Most workflows can be run on the command line with the following syntax:

$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT

See the section on parallelization options to optimize the workflow run based on your computing resources.

Whole Metagenome Shotgun (wmgx)

Super Tasks

  1. Quality control
  2. Taxonomic profiling
  3. Functional profiling
  4. Strain profiling
  5. Assembly (not run by default)

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. MetaPhlAn
    1. Install with: $ conda install -c bioconda metaphlan
  3. HUMAnN
    1. Install with: $ conda install -c biobakery humann OR $ pip install humann
  4. StrainPhlAn
    1. Install with: $ conda install -c bioconda strainphlan
  5. Prokka (Only required if running in assembly mode)
  6. MegaHit (Only required if running in assembly mode)
  7. seqtk (Only required if running in assembly mode)

Inputs

  1. A set of fastq (or fastq.gz) files (single-end or paired-end). The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing ".R1" and ".R2" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.
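The pairing logic above can be sketched with plain shell string substitution (the file name is a placeholder):

```shell
# Sketch: the ".R1"/".R2" pair identifiers relate the two files of a pair.
r1=SAMPLE1.R1.fastq.gz
r2=$(printf '%s' "$r1" | sed 's/\.R1\./.R2./')  # swap the pair identifier
echo "$r2"  # SAMPLE1.R2.fastq.gz
```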

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.

To run the workflow
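A minimal sketch of a run command, using the generic syntax from above with the wmgx workflow name and placeholder paths (the echo stands in for running it in an environment where the requirements are installed):

```shell
# Sketch: a wmgx run with placeholder input/output paths.
INPUT_DIR=input
OUTPUT_DIR=output
CMD="biobakery_workflows wmgx --input $INPUT_DIR --output $OUTPUT_DIR"
echo "$CMD"
```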

To run a demo

Whole Metagenome and Metatranscriptome Shotgun (wmgx_wmtx)

Super Tasks

  1. Quality control
  2. Taxonomic profiling
  3. Functional profiling
  4. Strain profiling

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. MetaPhlAn
    1. Install with: $ conda install -c bioconda metaphlan
  3. HUMAnN
    1. Install with: $ conda install -c biobakery humann OR $ pip install humann
  4. StrainPhlAn
    1. Install with: $ conda install -c bioconda strainphlan

Inputs

  1. Two sets of fastq (or fastq.gz) files (single-end or paired-end). One set is whole metagenome shotgun data and the other is whole metatranscriptome shotgun data. The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
  2. Optionally, provide a mapping file. This file will have two columns and be tab delimited. The first column is the sample names for the metatranscriptomes and the second is the corresponding metagenome sample. See the demo mapping file for an example.
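A minimal mapping file matching the description above might look like this (sample names are placeholders; check the demo mapping file for the authoritative format):

```shell
# Sketch: two-column, tab-delimited mapping file.
# Column 1: metatranscriptome sample; column 2: corresponding metagenome sample.
printf 'mtx_sample1\tmgx_sample1\n'  > mapping.tsv
printf 'mtx_sample2\tmgx_sample2\n' >> mapping.tsv
cat mapping.tsv
```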

The workflow will detect if paired-end files are present. By default the workflow identifies paired end reads based on file names containing ".R1" and ".R2" strings. If your paired end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.

To run the workflow

To run a demo

16S rRNA (16s)

The 16s workflow has two methods that can be used: UPARSE (with either USEARCH or VSEARCH (default)) and DADA2. All methods perform quality control and generate taxonomic tables.

Workflow diagrams

Super Tasks

  1. Demultiplex (only for raw reads)
  2. Merge Samples and Rename
  3. Quality Control (Learn Error Rates for DADA2 method)
  4. Taxonomic Profiling
  5. Functional Profiling (with PICRUSt v1 or v2; v2 requires python3)

Requirements

  1. Vsearch
  2. Usearch (v9.0.2132) (Not required for DADA2 or VSEARCH method)
  3. PICRUSt v1.1 (or v2)
    1. Install with: $ conda install -c bioconda picrust
    2. OR Install with: $ conda install -c bioconda picrust2
  4. BIOM v2 (Not required for DADA2 method)
    1. Install with: $ pip install biom-format
  5. Clustal Omega
    1. Install with: $ conda install -c bioconda clustal-omega
  6. EA-Utils (Only required for raw reads)
    1. Install with: $ conda install -c bioconda ea-utils
  7. FastTree
    1. Install with: $ conda install -c bioconda fasttree
  8. R and the following packages: dada2, gridExtra, tools, ggplot2, seqinr (Only required for DADA2 method)
    1. Follow the DADA2 install directions
  9. FIGARO (Only required if using FIGARO to estimate truncation lengths for DADA2)

Inputs

  1. A set of fastq (or fastq.gz) files (single-end or paired-end; DADA2 requires paired-end), raw or demultiplexed. If the files are demultiplexed, they are expected to be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
  2. A barcode file (only required for raw reads).

The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired-end reads have different identifiers, use the option --pair-identifier with the identifier string for the first file in each pair (for example, --pair-identifier .R1).

The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.

To run the workflow

Isolate Assembly (isolate_assembly)

This workflow will assemble and annotate sequenced microbial isolate genomes. It runs the raw sequences through quality control, assembly (with SPAdes), annotation (with Prokka), functional annotation, quality assessment, and then creates a final annotated contig file.

Workflow diagram

Super Tasks

  1. Quality control
  2. Assembly
  3. Annotation
  4. Functional annotation
  5. Quality assessment

Requirements

  1. KneadData (v0.12.0+)
    1. Install with: $ conda install -c biobakery kneaddata OR $ pip install kneaddata
  2. SPAdes
    1. Install with: $ conda install -c bioconda spades
  3. Prokka
    1. Install with: $ conda install -c bioconda prokka
  4. Quast
    1. Install with: $ conda install -c bioconda quast
  5. CheckM (plus databases)
  6. run_DBcan (plus databases)
  7. Emapper (version 2+) (databases installed with the bioBakery utility biobakery_workflows_databases)

Inputs

  1. A set of raw fastq (or fastq.gz) files (single-end or paired-end). This workflow does not allow multiplexed files as input. The files are expected to be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.

The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired-end reads have different identifiers, use the option --pair-identifier with the identifier string for the first file in each pair (for example, --pair-identifier .R1).

To run the workflow

Visualization Workflows

bioBakery workflows includes a single universal visualization workflow for shotgun sequences and 16s data. The workflow can be run on the command line with the following syntax:

$ biobakery_workflows vis --input $INPUT --project-name $PROJECT --output $OUTPUT

A subset of the files from the $OUTPUT folder of a data processing workflow is used as the $INPUT folder for the visualization workflow. For detailed information on the input files required for the visualization workflow, see the help message for the workflow by running the command:

$ biobakery_workflows vis --help

Visualization for Whole Metagenome Shotgun and 16S (vis)

This workflow generates a document of tables, bar plots, a PCoA plot, scatter plots, and heatmaps using the output of the wmgx or 16S workflows as input.

Requirements

  1. Pweave (installed automatically)
  2. NumPy and SciPy
    1. Install with: $ pip install numpy AND $ pip install scipy
  3. Matplotlib
    1. Install with: $ pip install matplotlib
  4. LaTeX
  5. Pandoc (version < 2 required)
    1. Install with: $ conda install pandoc
  6. Hclust2
    1. Install with: $ conda install -c biobakery hclust2
  7. R with the vegan package

Inputs

  1. An input folder containing the final products from the wmgx or 16S data workflow.
    1. OPTIONAL: A file of the KneadData read counts for the wmgx samples (single or paired end).
    2. A file of the merged taxonomic profile or closed reference OTU table or ASV table.
    3. OPTIONAL: A file of the merged pathway abundances (normalized).
    4. OPTIONAL: A file of the HUMAnN alignment counts.
    5. OPTIONAL: A file of the HUMAnN feature counts.
    6. OPTIONAL: A file of the read counts per sample (including total reads, classified, and unclassified).
    7. OPTIONAL: A file of the eestats for all samples.
    8. The log file from the corresponding data processing workflow.
  2. The project name (Optional).
  3. Introduction text (Optional).
  4. The report format (Options: pdf/html, pdf is default).

Outputs

  1. A pdf (or html) report.
  2. A directory of figures included in the report.
    1. Quality control section
      1. Paired end read count table and barchart
      2. Orphan read count table and barchart
      3. Microbial read proportion table
    2. Taxonomy section
      1. Species count table
      2. Ordination (PCoA)
      3. Heatmap of top species abundances
      4. Stacked barplot of top species abundances
    3. Functional profiling section
      1. Pathway abundance
        1. Top pathways by abundance (heatmap and table)
        2. Top pathways by variance (heatmap and table)
    4. Feature counts section
      1. Scatter plot of the aligned reads (nucleotide/translated)
      2. Scatter plot of the gene families counts (nucleotide/translated)
      3. Scatter plot of the ECs counts (nucleotide/translated)
      4. Scatter plot of the pathway counts (nucleotide/translated)
  3. A directory of read count tables
  4. A zip archive containing the pdf (or html) report, figures, and data files

To run the workflow

Stats workflow

The stats workflow takes as input feature tables generated from the wmgx or 16s workflows. It can also be used with any tab-delimited feature table.

$ biobakery_workflows stats --input $INPUT --input-metadata $INPUT_METADATA --output $OUTPUT --project-name $PROJECT

Requirements

  1. MaAsLin2
  2. HUMAnN (for pathways barcharts stratified by species)
  3. HAllA
  4. R plus vegan, ggplot2, optparse, gridExtra, permute, reshape2, RColorBrewer, cowplot, plyr, ade4, viridis

Inputs

The workflow requires four arguments and allows for eighteen optional arguments.

Workflow arguments can be provided on the command line or with an optional config file using the AnADAMA2 built-in option "--config".

Required

  1. INPUT: The folder containing the input data files.

    • Can be txt, tsv, or biom (compressed files are also accepted, like PICRUSt2 outputs)
    • Looks for taxonomic profile, pathway abundance, ec abundance, etc.
    • Can identify filetype based on contents.
    • Works with wmgx and 16s input files.
    • Data file can have the samples as rows or columns.
  2. INPUT_METADATA: The metadata file for input.

    • File can have the samples as rows or columns.
  3. OUTPUT: The folder to write the output files. 

  4. PROJECT: The name of the project (string for the report header).

Optional

  1. Transform: Used with MaAsLin2.
  2. Longitudinal: Set if the study is longitudinal (default is univariate/multivariate).
  3. Fixed effects: Used to generate the model (MaAsLin2 and beta diversity).
  4. Multivariate fixed effects: Used to generate the model (MaAsLin2 and beta diversity). These variables are listed first in the beta diversity covariate equation.
  5. Random effects: Used to generate the model (MaAsLin2).
  6. Bypass Maaslin: Skip running MaAsLin2 on all input files.
  7. Permutations: Total number of permutations to apply with the permanova (default=4999). (Only for longitudinal studies)
  8. Static covariates: Metadata variable names. (Only for longitudinal studies)
  9. Scale: The scale to apply to the permanova. (Only for longitudinal studies)
  10. Min abundance: Min filtering value (default 0.0001)
  11. Min prevalence: Min filtering value (default 0.1)
  12. Max missing: Max percentage of values for a metadata variable that are missing (default 20%)
  13. Format: Format for the report (html or pdf)
  14. Top pathways: Max number of pathways to display for stratified abundance plots.
  15. Metadata categorical: Names of metadata variables that are categorical. (Only for stratified pathways plots)
  16. Metadata continuous: Names of the metadata variables that are continuous. (Only for stratified pathways plots)
  17. Metadata exclude: Exclude these variables from the metadata. (Only for stratified pathways plots)
  18. Introduction text: The text written to the beginning of the report.

Workflow steps

  1. Identify data file types.

    • Reads through the start of all tsv/txt and biom files in the input folder to find taxonomic profiles, pathway abundances, ec abundances, etc. (files of unknown type are referenced in the report by name).
    • Looks for files of those types for both wmgx and 16s.
    • Requires at least one taxonomic profile.
  2. Determine the study type based on the input data files (i.e. wmgx or 16s).

  3. If biom files are provided, convert biom files to tsv.

  4. Check for sample names in feature tables that are not included in the metadata file. An error is thrown requesting that the user add the missing sample names to the metadata file.

  5. Create feature tables for all input files. These files are compatible with all downstream processing tasks (e.g. maaslin2, humann_barplots).

  6. Run Mantel tests comparing all input data files.

  7. If pathway abundance files are provided, generate stratified pathways plots.

    • For the top N pathways (based on significant results from maaslin2), generate plot for each categorical metadata variable.
  8. If longitudinal, run the permanova script.

  9. If not longitudinal, run the beta diversity script to generate stacked barplots of the R-squared and p-values from adonis run on each metadata variable and again on all variables.

  10. Run MaAsLin2 on all input files.

  11. Run HAllA on all input files (minus gene families due to size).

  12. Create a report with figures.
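Step 4's sample-name check can be sketched in shell. This sketch assumes samples are columns of the feature table (header row) and rows of the metadata (first column); real files may be oriented either way:

```shell
# Sketch: find feature-table samples that are missing from the metadata.
printf 'taxon\tS1\tS2\tS3\n' > features.tsv
printf 'sample\tage\nS1\t30\nS2\t41\n' > metadata.tsv
head -n 1 features.tsv | cut -f2- | tr '\t' '\n' | sort > table_samples.txt
tail -n +2 metadata.tsv | cut -f1 | sort > metadata_samples.txt
comm -23 table_samples.txt metadata_samples.txt  # prints S3 (no metadata)
```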

Run a demo

Using the HMP2 (IBDMDB) merged data files provided by the project, run the stats workflow to generate a report.

$ biobakery_workflows stats --input HMP2_data/ --input-metadata HMP2_metadata.tsv --fixed-effects="diagnosis,dysbiosisnonIBD,dysbiosisUC,dysbiosisCD,antibiotics,age" --random-effects="site,subject" --project-name HMP2 --output HMP2_stats_output --longitudinal  --static-covariates="age" --permutations 10 --maaslin-options="reference='diagnosis,nonIBD'"

The files in the input folder are the taxonomic profile, pathway abundance, and ec abundance. Fixed and random effect variables are specified for the MaAsLin2 runs. The metadata type selected is longitudinal and the static covariate in the study metadata is "age". The reduced number of permutations shortens the runtime for the three permanova calculations. Since diagnosis is a variable with more than two levels, the reference level "nonIBD" is provided to MaAsLin2 for the model and the resulting box plots.

Outputs include a folder for each MaAsLin2 run plus figures folders and a report.

WDL workflow

An MTX workflow based on the bioBakery AnADAMA2 wmgx workflows.

This workflow is currently installed in the Terra workspace: https://app.terra.bio/#workspaces/rjxmicrobiome/mtx_workflow .

The WDL is located in this repository at: biobakery_workflows/workflows/wtx.wdl .

Inputs

The workflow has eleven required inputs and nine optional inputs.

Required inputs: The workflow requires eleven inputs for each run. Five inputs can be modified for each project, whereas the other six would only be modified with software version changes.

To generate a file to use as input for InputRead1Files, follow the Terra instructions at https://support.terra.bio/hc/en-us/articles/360033353952-Creating-a-list-file-of-reads-for-input-to-a-workflow , adding the InputRead1Identifier and the InputExtension to command #2. For example, with InputRead1Identifier = ".R1" and InputExtension = ".fastq.gz", command #2 becomes gsutil ls gs://your_data_Google_bucket_id/ | grep ".fastq.gz" | grep ".R1" > ubams.list . Also, since this workflow looks for fastq or fastq.gz input files, you might change the name of the file list in this command from "ubams.list" to "fastq_list.txt".
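The same filtering can be tried locally, with ls standing in for gsutil; the grep filters are the ones described above, and the file names are placeholders:

```shell
# Local sketch of Terra command #2: collect the R1 fastq.gz file list.
mkdir -p bucket
touch bucket/A.R1.fastq.gz bucket/A.R2.fastq.gz bucket/B.R1.fastq.gz
ls bucket | grep ".fastq.gz" | grep ".R1" > fastq_list.txt
cat fastq_list.txt  # A.R1.fastq.gz and B.R1.fastq.gz
```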

These six required inputs would only be modified if the versions of KneadData and HUMAnN v2 change, since these databases are specifically tied to the software versions.

Optional inputs: There are an additional ten optional inputs for each workflow run. These are not required; if not set, the default values will be used.

There are three additional optional inputs that can be used to run with one or more custom databases.

Outputs

The workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. Each task has its own folder in the Google bucket output folder, with a sub-folder for each time it is run. The outputs of interest, including their respective folders, are described below. $SAMPLE_NAME is the name of the sample included in the original raw files. For example, SAMPLE1.R1.fastq.gz would have a sample name of "SAMPLE1".
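The sample-name convention above can be sketched with POSIX parameter expansion (the file name is a placeholder):

```shell
# Sketch: derive the sample name from a raw file name.
f=SAMPLE1.R1.fastq.gz
sample=${f%%.*}  # strip everything from the first "." onward
echo "$sample"   # SAMPLE1
```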

Run a demo

A demo data set is included in the Terra workspace. The demo set includes six paired samples (three MTX and three MGX) from IBDMDB plus a small metadata file. Using preemptive instances, this demo set will cost about $5 to run.

IBDMDB (6 sample) demo run configuration:

Required software specific databases:

Optional custom databases (to run with one or more custom databases instead of the default references used in QC)

Refer to the section above for descriptions of the output files generated by running the workflow.

Example output files from running the IBDMDB data set with metadata can be found in this workspace in the folder IBDMDB/final_outputs/ibdmdb_demo_visualizations.zip.

Contributions

Thanks go to these wonderful people: