bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes. Workflows are run directly from the command line and tasks can be imported to create your own custom workflows. The workflows and tasks are built with AnADAMA2 which allows for parallel task execution locally and in a grid compute environment.
For additional information, see the bioBakery workflows tutorial.
bioBakery workflows can be installed with Conda, Docker, or pip.
To install with Conda:
$ conda install -c biobakery biobakery_workflows
To install and run with Docker:
$ docker run -it biobakery/workflows bash
To install with pip:
$ pip install biobakery_workflows
Install automatically
Once the software and dependencies are installed, the databases can be installed automatically.
Run the following command to install the databases required for a workflow:
$ biobakery_workflows_databases --install $WORKFLOW
Replace $WORKFLOW with the workflow name (ie wmgx, 16s, wmgx_wmtx, wmgx_demo, or isolate_assembly).
The databases will be installed at $HOME/biobakery_workflow_databases/ or /opt/biobakery_workflow_databases/ depending on permissions. To install the databases to a different folder, add the option --location $FOLDER. With this option you will also need to set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the folder so the workflows can find the installed databases.
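For example, to install the wmgx databases to a custom folder (the path here is illustrative):
$ biobakery_workflows_databases --install wmgx --location /data/biobakery_databases
$ export BIOBAKERY_WORKFLOWS_DATABASES=/data/biobakery_databases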
Install manually
Alternatively, the databases can be installed manually and then referenced with environment variables. The shotgun data processing workflows require KneadData (human, human transcriptome, and SILVA), HUMAnN (utility mapping, nucleotide, and protein databases), and StrainPhlAn (reference and marker) databases, while the 16s data processing workflow requires the GreenGenes fasta, taxonomy, and usearch formatted files.
When manually installing the databases, the following environment variables need to be set.
For the shotgun data processing workflows, set KNEADDATA_DB_HUMAN_GENOME, KNEADDATA_DB_RIBOSOMAL_RNA, KNEADDATA_DB_HUMAN_TRANSCRIPTOME, STRAINPHLAN_DB_REFERENCE, and STRAINPHLAN_DB_MARKERS. For the 16s data processing workflow, set GREEN_GENES_USEARCH_DB, GREEN_GENES_FASTA_DB, and GREEN_GENES_TAXONOMY_DB.
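For example, a minimal sketch for the shotgun workflows, assuming the databases were downloaded to folders under /data/databases (the paths are illustrative):
$ export KNEADDATA_DB_HUMAN_GENOME=/data/databases/kneaddata_human_genome
$ export KNEADDATA_DB_RIBOSOMAL_RNA=/data/databases/kneaddata_silva
$ export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/data/databases/kneaddata_human_transcriptome
$ export STRAINPHLAN_DB_REFERENCE=/data/databases/strainphlan_reference
$ export STRAINPHLAN_DB_MARKERS=/data/databases/strainphlan_markers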
All workflows follow the general command format:
$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT
For a list of all available workflows, run:
$ biobakery_workflows --help
For specific options for a workflow, run:
$ biobakery_workflows $WORKFLOW --help
The basic command to run a data processing workflow, replacing
$WORKFLOW
with the workflow name, is:
$ biobakery_workflows $WORKFLOW --input $INPUT_DIR --output $DATA_OUTPUT_DIR
This command will run the workflow on the files in the input folder ($INPUT_DIR, to be replaced with the path to the folder containing fastq files). It will write files to the output folder ($DATA_OUTPUT_DIR, to be replaced with the folder to write output files).
A single visualization workflow exists that can be used for any data processing
workflow. The basic command to run a visualization workflow, replacing
$WORKFLOW_VIS
with the visualization workflow name, is:
$ biobakery_workflows $WORKFLOW_VIS --input $DATA_OUTPUT_DIR --project-name $PROJECT --output $OUTPUT_DIR
The input folder ($DATA_OUTPUT_DIR, to be replaced with the path to the folder) in this command is a subset of the output folder from the data processing workflow; run the workflow with the option --help to determine which files are required and which are optional. The folder ($OUTPUT_DIR, to be replaced with the path to the output folder) will contain the output files from the visualization workflow. The project name should replace $PROJECT in the command so the report can include the name.
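For example, a data processing run followed by its visualization (folder and project names are illustrative; the vis input is typically a subset of the data processing output, see --help):
$ biobakery_workflows wmgx --input input_fastq/ --output wmgx_output/
$ biobakery_workflows vis --input wmgx_output/ --project-name MyStudy --output wmgx_vis/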
When running any workflow you can add the following command line options to make use of existing computing resources:
--local-jobs <1> : Run multiple tasks locally in parallel. Provide the max number of tasks to run at once. The default is one task running at a time.
--grid-jobs <0> : Run multiple tasks on a grid in parallel. Provide the max number of grid jobs to run at once. The default is zero tasks submitted to a grid, resulting in all tasks running locally.
--grid <slurm> : Set the grid available on your machine. This will default to the grid found on the machine, with options of slurm and sge.
--partition <serial_requeue> : Jobs will be submitted to the partition selected. The default partition selected is based on the default grid.
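For example, to run up to four tasks locally in parallel while submitting up to ten jobs to a SLURM grid (the values are illustrative):
$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --local-jobs 4 --grid-jobs 10 --grid slurm --partition serial_requeue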
For additional workflow options, see the AnADAMA2 user manual.
bioBakery workflows includes a collection of workflows for shotgun sequences and 16s data processing. Most workflows can be run on the command line with the following syntax:
$ biobakery_workflows $WORKFLOW --input $INPUT --output $OUTPUT
See the section on parallelization options to optimize the workflow run based on your computing resources.
Super Tasks
Requirements
$ conda install -c biobakery kneaddata
OR
$ pip install kneaddata
$ conda install -c bioconda metaphlan
$ conda install -c biobakery humann
OR
$ pip install humann
$ conda install -c bioconda strainphlan
Inputs
Input files should be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz, where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing ".R1" and ".R2" strings. If your paired-end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.
The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.
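For example, if your paired files were named like sample1_1.fastq and sample1_2.fastq (illustrative names), you could run:
$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --pair-identifier _1 --input-extension fastq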
To run the workflow
$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT
Replace $INPUT with the path to the folder containing your fastq input files and $OUTPUT with the path to the folder to write output files.
The option --qc-options="$OPTIONS" will modify the default settings when running the KneadData subtask and --strain-profiling-options="$OPTIONS" will modify the options when running the StrainPhlAn subtask (replacing the $OPTIONS in each with your selected settings).
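For example, a sketch passing custom Trimmomatic settings through to the KneadData subtask (the inner options are illustrative):
$ biobakery_workflows wmgx --input $INPUT --output $OUTPUT --qc-options="--trimmomatic-options='SLIDINGWINDOW:4:20 MINLEN:50'"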
Add the option --run-assembly to add the tasks to run assembly.
To run a demo
$ biobakery_workflows wmgx --input examples/wmgx/single/ --output workflow_output
$ biobakery_workflows wmgx --input examples/wmgx/paired/ --output workflow_output
Demo input files can be found in the examples folder.
Super Tasks
Requirements
$ conda install -c biobakery kneaddata
OR
$ pip install kneaddata
$ conda install -c bioconda metaphlan
$ conda install -c biobakery humann
OR
$ pip install humann
$ conda install -c bioconda strainphlan
Inputs
Input files should be named $SAMPLE.fastq.gz, $SAMPLE.R1.fastq.gz, or $SAMPLE.R2.fastq.gz, where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing ".R1" and ".R2" strings. If your paired-end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.
The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.
To run the workflow
$ biobakery_workflows wmgx_wmtx --input-metagenome $INPUT_WMS --input-metatranscriptome $INPUT_WTS --input-mapping $INPUT_MAPPING --output $OUTPUT
Replace $INPUT_WMS with the path to the folder containing your whole metagenome shotgun fastq.gz input files, $INPUT_WTS with the path to the folder containing your whole metatranscriptome shotgun fastq.gz input files, and $OUTPUT with the path to the folder to write output files. Replace $INPUT_MAPPING with your file of mapping between the metagenome and metatranscriptome samples. Add the option --qc-options="$OPTIONS" to modify the default quality control settings.
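A minimal sketch of a mapping file with illustrative sample names (tab-delimited, one metatranscriptome-to-metagenome pairing per line; run the workflow with --help to confirm the exact expected format):
$ cat mapping.tsv
wts_sample1	wms_sample1
wts_sample2	wms_sample2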
To run a demo
$ biobakery_workflows wmgx_wmtx --input-metagenome examples/wmgx_wmtx/wms/ --input-metatranscriptome examples/wmgx_wmtx/wts/ --input-mapping examples/wmgx_wmtx/mapping.tsv --output workflow_output
Demo input files can be found in the examples folder.
The 16s workflow has two methods that can be used: UPARSE (with either USEARCH or VSEARCH (default)) and DADA2. All methods perform quality control and generate taxonomic tables.
Workflow diagrams
Super Tasks
Requirements
$ conda install -c bioconda picrust
$ conda install -c bioconda picrust2
$ pip install biom-format
$ conda install -c bioconda clustal-omega
$ conda install -c bioconda ea-utils
$ conda install -c bioconda fasttree
Inputs
Input files should be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz, where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired-end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.
The workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq.
To run the workflow
$ biobakery_workflows 16s --input $INPUT --output $OUTPUT
Replace $INPUT with the path to the folder containing your fastq input files and $OUTPUT with the path to the folder to write output files.
Set --trunc-len-max 200, if running the VSEARCH/USEARCH method, to a smaller value. Reading through the maxee table will help to determine the length to use for trimming based on the joined reads and their quality scores. For other default settings, please run the workflow with the --help option. All of the other settings will work for most data sets. If there are any you would like to change, please review the usearch documentation to determine the optimal settings.
Add the option --method dada2 to run the DADA2 method instead of VSEARCH.
Add the option --amplicon-length <N> to run FIGARO to estimate truncation lengths when running with the DADA2 method.
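For example, to run the DADA2 method with FIGARO estimating truncation lengths (the amplicon length shown is illustrative):
$ biobakery_workflows 16s --input $INPUT --output $OUTPUT --method dada2 --amplicon-length 250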
This workflow will assemble and annotate sequenced microbial isolate genomes. It runs the raw sequences through quality control, assembly (with SPAdes), annotation (with Prokka), functional annotation, quality assessment, and then creates a final annotated contig file.
Workflow diagram
Super Tasks
Requirements
$ conda install -c biobakery kneaddata
OR
$ pip install kneaddata
$ conda install -c bioconda spades
$ conda install -c bioconda prokka
$ conda install -c bioconda quast
Inputs
Input files should be named $SAMPLE.fastq.gz, $SAMPLE_R1_001.fastq.gz, or $SAMPLE_R2_001.fastq.gz, where $SAMPLE is the sample name or identifier corresponding to the sequences. $SAMPLE can contain any characters except spaces or periods.
The workflow will detect if paired-end files are present. By default the workflow identifies paired-end reads based on file names containing "_R1_001" and "_R2_001" strings. If your paired-end reads have different identifiers, use the option --pair-identifier .R1 to provide the identifier string for the first file in the set of pairs.
To run the workflow
$ biobakery_workflows isolate_assembly --input $INPUT --species-name $SPECIES --output $OUTPUT
Replace $INPUT with the path to the folder containing your fastq input files, $SPECIES with the name of the isolate sequenced, and $OUTPUT with the path to the folder to write output files. The $SPECIES input string is used as the basename of the contig files.
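For example, with illustrative folder and species names:
$ biobakery_workflows isolate_assembly --input isolate_fastq/ --species-name Ecoli_K12 --output assembly_output/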
bioBakery workflows includes a single universal visualization workflow for shotgun sequences and 16s data. The workflow can be run on the command line with the following syntax:
$ biobakery_workflows vis --input $INPUT --project-name $PROJECT --output $OUTPUT
A subset of files from the $OUTPUT folder of a data processing workflow can be used as the $INPUT folder for the visualization workflow. For detailed information on the input files required for the visualization workflow, see the help message for the workflow by running the command:
$ biobakery_workflows vis --help
This workflow generates a document of tables, bar plots, a PCoA plot, scatter plots, and heatmaps using the output of the wmgx or 16S workflows as input.
Requirements
$ pip install numpy
$ pip install scipy
$ pip install matplotlib
$ conda install pandoc
$ conda install -c biobakery hclust2
Inputs
Outputs
To run the workflow
$ biobakery_workflows vis --input $INPUT --project-name $PROJECT --output $OUTPUT
Replace $INPUT with a folder containing files from the output folder created by running the wmgx or 16s data processing workflow, $PROJECT with the name of the project, and $OUTPUT with the path to the folder to write output files.
The stats workflow takes as input feature tables generated from the wmgx or 16s workflows. It can also be used with any tab-delimited feature table.
$ biobakery_workflows stats --input $INPUT --input-metadata $INPUT_METADATA --output $OUTPUT --project-name $PROJECT
Requirements
Inputs
The workflow requires four arguments and allows for seventeen optional arguments.
Workflow arguments can be provided on the command line or with an optional config file using the AnADAMA2 built-in option "--config".
Required
INPUT: The folder containing the input data files.
INPUT_METADATA: The metadata file for input.
OUTPUT: The folder to write the output files.
PROJECT: The name of the project (string for the report header).
Optional
Workflow steps
Identify data file types.
Determine the study type based on the input data files (ie. wmgx or 16s).
If biom files are provided, convert biom files to tsv.
Check for sample names in feature tables that are not included in the metadata file. Throw an error to request that the user add the sample names to the metadata file.
Create feature tables for all input files. These files are compatible with all downstream processing tasks (ie maaslin2, humann_barplots).
Run mantel tests comparing all input data files.
If pathway abundance files are provided, generate stratified pathways plots.
If longitudinal, run the permanova script.
If not longitudinal, run the beta diversity script to generate stacked barplots of R-squared and p-values for adonis run on each metadata variable and again on all variables.
Run MaAsLin2 on all input files.
Run HAllA on all input files (minus gene families due to size).
Create a report with figures.
Run a demo
Using the HMP2 (IBDMDB) merged data files provided by the project, run the stats workflow to generate a report.
$ biobakery_workflows stats --input HMP2_data/ --input-metadata HMP2_metadata.tsv --fixed-effects="diagnosis,dysbiosisnonIBD,dysbiosisUC,dysbiosisCD,antibiotics,age" --random-effects="site,subject" --project-name HMP2 --output HMP2_stats_output --longitudinal --static-covariates="age" --permutations 10 --maaslin-options="reference='diagnosis,nonIBD'"
The files in the input folder are the taxonomic profile, pathway abundance, and EC abundance. Fixed and random effect variables are specified for the MaAsLin2 runs. The metadata type selected is longitudinal and the static covariate in the study metadata is "age". The reduced number of permutations reduces the runtime for the three permanova calculations. The reference is provided to MaAsLin2, since diagnosis is a variable with more than two levels, to set the reference level of "nonIBD" for the model and the resulting box plots.
Outputs include a folder for each MaAsLin2 run plus figures folders and a report.
An MTX workflow based on the bioBakery AnADAMA2 wmgx workflows.
This workflow is currently installed in the Terra workspace: https://app.terra.bio/#workspaces/rjxmicrobiome/mtx_workflow .
The WDL is located in this repository at biobakery_workflows/workflows/wtx.wdl.
Inputs
The workflow has eleven required inputs and nine optional inputs.
Required inputs
The workflow requires eleven inputs for each run. Five inputs can be modified for each project, whereas the other six inputs would only be modified with software version changes.
To generate a file to use as input for InputRead1Files, follow the Terra instructions (https://support.terra.bio/hc/en-us/articles/360033353952-Creating-a-list-file-of-reads-for-input-to-a-workflow), adding to command #2 the InputRead1Identifier and the InputExtension. For example, with InputRead1Identifier = ".R1" and InputExtension = ".fastq.gz", command #2 would now be:
$ gsutil ls gs://your_data_Google_bucket_id/ | grep ".fastq.gz" | grep ".R1" > ubams.list
Also, since this workflow looks for fastq or fastq.gz input files, you might change the name of the file list in this command from "ubams.list" to "fastq_list.txt".
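For example, the same command writing the renamed list (the bucket id is a placeholder):
$ gsutil ls gs://your_data_Google_bucket_id/ | grep ".fastq.gz" | grep ".R1" > fastq_list.txt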
These six required inputs would only be modified if the versions of KneadData and HUMAnN v2 change. These are databases that are specifically tied to the software version.
These are located in this workspace google bucket:
databases/humann/full_chocophlan_plus_viral.v0.1.1.tar.gz
databases/kneaddata/Homo_sapiens_hg37_human_contamination_Bowtie2_v0.1.tar.gz
databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz
databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz
databases/humann/uniref90_annotated_1_1.tar.gz
databases/humann/full_utility_mapping_1_1.tar.gz
Optional inputs
There are an additional ten optional inputs for each workflow run. These are not required. If not set, the default values will be used.
There are three additional optional inputs that can be used to run with one or more custom databases.
Outputs
The workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. Each task has its own folder in the google bucket output folder with a sub-folder for each time it is run. The outputs of interest, including their respective folders, are described below. $SAMPLE_NAME
is the name of the sample included in the original raw files. For example, SAMPLE1.R1.fastq.gz would have a sample name of "SAMPLE1".
$SAMPLE_NAME.fastq.gz : This is the file of reads after running through QC.
$SAMPLE_NAME.log : This is the log from KneadData that includes read counts.
glob*/$SAMPLE_NAME_DB_contam*.fastq.gz : These are the reads that mapped to the reference database (with name $DB) for this sample.
glob*/$SAMPLE_NAME_[R1|R2].[html|zip] : These are the output files from running fastqc on read1 and read2 prior to running quality control.
$SAMPLE_NAME.log : This is the log from HUMAnN v2.0 that includes read alignment counts.
glob*/$SAMPLE_NAME_bowtie2_unaligned.fa : These are the unaligned reads from running the nucleotide search.
glob*/$SAMPLE_NAME_diamond_unaligned.fa : These are the unaligned reads from running the translated search.
$PROJECT_NAME_visualization.zip : This archive contains a visualization report plus final compiled data tables.
wmgx_report.pdf : This is the exploratory report of tables and figures.
data/humann2_feature_counts.tsv : This contains the feature counts (pathways, gene families, ECs) for each sample.
data/humann2_read_and_species_counts.tsv : This contains the counts of reads aligning at each step plus the total number of species identified for each sample.
data/kneaddata_read_count_table.tsv : This contains the read counts (split into pairs and orphans) for each step in the quality control process for each sample.
data/metaphlan2_taxonomic_profiles.tsv : This contains the merged taxonomic profiles for all samples.
data/microbial_counts_table.tsv : This table includes count ratios for each step of the quality control process for all samples.
data/pathabundance_relab.tsv : This is a merged table of the pathway abundances for all samples, normalized to relative abundance.
data/qc_counts_orphans_table.tsv : This is a table with the total number of orphan reads not aligning to each of the reference databases.
data/qc_counts_pairs_table.tsv : This is a table with the total number of paired reads not aligning to each of the reference databases.
data/taxa_counts_table.tsv : This table includes the total number of species and genera before and after filtering.
data/top_average_pathways_names.tsv : This table includes the top pathways by average abundance, with their full names, including average abundance and variance.
Run a demo
A demo data set is included in the Terra workspace. The demo set includes six paired samples (three MTX and three MGX) from IBDMDB plus a small metadata file. Using preemptive instances, this demo set will cost about $5 to run.
IBDMDB (6 sample) demo run configuration:
"ibdmdb_demo"
(this can be any string you would like)".fastq.gz"
"_R1"
"_R2"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/ibdmdb_file_list.txt"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/ibdmdb_demo_metadata.txt"
Required software specific databases:
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/humann/full_chocophlan_plus_viral.v0.1.1.tar.gz"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Homo_sapiens_hg37_human_contamination_Bowtie2_v0.1.tar.gz"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/humann/uniref90_annotated_1_1.tar.gz"
Optional custom databases (to run with one or more custom databases instead of the default references used in QC)
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/Clupus_bowtie2.tar.gz"
"gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/databases/kneaddata/ClupusRNA_bowtie2.tar.gz"
Refer to the section above for descriptions of the output files generated by running the workflow.
Example output files from running the IBDMDB data set with metadata can be found in this workspace in the folder IBDMDB/final_outputs/ibdmdb_demo_visualizations.zip.
Thanks go to these wonderful people: