Zymo-Research / aladdin-shotgun

MIT License

Introduction

This is a bioinformatics analysis pipeline for shotgun metagenomic data developed at Zymo Research. It was adapted from the community-developed nf-core/taxprofiler pipeline, version 1.0.0. Many changes were made to the original pipeline. Some are based on our experience or preferences, but more importantly, we want to make the pipeline and its results easier to use and understand for people without bioinformatics experience. The pipeline can be run on the point-and-click bioinformatics platform Aladdin Bioinformatics. Changes include, but are not limited to:

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Pipeline summary

  1. Read QC (FastQC, or Falco as an alternative option)
  2. Optional read pre-processing (code for long reads inherited from nf-core/taxprofiler, but not yet separately tested by us)
  3. Host-read removal
    • Host-read removal (short reads: Bowtie2; long reads: minimap2). This step is skipped when sourmash-zymo is selected as the database, because that database already contains host sequences.
    • Host-read removal statistics (Samtools)
  4. Run merging when applicable
  5. Identification of antimicrobial resistance genes against the MEGARes database, version 3
    • Reads are aligned to the MEGARes reference (BWA-MEM)
    • Resistome statistics are quantified and compiled for each sample (AMRplusplus)
  6. Taxonomic profiling using one of: (nf-core/taxprofiler offers more choices for this step; if there are tools you would like added, please let us know.)
  7. Merging of all taxonomic profiling results into one table, followed by alpha/beta diversity analysis (QIIME 2)
  8. Comparison of user samples with already-profiled reference datasets (QIIME 2)
  9. A report presenting the results of all the above steps (MultiQC)

Quick Start

We recommend running this pipeline via the Aladdin Bioinformatics platform, which is much easier and requires no coding. Also, because the Zymo sourmash database is private, public users cannot use it via the command line. If you would still like to run the pipeline via the CLI, see the instructions below.

Prerequisites

Using AWS Batch

nextflow run Zymo-Research/aladdin-shotgun \
    -profile awsbatch \
    --design "<path to design CSV file>" \
    --database sourmash-zymo \
    --run_amr true \
    -work-dir "<work dir on S3>" \
    --awsregion "<AWS Batch region>" \
    --awsqueue "<SQS ARN>" \
    --outdir "<output dir on S3>" \
    -r "0.0.4" \
    -name "<report title>"
  1. The parameter --design is required. It must be a CSV file with the following format.
    sample,read_1,read_2,group,run_accession
    sample1,s1_run1_R1.fastq.gz,s1_run1_R2.fastq.gz,groupA,run1
    sample1,s1_run2_R1.fastq.gz,s1_run2_R2.fastq.gz,groupA,run2
    sample2,s2_run1_R1.fastq.gz,,groupB,
    sample3,s3_run1_R1.fastq.gz,s3_run1_R2.fastq.gz,groupB,
    • The header line must be present.
    • The columns "sample", "read_1", "read_2", "group" must be present. Column "run_accession" is optional.
    • The column "sample" contains the name/label for each sample. It can be duplicate. When duplicated, it means the same sample has multiple sequencing runs. In those cases, a different value for "run_accession" is expected. See "sample1" in above example. Sample names must contain only alphanumerical characters or underscores, and must start with a letter.
    • The columns "read_1", "read_2" refers to the paths, including S3 paths, of Read 1 and 2 of Illumina paired-end data. They must be ".fastq.gz" or ".fq.gz" files. When your data are single-end Illumina or PacBio data, simply use "read_1" column, and leave "read_2" column empty. FASTA files from Nanopore data are currently not supported.
    • The column "group" contains the group name/label for comparison purposes in the diversity analysis. If you don't have/need this information, simply leave the column empty, but this column must be present regardless. Same rules for legal characters of sample names apply here too.
    • The column "run_accesssion" is optional. It is only required when there are duplicates in the "sample" column. This is to mark different run names for the sample.
  2. The parameter --database selects the taxonomy profiler and database. Its default value is 'sourmash-zymo'; you can omit it if you don't want to change that.
  3. The parameter --run_amr enables antimicrobial resistance analysis. It defaults to false; to skip this analysis, remove --run_amr from the command line or set it to "false".
  4. The parameters --awsregion, --awsqueue, -work-dir, and --outdir are required when running on AWS Batch; the latter two must be directories on S3.
  5. The parameter -r runs a specific release of the pipeline. If not specified, the main branch is run instead.
  6. The parameter -name will define the title of the MultiQC report.

There are many other options built into the pipeline to customize your run and handle specific situations; please refer to the Usage Documentation.

Using Docker

nextflow run Zymo-Research/aladdin-shotgun \
    -profile docker \
    --design "<path to design CSV file>" \
    --database sourmash-zymo

Please see above for requirements of the design CSV file.

Credits

This pipeline was adapted from nf-core/taxprofiler version 1.0.0. Please refer to its credits for the list of original contributors. Contributors from Zymo Research include: