Zymo-Research / aladdin-shotgun

MIT License

Introduction

This is a bioinformatics analysis pipeline for shotgun metagenomic data developed at Zymo Research. It was adapted from the community-developed nf-core/taxprofiler pipeline, version 1.0.0. Many changes were made to the original pipeline. Some are based on our experience or preferences, but more importantly, we want to make the pipeline and its results easier to use and understand for people without bioinformatics experience. The pipeline can be run on the point-and-click bioinformatics platform Aladdin Bioinformatics. Changes include, but are not limited to:

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Pipeline summary

  1. Read QC (FastQC, or Falco as an alternative option)
  2. Optional read pre-processing (code for long reads inherited from nf-core/taxprofiler, but not yet separately tested by us)
  3. Host-read removal
    • Host-read removal (short reads: Bowtie2; long reads: minimap2). This step is skipped when sourmash-zymo is selected as the database, because that database already contains host sequences.
    • Host-read removal statistics (Samtools)
  4. Run merging when applicable
  5. Identification of antimicrobial resistance genes against the MEGARes database, version 3
    • Reads are aligned to the MEGARes reference (BWA-MEM)
    • Resistome statistics are quantified and compiled for each sample (AMRplusplus)
  6. Taxonomic profiling using one of: (nf-core/taxprofiler offers more choices for this step; if there are tools you would like added, please let us know.)
  7. Merging of all taxonomic profiling results into one table, followed by alpha/beta diversity analysis (QIIME 2)
  8. Comparison of user samples with already-profiled reference datasets (QIIME 2)
  9. A report presenting the results of all the above steps (MultiQC)

Quick Start

We recommend running this pipeline via the Aladdin Bioinformatics platform, which is much easier and requires no coding. Also, because the Zymo sourmash database is private, public users cannot use it via the command line. If you would still like to run the pipeline via the CLI, see the instructions below.

Prerequisites

Using AWS Batch

nextflow run Zymo-Research/aladdin-shotgun \
    -profile awsbatch \
    --design "<path to design CSV file>" \
    --database sourmash-zymo \
    --run_amr true \
    -work-dir "<work dir on S3>" \
    --awsregion "<AWS Batch region>" \
    --awsqueue "<SQS ARN>" \
    --outdir "<output dir on S3>" \
    -r "0.0.4" \
    -name "<report title>"
  1. The parameter --design is required. It must be a CSV file with the following format.
    sample,read_1,read_2,group,run_accession
    sample1,s1_run1_R1.fastq.gz,s1_run1_R2.fastq.gz,groupA,run1
    sample1,s1_run2_R1.fastq.gz,s1_run2_R2.fastq.gz,groupA,run2
    sample2,s2_run1_R1.fastq.gz,,groupB,
    sample3,s3_run1_R1.fastq.gz,s3_run1_R2.fastq.gz,groupB,
    • The header line must be present.
    • The columns "sample", "read_1", "read_2", "group" must be present. Column "run_accession" is optional.
    • The column "sample" contains the name/label for each sample. It can be duplicate. When duplicated, it means the same sample has multiple sequencing runs. In those cases, a different value for "run_accession" is expected. See "sample1" in above example. Sample names must contain only alphanumerical characters or underscores, and must start with a letter.
    • The columns "read_1", "read_2" refers to the paths, including S3 paths, of Read 1 and 2 of Illumina paired-end data. They must be ".fastq.gz" or ".fq.gz" files. When your data are single-end Illumina or PacBio data, simply use "read_1" column, and leave "read_2" column empty. FASTA files from Nanopore data are currently not supported.
    • The column "group" contains the group name/label for comparison purposes in the diversity analysis. If you don't have/need this information, simply leave the column empty, but this column must be present regardless. Same rules for legal characters of sample names apply here too.
    • The column "run_accesssion" is optional. It is only required when there are duplicates in the "sample" column. This is to mark different run names for the sample.
  2. The parameter --database selects the taxonomy profiler and database. Its default value is 'sourmash-zymo'; you can omit it if you don't want to change that.
  3. The parameter --run_amr enables antimicrobial resistance analysis. It defaults to false; to skip this analysis, remove --run_amr from the command line or set it to "false".
  4. The parameters --awsregion, --awsqueue, -work-dir, and --outdir are required when running on AWS Batch; the latter two must be directories on S3.
  5. The parameter -r runs a specific release of the pipeline. If not specified, the main branch is run instead.
  6. The parameter -name will define the title of the MultiQC report.

There are many other options built into the pipeline to customize your run and handle specific situations; please refer to the Usage Documentation.

Using Docker

nextflow run Zymo-Research/aladdin-shotgun \
    -profile docker \
    --design "<path to design CSV file>" \
    --database sourmash-zymo

Please see above for requirements of the design CSV file.

Credits

This pipeline was adapted from nf-core/taxprofiler version 1.0.0. Please refer to its credits for the list of original contributors. Contributors from Zymo Research include: