epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data
Other
45 stars 21 forks source link

Metagenomics workflow

Taxonomic classification from shotgun metagenomics sequencing.

Introduction

This workflow can be used for the following:

Additional features:

Compute requirements

Recommended requirements:

Minimum requirements:

Approximate run time: ~40min for 1 million reads in total (24 barcodes) using Kraken2 and the Standard-8 database (using a previously downloaded db).

ARM processor support: True

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME application.

The workflow uses nextflow to manage compute and software resources. Therefore, nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified in the example below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository into the assets folder of nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-metagenomics --help 

A demo dataset is provided for testing of the workflow. It can be downloaded using:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/wf-metagenomics-demo.tar.gz
tar -xzvf wf-metagenomics-demo.tar.gz

The workflow can be run with the demo data using:

nextflow run epi2me-labs/wf-metagenomics \
--fastq wf-metagenomics-demo/test_data/ \
-profile standard 

For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/

Related protocols

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related protocols in the Nanopore community.

Input example

This workflow accepts either FASTQ or BAM files as input.

The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.

(i)                     (ii)                 (iii)    
input_reads.fastq   ─── input_directory  ─── input_directory
                        ├── reads0.fastq     ├── barcode01
                        └── reads1.fastq     │   ├── reads0.fastq
                                             │   └── reads1.fastq
                                             ├── barcode02
                                             │   ├── reads0.fastq
                                             │   ├── reads1.fastq
                                             │   └── reads2.fastq
                                             └── barcode03
                                              └── reads0.fastq

Input parameters

Input Options

Nextflow parameter name Type Description Help Default
fastq string FASTQ files to use in the analysis. This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
bam string BAM or unaligned BAM (uBAM) files to use in the analysis. This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
classifier string Kraken2 or Minimap2 workflow to be used for classification of reads. Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info. kraken2
analyse_unclassified boolean Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. False
exclude_host string A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis.

Real Time Analysis Options

Nextflow parameter name Type Description Help Default
real_time boolean Enable to continuously watch the input directory for new input files. Reads will be analysed as they appear This option enables the use of Nextflow’s directory watching feature to constantly monitor input directories for new files. As soon as files are written by an external process Nextflow will begin analysing these files. The workflow will accumulate data over time to produce an updating report. False
batch_size integer Maximum number of sequence records to process in a batch. Large files will be split such that batch_size records are processed together. Set to 0 to avoid rebatching input files. A value of 32000 is recommended to rebatch large files. 0
read_limit integer Stop processing data when a particular number of reads have been analysed. By default the workflow will run indefinitely. Sets the upper bound on the number of reads that will be analysed before the workflow is automatically stopped and no more data is analysed.
port integer Network port for communication between Kraken2 server and clients (available in real time pipeline). The workflow uses a server process to handle Kraken2 classification requests. This allows the workflow to persist the sequence database in memory throughout the duration of processing. The option specifies the local network port on which the server and clients will communicate. 8080
host string Network hostname (or IP address) for communication between Kraken2 server and clients. (See also 'external_kraken2' parameter). (Available in real time pipeline). The workflow uses a server process to handle Kraken2 classification requests. This allows the workflow to persist the sequence database in memory throughout the duration of processing. The option specifies the local network hostname (or IP address) of the Kraken server. localhost
external_kraken2 boolean Whether a pre-existing Kraken2 server should be used, rather than creating one as part of the workflow. (Available in real time pipeline). By default the workflow assumes that it is running on a single host computer, and further that it should start its own Kraken2 server. It may be desirable to start a Kraken2 server outside of the workflow, in which case this option should be enabled. This option may be used in conjunction with the host option to specify that the Kraken2 server is running on a remote computer. False
server_threads integer Number of CPU threads used by the Kraken2 server for classifying reads. (Available in the real_time pipeline). For the real-time Kraken2 workflow, this is the number of CPU threads used by the Kraken2 server for classifying reads. 2
kraken_clients integer Number of clients that can connect at once to the Kraken-server for classifying reads. (Available in the real_time pipeline). For the real-time Kraken2 workflow, this is the number of clients sending reads to the server. It should not be set to more than 4 fewer than the executor CPU limit. 2

Sample Options

Nextflow parameter name Type Description Help Default
sample_sheet string A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. Disabled in the real time pipeline. The sample sheet is a CSV file with, minimally, columns named barcode,alias. Extra columns are allowed. A type column is required for certain workflows and should have the following values; test_sample, positive_control, negative_control, no_template_control.
sample string A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. Disabled in the real time pipeline.

Reference Options

Nextflow parameter name Type Description Help Default
database_set string Sets the reference, databases and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1', 'Standard-8', 'PlusPF-8', 'PlusPFP-8']. Memory requirement will be slightly higher than the size of the database. Standard-8, PlusPF-8 and PlusPFP-8 databases require more than 8GB. This setting is overridable by providing an explicit taxonomy, database or reference path in the other reference options. Standard-8
database string Not required but can be used to specifically override Kraken2 database [.tar.gz or Directory]. By default uses database chosen in database_set parameter.
taxonomy string Not required but can be used to specifically override taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory]. By default NCBI taxonomy file will be downloaded and used.
reference string Override the FASTA reference file selected by the database_set parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index. This option should be used in conjunction with the database parameter to specify a custom database.
ref2taxid string Not required but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row. By default uses ref2taxid for option chosen in database_set parameter.
taxonomic_rank string Returns results at the taxonomic rank chosen. In the Kraken2 pipeline: set the level that Bracken will estimate abundance at. Default: S (species). Other possible options are K (kingdom level), P (phylum), C (class), O (order), F (family), and G (genus). S

Kraken2 Options

Nextflow parameter name Type Description Help Default
bracken_length integer Set the length value Bracken will use Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically. ncbi_16s_18s = 1000 , ncbi_16s_18s_28s_ITS = 1000 , PlusPF-8 = 300
kraken2_memory_mapping boolean Avoids loading database into RAM Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. It may be useful if the available RAM memory is lower than the size of the chosen database. False
include_kraken2_assignments boolean A per sample TSV file that indicates how each input sequence was classified as well as the taxon that has been assigned to each read. The TSV's will only be output on completion of the workflow and therefore not at all if using the real time option whilst running indefinitely. False
kraken2_confidence number Kraken2 Confidence score threshold. Default: 0.0. Valid interval: 0-1 Apply a threshold to determine if a sequence is classified or unclassified. Please visit the following link for further details about how it works: https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring. 0.0

Minimap2 Options

Nextflow parameter name Type Description Help Default
minimap2filter string Filter output of minimap2 by taxids inc. child nodes, E.g. "9606,1404" Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs.
minimap2exclude boolean Invert minimap2filter and exclude the given taxids instead Exclude a list of taxids from analysis outputs. False
split_prefix boolean Enable if using a very large reference with minimap2 If reference fasta large enough to require multipart index, set to true to use split-prefix option with minimap2. False
keep_bam boolean Copy bam files into the output directory. It also creates the configuration and reduced reference files needed to load the alignments in IGV. False
minimap2_by_reference boolean Add a table with the mean sequencing depth per reference, standard deviation and coefficient of variation. It adds a scatterplot of the sequencing depth vs. the coverage and a heatmap showing the depth per percentile to the report False
min_percent_identity number Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified. 90
min_ref_coverage number Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths. 0

Antimicrobial Resistance Options

Nextflow parameter name Type Description Help Default
amr boolean Scan reads for antimicrobial resistance or virulence genes Reads will be scanned using abricate and the chosen database (--amr_db) to identify any acquired antimicrobial resistance or virulence genes found present in the dataset. NOTE: It cannot identify mutational resistance genes. False
amr_db string Database of antimicrobial resistance or virulence genes to use. resfinder
amr_minid integer Threshold of required identity to report a match between a gene in the database and fastq reads. Valid interval: 0-100 80
amr_mincov integer Minimum coverage (breadth-of) threshold required to report a match between a gene in the database and fastq reads. Valid interval: 0-100. 80

Report Options

Nextflow parameter name Type Description Help Default
abundance_threshold number Remove those taxa whose abundance is equal or lower than the chosen value. To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads) use a decimal between 0-1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number larger or equal to 1. 0
n_taxa_barplot integer Number of most abundant taxa to be displayed in the barplot. The rest of taxa will be grouped under the "Other" category. 9

Output Options

Nextflow parameter name Type Description Help Default
out_dir string Directory for output of all user-facing files. output

Advanced Options

Nextflow parameter name Type Description Help Default
min_len integer Specify read length lower limit. Any reads shorter than this limit will not be included in the analysis. 0
min_read_qual number Specify read quality lower limit. Any reads with a quality lower than this limit will not be included in the analysis.
max_len integer Specify read length upper limit Any reads longer than this limit will not be included in the analysis.
threads integer Maximum number of CPU threads to use in each parallel workflow task. Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. See server threads parameter for Kraken specific threads in the real_time pipeline. 4

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

Title File path Description Per sample or aggregated
workflow report ./wf-metagenomics-report.html Report for all samples. aggregated
Abundance table with counts per taxa ./abundancetable{{ taxonomic_rank }}.tsv Per-taxa counts TSV, including all samples. aggregated
Bracken report file ./bracken/{{ alias }}.kraken2_bracken.report TSV file with the abundance of each taxa. See more info here: https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report. per-sample
Kraken2 taxonomic assignment per read (Kraken2 pipeline) ./kraken2/{{ alias }}.kraken2.report.txt Lineage-aggregated counts. See more info here: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format. per-sample
Kraken2 taxonomic asignment per read (Kraken2 pipeline) ./kraken2/{{ alias }}.kraken2.assignments.tsv TSV file with the taxonomic assignment per read. See more info here: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format. per-sample
Host BAM file ./host_bam/{{ alias }}.bam BAM file generated from mapping filtered input reads to the host reference. per-sample
BAM index file of host reads ./host_bam/{{ alias }}.bai BAM index file generated from mapping filtered input reads to the host reference. per-sample
BAM file (minimap2) ./bams/{{ alias }}.reference.bam BAM file generated from mapping filtered input reads to the reference. per-sample
BAM index file (minimap2) ./bams/{{ alias }}.reference.bam.bai Index file generated from mapping filtered input reads to the reference. per-sample
BAM flagstat (minimap2) ./bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv Mapping results per reference per-sample
Minimap2 alignment statistics (minimap2) ./bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz Per read stats after aligning per-sample
JSON file with identified AMR genes (amr) ./amr/{{ alias }}.json JSON file with abricate results. See more info here: https://github.com/tseemann/abricate#output. per-sample
Reduced reference FASTA file ./igv_reference/reduced_reference.fasta.gz Reference FASTA file containing only those sequences that have reads mapped against them. aggregated
Index of the reduced reference FASTA file ./igv_reference/reduced_reference.fasta.gz.fai Index of the reference FASTA file containing only those sequences that have reads mapped against them. aggregated
GZI index of the reduced reference FASTA file ./igv_reference/reduced_reference.fasta.gz.gzi Index of the reference FASTA file containing only those sequences that have reads mapped against them. aggregated
JSON configuration file for IGV browser ./igv.json JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference. aggregated

Pipeline overview

1. Concatenate input files and generate per read stats

fastcat is used to concatenate input FASTQ files prior to downstream processing of the workflow. It will also output per-read stats including read lengths and average qualities.

You may want to choose which reads are analysed by filtering them using these flags max_len, min_len, min_read_qual, (see the Inputs section for details).

2. Remove host sequences (optional)

We have included an optional filtering step to remove any host sequences that map (using Minimap2) against a provided host reference (e.g. human), which can be a FASTA file or a MMI index. To use this option provide the path to your host reference with the exclude_host parameter. The mapped reads are output in a BAM file and excluded from further analysis.

nextflow run epi2me-labs/wf-metagenomics --fastq test_data/case04/reads.fastq.gz --exclude_host test_data/case04/host.fasta.gz

3. Classify reads taxonomically

There are two different approaches to taxonomic classification:

3.1 Using Kraken2

Kraken2 provides the fastest method for the taxonomic classification of the reads. Then, Bracken is used to provide an estimate of the species (or the selected taxonomic rank) abundance in the sample.

3.1.1 Running wf-metagenomics in real time

The Kraken2 mode can be used in real-time, allowing the workflow to run parallel with an ongoing sequencing run as read data is being produced by the Oxford Nanopore Technologies sequencing instrument. In this case, Kraken2 is used with the Kraken2-server and the user can visualise the classification of reads and species abundances in a real-time updating report.
In real-time mode, the workflow processes new input files as they become available in batches of the specified size. Thus, this option cannot be used with a single fastq as input.

Note: When using the workflow in real-time, the workflow will run indefinitely until a user interrupts the program (e.g with ctrl+c when on the command line). The workflow can be configured to complete automatically after a set number of reads have been analysed using the read_limit variable. Once this threshold has been reached, the program will emit a STOP.fastq.gz file into the fastq directory, which will instruct the workflow to complete. The "STOP.fastq.gz" file is then deleted.

nextflow run epi2me-labs/wf-metagenomics --fastq test_data/case01 --real_time --batch_size 1000 --read_limit 4000

If running the Kraken2 pipeline real_time in a cluster, there are two options to enable the workflow to be able to communicate with the Kraken-server:

  1. Run a Kraken-server separately outside of the workflow.
  2. Submit the workflow job to run on a single node (i.e. running as if on a single local machine).

Notes on CPU resource of Kraken-server and client in the real time workflow The real-time subworkflow uses a server process to handle Kraken2 classification requests. This allows the workflow to persist the sequence database in memory throughout the duration of processing. There are some parameters that may be worth considering to improve the performance of the workflow:

  • port: The option specifies the local network port on which the server and clients will communicate.
  • host: Network hostname (or IP address) for communication between Kraken2 server and clients. (See also external_kraken2 parameter).
  • external_kraken2: Whether a pre-existing Kraken2 server should be used, rather than creating one as part of the workflow. By default the workflow assumes that it is running on a single host computer, and further that it should start its own Kraken2 server. It may be desirable to start a Kraken2 server outside of the workflow (for example to host a large database), in which case this option should be enabled. This option may be used in conjuction with the host option to specify that the Kraken2 server is running on a remote computer.
  • server_threads: Number of CPU threads used by the Kraken2 server for classifying reads.
  • kraken_clients: Number of clients that can connect at once to the Kraken-server for classifying reads. It should not be set to more than 4 fewer than the executor CPU limit.

3.2 Using Minimap2

Minimap2 provides better resolution, but, depending on the reference database used, can take significantly more time. Also, running the workflow with minimap2 does not support real-time analysis.

nextflow run epi2me-labs/wf-metagenomics --fastq test_data/case01 --classifier minimap2

The creation of alignment statistics plots can be enabled with the minimap2_by_reference flag. Using this option produces a table and scatter plot in the report showing sequencing depth and coverage of each reference. The report also contains a heatmap indicating the sequencing depth over relative genomic coordinates for the references with the highest coverage (references with a mean coverage of less than 1% of the one with the largest value are omitted).

In addition, the user can output BAM files in a folder called bams by using the option keep_bam. If the user provides a custom database, the workflow will also output the references with reads mappings, as well as an IGV configuration file. This configuration file allows the user to view the alignments in the EPI2ME Desktop Application in the Viewer tab. Note that the number of references can be reduced using the abundance_threshold option, which will select those references with a number of reads aligned higher than this value. Please, consider that the view of the alignment is highly dependent on the reference selected.

4. Identify Antimicrobial Resistance Genes (AMR) (optional)

The workflow can be used to determine the presence of acquired antimicrobial resistance (AMR) or virulence genes within the dataset. It uses ABRicate to scan reads against a database of AMR/virulence genes.

nextflow run epi2me-labs/wf-metagenomics --fastq path/to/fastq/ --database_set PlusPF-8 --amr

Note: ABRicate can only report the presence of acquired AMR/virulence genes but cannot identify SNP-mediated antimicrobial resistance.

5. Prepare output

The main output of the wf-metagenomics pipeline is the wf-metagenomics-report.html which can be found in the output directory. It contains a summary of read statistics, the taxonomic composition of the sample and some diversity metrics. The results shown in the report can also be customised with several options. For example, you can use abundance_threshold to remove all taxa less prevalent than the threshold from the abundance table. When setting this parameter to a natural number, taxa with fewer absolute counts are removed. You can also pass a decimal between 0.0-1.0 to drop taxa of lower relative abundance. Furthermore, n_taxa_barplot controls the number of taxa displayed in the bar plot and groups the rest under the category ‘Other’.

The workflow output also contains Kraken and bracken reports for each sample. Additionally, the ‘species-abundance.tsv’ is a table with the counts of the different taxa per sample. You can use the flag include_kraken2_assignments to include a per sample TSV file that indicates how each input sequence was classified as well as the taxon that has been assigned to each read. This TSV file will only be output on completion of the workflow and therefore not at all if using the real time option whilst running indefinitely. This option is available in the Kraken2 pipeline.

5.1 Diversity indices

Species diversity refers to the taxonomic composition in a specific microbial community. There are some useful concepts to take into account:

There are three types of biodiversity measures described over a special scale 1, 2: alpha-, beta-, and gamma-diversity.

To provide a quick overview of the alpha-diversity of the microbial community, we provide some of the most common diversity metrics calculated for a specific taxonomic rank 3, which can be chosen by the user with the taxonomic_rank parameter ('D'=Domain,'P'=Phylum, 'C'=Class, 'O'=Order, 'F'=Family, 'G'=Genus, 'S'=Species). By default, the rank is 'S' (species-level). Some of the included alpha diversity metrics are:

H = -\sum_{i=1}^{S}p_i*ln(p_i)
D = \sum_{i=1}^{S}p_i^2
J = H/ln(S)
BP = n_i/N

where ni refers to the counts of the most abundant taxon and N is the total of counts.

S = \alpha * ln(1 + N/\alpha)

where S is the total number of taxa, N is the total number of individuals in the sample. The value of Fisher's $\alpha$ is calculated by iteration.

These indices are calculated by default using the original abundance table (see McMurdie and Holmes5, 2014 and Willis6, 2019). If you want to calculate them from a rarefied abundance table (i.e. all the samples have been subsampled to contain the same number of counts per sample, which is the 95% of the minimum number of total counts), you can download the rarefied table from the report.

The report also includes the rarefaction curve per sample which displays the mean of species richness for a subsample of reads (sample size). Generally, this curve initially grows rapidly, as most abundant species are sequenced and they add new taxa in the community, then slightly flattens due to the fact that 'rare' species are more difficult of being sampled, and because of that is more difficult to report an increase in the number of observed species.

Note: Within each rank, each named taxon is a unique unit. The counts are the number of reads assigned to that taxon. All Unknown sequences are considered as a unique taxon

Troubleshooting

FAQ's

If your question is not answered here, please report any issues or suggestions on the github issues page or start a discussion on the community.

Related blog posts

See the EPI2ME website for lots of other resources and blog posts.