allenlab / QIIME2_16S_ASV_protocol

13 stars 4 forks source link

Protocol for processing 16S sequences in QIIME2

This document describes our procedure for processing 16S amplicon libraries using the 515F-Y/926R primer set (Parada et al. 2015). Amplicon sequence variants are generated with DADA2 (Callahan et al. 2016). For annotation, we primarily use the SILVA database but supplement with PhytoREF for plastid sequences.

QIIME2 visualizations can be viewed here.

We normally but not exclusively generate these libraries from DNA and RNA extracted from Sterivex filters. See our automated extraction protocols here:

Requirements

Important Notes

Start

Activate your conda environment for QIIME2 and cd to your working directory

source activate qiime2-2019.10
cd $working_directory

Import files

This assumes your data are provided as demultiplexed paired-end fastq files with PHRED 33 encoded quality scores. If your data are in a different format, see the QIIME2 documentation on importing data.

First, generate a fastq manifest file which maps sample IDs to the full path of your fastq files (compressed as fastq.gz is also fine). The manifest is a tab-delimited file (.tsv) with the following headers:

sample-id forward-absolute-filepath reverse-absolute-filepath

Import the files and validate the output.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime tools validate paired-end-demux.qza

A visualization of the imported sequences is often helpful.

qiime demux summarize \
  --i-data paired-end-demux.qza \
  --o-visualization paired-end-demux.qzv

Trim reads

Remove the primers from reads with cutadapt.

Parameter notes:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences paired-end-demux.qza  \
    --p-cores 1 \
    --p-adapter-f ^GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG \
    --p-adapter-r ^CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC \
    --p-error-rate 0.1 \
    --p-overlap 3 \
    --verbose \
    --o-trimmed-sequences paired-end-demux-trimmed.qza

Visualize trimmed sequences:

qiime demux summarize \
    --i-data paired-end-demux-trimmed.qza \
    --p-n 100000 \
    --o-visualization paired-end-demux-trimmed.qzv

Denoise with DADA2

Generate and quantify amplicon sequence variants ASVs with DADA2

Parameter notes:

mkdir asvs

qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired-end-demux-trimmed.qza \
    --p-n-threads 0 \
    --p-trunc-q 2 \
    --p-trunc-len-f 219 \
    --p-trunc-len-r 194 \
    --p-max-ee-f 2 \
    --p-max-ee-r 4 \
    --p-n-reads-learn 1000000 \
    --p-chimera-method pooled \
    --o-table ./asvs/table-dada2.qza \
    --o-representative-sequences ./asvs/rep-seqs-dada2.qza \
    --o-denoising-stats ./asvs/stats-dada2.qza

Generate and examine the DADA2 stats. You should be retaining most (>50%) of your sequences. If you are losing a large number of sequences at a certain DADA2 step, you will need to troubleshoot and adjust your DADA2 parameters accordingly.

qiime metadata tabulate \
  --m-input-file ./asvs/stats-dada2.qza \
  --o-visualization ./asvs/stats-dada2.qzv

Merge and summarize denoised data

If you have multiple sequencing runs, proceed with merging the table and sequences from separate DADA2 runs. If not, proceed to the summarization and export steps.

Merge (add additional lines for --i-tables and --i-data as needed):

qiime feature-table merge \
  --i-tables ./asvs/run1_table-dada2.qza \
  --i-tables ./asvs/run2_table-dada2.qza \
  --o-merged-table merged_table-dada2.qza

qiime feature-table merge-seqs \
  --i-data ./asvs/run1_rep-seqs-dada2.qza \
  --i-data ./asvs/run2_rep-seqs-dada2.qza \
  --o-merged-data merged_rep-seqs.qza

Summarize and export:

qiime tools export \
  --input-path merged_table-dada2.qza \
  --output-path asv_table

biom convert -i asv_table/feature-table.biom -o asv_table/asv-table.tsv --to-tsv
qiime tools export \
  --input-path merged_rep-seqs.qza \
  --output-path asvs

qiime feature-table tabulate-seqs \
  --i-data merged_rep-seqs.qza \
  --o-visualization merged_rep-seqs.qzv

If you have metadata, you can include it here:

qiime feature-table summarize \
  --i-table merged_table-dada2.qza \
  --o-visualization merged_table-dada2.qzv \
  --m-sample-metadata-file metadata.tsv

Taxonomic annotation

Silva

QIIME2 compatable files can be obtained here.

Extract the region between the primers and and train the classifier:

qiime feature-classifier extract-reads \
  --i-sequences ./db/silva/silva-138-99-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --o-reads ./db/silva/silva-138-99-extracts.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ./db/silva/silva-138-99-extracts.qza \
  --i-reference-taxonomy ./db/silva/silva-138-99-tax.qza \
  --o-classifier ./db/silva/silva-138-99-classifier.qza

Classify the ASVs. The option --p-n-jobs -1 uses the all CPUs. Adjust accordingly:

qiime feature-classifier classify-sklearn \
  --p-n-jobs -1 \
  --i-classifier ./db/silva/silva-138-99-classifier.qza \
  --i-reads ./asvs/merged_rep-seqs.qza \
  --o-classification silva_tax_sklearn.qza

qiime tools export \
  --input-path silva_tax_sklearn.qza \
  --output-path asv_tax_dir

mv asv_tax_dir/taxonomy.tsv asv_tax_dir/silva_taxonomy.tsv

PhytoREF

Create a simple fasta file and tab-delimited taxonomy file that links the fasta IDs to the taxonomic string. For example:

$ head PhytoRef_with_taxonomy_simple.fasta
>202
GGCATGCTTAACACATGCAAGTTGAACGAAGACAAAAATTAGGCTTGCCAAAGTTTTGGA
CTTAGTAGCGGACGGGTGAGTAACGCGTAAGAATCTACCCCTAGGAAAAGCATAACAACT
GGAAACGGTTGCTAATACTTTATATGCTGAGAAGTTAAATGGGTTTCCGCCTAGGGATGA

$ head PhytoRef_taxonomy.tsv
202     HE610155|Eukaryota|Archaeplastida|Chlorophyta|Ulvophyceae|Ulvophyceae_X|Ulvophyceae_XX|Ulvophyceae_XXX|Ulvophyceae_XXXX|Desmochloris|Desmochloris halophila
803     AF514849|Eukaryota|Stramenopiles|Ochrophyta|Bacillariophyta|Bacillariophyceae|Naviculales|Naviculales_X|Naviculaceae|Haslea|Haslea crucigera
819     X80390|Eukaryota|Hacrobia|Haptophyta|Prymnesiophyceae|Prymnesiophyceae_X|Coccolithales|Coccolithales_X|Hymenomonadaceae|Ochrosphaera|Ochrosphaera neapolitana

Import the sequence data, select the region between the primers, and train the classifier:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path ./db/phytoref/PhytoRef_with_taxonomy_simple.fasta  \
  --output-path ./db/phytoref/phytoref.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path ./db/phytoref/PhytoRef_taxonomy.tsv \
  --output-path ./db/phytoref/phytoref_tax.qza

qiime feature-classifier extract-reads \
  --i-sequences ./db/phytoref/phytoref.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --o-reads ./db/phytoref/phytoref_extracts.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ./db/phytoref/phytoref_extracts.qza \
  --i-reference-taxonomy ./db/phytoref/phytoref_tax.qza \
  --o-classifier ./db/phytoref/phytoref_classifier.qza

Classify ASVs:

qiime feature-classifier classify-sklearn \
  --p-n-jobs -1 \
  --i-classifier ./db/phytoref/phytoref_classifier.qza \
  --i-reads merged_rep-seqs.qza \
  --o-classification phytoref_tax_sklearn.qza

qiime tools export \
  --input-path phytoref_tax_sklearn.qza \
  --output-path asv_tax_dir

mv asv_tax_dir/taxonomy.tsv asv_tax_dir/phytoref_taxonomy.tsv

Final output table

From here, you should be able to do additional analyses within QIIME2. If you would like to create an tab-delimited file with your ASV IDs, sample counts, and taxonomic classification, run the following in R:

# Read in asv table
asv_table <- read.delim("asv_table/asv-table.tsv", header=TRUE, row.names=NULL, stringsAsFactors=FALSE, na.strings = "n/a", skip = 1)
names(asv_table)[1] <- "Feature.ID"

silva <- read.delim('./asv_tax_dir/silva_taxonomy.tsv', header=TRUE, row.names=NULL)
names(silva)[2:3] <- paste("silva", names(silva)[2:3], sep="_")
phytoref <- read.delim('./asv_tax_dir/phytoref_taxonomy.tsv', header=TRUE, row.names=NULL)
names(phytoref)[2:3] <- paste("phytoref", names(phytoref)[2:3], sep="_")

output <- merge(asv_table, silva, by="Feature.ID")
output <- merge(output, phytoref, by="Feature.ID")

write.table(output, file="asv_count_tax.tsv", quote=FALSE, sep="\t", row.names=FALSE)