This document describes our procedure for processing 16S amplicon libraries using the 515F-Y/926R primer set (Parada et al. 2015). Amplicon sequence variants are generated with DADA2 (Callahan et al. 2016). For annotation, we primarily use the SILVA database but supplement with PhytoREF for plastid sequences.
QIIME2 visualizations (.qzv files) can be viewed at https://view.qiime2.org.
We normally but not exclusively generate these libraries from DNA and RNA extracted from Sterivex filters. See our automated extraction protocols here:
Activate your QIIME2 conda environment and cd to your working directory:
source activate qiime2-2019.10
cd $working_directory
This assumes your data are provided as demultiplexed paired-end fastq files with PHRED 33 encoded quality scores. If your data are in a different format, see the QIIME2 documentation on importing data.
First, generate a fastq manifest file which maps sample IDs to the full path of your fastq files (compressed as fastq.gz is also fine). The manifest is a tab-delimited file (.tsv) with the following headers:
sample-id forward-absolute-filepath reverse-absolute-filepath
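A minimal manifest might look like the following (the sample IDs and file paths are hypothetical; substitute your own, and make sure the columns are separated by literal tabs):

```shell
# Write a two-sample manifest (hypothetical sample IDs and paths).
# Columns must be tab-separated; paths must be absolute.
cat > manifest.tsv <<'EOF'
sample-id	forward-absolute-filepath	reverse-absolute-filepath
sample-1	/data/run1/sample-1_R1.fastq.gz	/data/run1/sample-1_R2.fastq.gz
sample-2	/data/run1/sample-2_R1.fastq.gz	/data/run1/sample-2_R2.fastq.gz
EOF
```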
Import the files and validate the output.
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.tsv \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
qiime tools validate paired-end-demux.qza
A visualization of the imported sequences is often helpful.
qiime demux summarize \
--i-data paired-end-demux.qza \
--o-visualization paired-end-demux.qzv
Remove the primers from reads with cutadapt.
Parameter notes: --p-adapter-f and --p-adapter-r use cutadapt's linked-adapter syntax: an anchored 5' primer (the leading ^) followed by the reverse complement of the opposite primer, so read-through into the other primer is trimmed as well. --p-error-rate 0.1 (10% mismatches allowed) and --p-overlap 3 (minimum 3 bp match) are cutadapt's defaults.
qiime cutadapt trim-paired \
--i-demultiplexed-sequences paired-end-demux.qza \
--p-cores 1 \
--p-adapter-f ^GTGYCAGCMGCCGCGGTAA...AAACTYAAAKRAATTGRCGG \
--p-adapter-r ^CCGYCAATTYMTTTRAGTTT...TTACCGCGGCKGCTGRCAC \
--p-error-rate 0.1 \
--p-overlap 3 \
--verbose \
--o-trimmed-sequences paired-end-demux-trimmed.qza
Visualize trimmed sequences:
qiime demux summarize \
--i-data paired-end-demux-trimmed.qza \
--p-n 100000 \
--o-visualization paired-end-demux-trimmed.qzv
Generate and quantify amplicon sequence variants (ASVs) with DADA2.
Parameter notes: --p-n-threads 0 uses all available cores. --p-trunc-len-f/-r truncate reads at the given position (shorter reads are discarded); choose these from the quality profiles in paired-end-demux-trimmed.qzv while keeping enough overlap for merging. --p-max-ee-f/-r discard reads exceeding that many expected errors. --p-trunc-q 2 truncates each read at the first base with quality <= 2.
mkdir asvs
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux-trimmed.qza \
--p-n-threads 0 \
--p-trunc-q 2 \
--p-trunc-len-f 219 \
--p-trunc-len-r 194 \
--p-max-ee-f 2 \
--p-max-ee-r 4 \
--p-n-reads-learn 1000000 \
--p-chimera-method pooled \
--o-table ./asvs/table-dada2.qza \
--o-representative-sequences ./asvs/rep-seqs-dada2.qza \
--o-denoising-stats ./asvs/stats-dada2.qza
Generate and examine the DADA2 stats. You should be retaining most (>50%) of your sequences. If you are losing a large number of sequences at a certain DADA2 step, you will need to troubleshoot and adjust your DADA2 parameters accordingly.
qiime metadata tabulate \
--m-input-file ./asvs/stats-dada2.qza \
--o-visualization ./asvs/stats-dada2.qzv
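As a quick command-line check, you can export the stats artifact (qiime tools export --input-path ./asvs/stats-dada2.qza --output-path ./asvs/stats) and compute the fraction of input reads surviving chimera removal. The column layout assumed below (input in column 2, non-chimeric in column 8) matches qiime2-2019.10 output but should be verified against your header row; the snippet builds a small illustrative stats.tsv so it runs stand-alone:

```shell
# Illustrative stats.tsv mimicking an exported DADA2 stats table.
# The layout is an assumption; a real file comes from `qiime tools export`.
printf 'sample-id\tinput\tfiltered\tpercentage of input passed filter\tdenoised\tmerged\tpercentage of input merged\tnon-chimeric\tpercentage of input non-chimeric\n' > stats.tsv
printf '#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n' >> stats.tsv
printf 'S1\t100000\t90000\t90\t88000\t80000\t80\t62000\t62\n' >> stats.tsv

# Percent of input reads retained after chimera removal; flag samples under 50%.
awk -F'\t' 'NR > 2 && $2 > 0 {
  pct = 100 * $8 / $2
  printf "%s\t%.1f%% retained%s\n", $1, pct, (pct < 50 ? "\tCHECK" : "")
}' stats.tsv
```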
If you have multiple sequencing runs, proceed with merging the table and sequences from separate DADA2 runs. If not, proceed to the summarization and export steps.
Merge (add additional lines for --i-tables and --i-data as needed):
qiime feature-table merge \
--i-tables ./asvs/run1_table-dada2.qza \
--i-tables ./asvs/run2_table-dada2.qza \
--o-merged-table merged_table-dada2.qza
qiime feature-table merge-seqs \
--i-data ./asvs/run1_rep-seqs-dada2.qza \
--i-data ./asvs/run2_rep-seqs-dada2.qza \
--o-merged-data merged_rep-seqs.qza
Summarize and export:
qiime tools export \
--input-path merged_table-dada2.qza \
--output-path asv_table
biom convert -i asv_table/feature-table.biom -o asv_table/asv-table.tsv --to-tsv
qiime tools export \
--input-path merged_rep-seqs.qza \
--output-path asvs
qiime feature-table tabulate-seqs \
--i-data merged_rep-seqs.qza \
--o-visualization merged_rep-seqs.qzv
If you have metadata, you can include it here:
qiime feature-table summarize \
--i-table merged_table-dada2.qza \
--o-visualization merged_table-dada2.qzv \
--m-sample-metadata-file metadata.tsv
QIIME2-compatible SILVA reference files can be obtained from the QIIME2 data resources page.
Extract the region between the primers and train the classifier:
qiime feature-classifier extract-reads \
--i-sequences ./db/silva/silva-138-99-seqs.qza \
--p-f-primer GTGYCAGCMGCCGCGGTAA \
--p-r-primer CCGYCAATTYMTTTRAGTTT \
--o-reads ./db/silva/silva-138-99-extracts.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ./db/silva/silva-138-99-extracts.qza \
--i-reference-taxonomy ./db/silva/silva-138-99-tax.qza \
--o-classifier ./db/silva/silva-138-99-classifier.qza
Classify the ASVs. The option --p-n-jobs -1 uses all available CPUs; adjust accordingly:
qiime feature-classifier classify-sklearn \
--p-n-jobs -1 \
--i-classifier ./db/silva/silva-138-99-classifier.qza \
--i-reads merged_rep-seqs.qza \
--o-classification silva_tax_sklearn.qza
qiime tools export \
--input-path silva_tax_sklearn.qza \
--output-path asv_tax_dir
mv asv_tax_dir/taxonomy.tsv asv_tax_dir/silva_taxonomy.tsv
For the PhytoREF plastid database, create a simple fasta file and a tab-delimited taxonomy file that links each fasta ID to its taxonomic string. For example:
$ head PhytoRef_with_taxonomy_simple.fasta
>202
GGCATGCTTAACACATGCAAGTTGAACGAAGACAAAAATTAGGCTTGCCAAAGTTTTGGA
CTTAGTAGCGGACGGGTGAGTAACGCGTAAGAATCTACCCCTAGGAAAAGCATAACAACT
GGAAACGGTTGCTAATACTTTATATGCTGAGAAGTTAAATGGGTTTCCGCCTAGGGATGA
$ head PhytoRef_taxonomy.tsv
202 HE610155|Eukaryota|Archaeplastida|Chlorophyta|Ulvophyceae|Ulvophyceae_X|Ulvophyceae_XX|Ulvophyceae_XXX|Ulvophyceae_XXXX|Desmochloris|Desmochloris halophila
803 AF514849|Eukaryota|Stramenopiles|Ochrophyta|Bacillariophyta|Bacillariophyceae|Naviculales|Naviculales_X|Naviculaceae|Haslea|Haslea crucigera
819 X80390|Eukaryota|Hacrobia|Haptophyta|Prymnesiophyceae|Prymnesiophyceae_X|Coccolithales|Coccolithales_X|Hymenomonadaceae|Ochrosphaera|Ochrosphaera neapolitana
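If you are starting from the original PhytoRef fasta, whose headers embed the taxonomy (assumed here to look like >202|HE610155|Eukaryota|...; check your copy), a short awk pass can split it into the simple fasta and the taxonomy table. The input filename and header layout are assumptions; the tiny stand-in input just lets the sketch run as-is:

```shell
# Tiny stand-in input so the sketch runs as-is; point awk at your real file.
printf '>202|HE610155|Eukaryota|Archaeplastida\nGGCATGCTTAA\n' > PhytoRef.fasta

# Assumed header format: ">ID|accession|taxonomy...".
# Writes a fasta with bare IDs plus an ID -> taxonomy TSV.
awk -F'|' '
  /^>/ {
    id = substr($1, 2)                      # strip the leading ">"
    print ">" id > "PhytoRef_with_taxonomy_simple.fasta"
    rest = substr($0, index($0, "|") + 1)   # everything after the ID
    print id "\t" rest > "PhytoRef_taxonomy.tsv"
    next
  }
  { print > "PhytoRef_with_taxonomy_simple.fasta" }  # sequence lines pass through
' PhytoRef.fasta
```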
Import the sequence data, select the region between the primers, and train the classifier:
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path ./db/phytoref/PhytoRef_with_taxonomy_simple.fasta \
--output-path ./db/phytoref/phytoref.qza
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path ./db/phytoref/PhytoRef_taxonomy.tsv \
--output-path ./db/phytoref/phytoref_tax.qza
qiime feature-classifier extract-reads \
--i-sequences ./db/phytoref/phytoref.qza \
--p-f-primer GTGYCAGCMGCCGCGGTAA \
--p-r-primer CCGYCAATTYMTTTRAGTTT \
--o-reads ./db/phytoref/phytoref_extracts.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ./db/phytoref/phytoref_extracts.qza \
--i-reference-taxonomy ./db/phytoref/phytoref_tax.qza \
--o-classifier ./db/phytoref/phytoref_classifier.qza
Classify ASVs:
qiime feature-classifier classify-sklearn \
--p-n-jobs -1 \
--i-classifier ./db/phytoref/phytoref_classifier.qza \
--i-reads merged_rep-seqs.qza \
--o-classification phytoref_tax_sklearn.qza
qiime tools export \
--input-path phytoref_tax_sklearn.qza \
--output-path asv_tax_dir
mv asv_tax_dir/taxonomy.tsv asv_tax_dir/phytoref_taxonomy.tsv
From here, you should be able to do additional analyses within QIIME2. If you would like to create a tab-delimited file with your ASV IDs, sample counts, and taxonomic classifications, run the following in R:
# Read the exported ASV count table. skip = 1 drops the "# Constructed from
# biom file" comment line; the "#OTU ID" header then becomes the first column.
asv_table <- read.delim("asv_table/asv-table.tsv", header = TRUE, row.names = NULL,
                        stringsAsFactors = FALSE, na.strings = "n/a", skip = 1)
names(asv_table)[1] <- "Feature.ID"

# Read the SILVA and PhytoREF assignments, prefixing their Taxon/Confidence
# columns so the two classifications remain distinguishable after merging
silva <- read.delim("./asv_tax_dir/silva_taxonomy.tsv", header = TRUE, row.names = NULL)
names(silva)[2:3] <- paste("silva", names(silva)[2:3], sep = "_")
phytoref <- read.delim("./asv_tax_dir/phytoref_taxonomy.tsv", header = TRUE, row.names = NULL)
names(phytoref)[2:3] <- paste("phytoref", names(phytoref)[2:3], sep = "_")

# Join counts with both taxonomies on the ASV ID and write a single table
output <- merge(asv_table, silva, by = "Feature.ID")
output <- merge(output, phytoref, by = "Feature.ID")
write.table(output, file = "asv_count_tax.tsv", quote = FALSE, sep = "\t", row.names = FALSE)