benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 142 forks source link

Taxonomic assignment issue on mock community from Zymo #1753

Closed chrissy005 closed 1 year ago

chrissy005 commented 1 year ago

This is my first time using DADA2. I followed the DADA2 tutorials as well as the inputs from https://github.com/ErnakovichLab/dada2_ernakovichlab to process a mock community sample sequenced for 16S V4 amplicons using Novaseq (paired-end sequencing). The mock community is from Zymo (https://files.zymoresearch.com/protocols/_d6300_zymobiomics_microbial_community_standard.pdf) and contains a total of 8 bacterial strains.

Since these amplicons were sequenced on a Novaseq platform. I have attempted to run the all 5 error models (the traditional DADA2 error model as well as the 4 error models described in https://github.com/ErnakovichLab/dada2_ernakovichlab.

The zipped folder I am attaching contains:

The fastq reads I received from the sequencing facility with barcodes and adapters trimmed were at 200bp instead of the regular 250bp. Not quite sure why this was the case and I was not provided with an explanation for the same. I did attempt a cutadapt primer trimming on the reads and found that none of the sequences contained primers. As the V4 region is completely overlapping , I truncated both the forward and reverse reads at 150bp.

All the steps of the DADA2 pipeline did run successfully on the data. However, I think I may be facing some issues with the taxonomic assignment part in DADA2.

I have tried assigning taxonomy in 3 different ways: 1) The naive Bayesian classifier method and the SILVA v138.1 database.

With this method I only get Taxonomic assignment till the Genus level and only ASVs matching 6 out of the 8 bacterial strains across all 5 error models. No species assignment with any. I am assuming this could be a database problem as SILVA may not have the exact reference sequence of the rDNA copy sequenced here.

2) The naive Bayesian classifier method and the reference sequences provided by Zymo.

I formatted the Zymo provided references https://s3.amazonaws.com/zymo-files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip following the format custom database in https://benjjneb.github.io/dada2/training.html. I have included this file in the zipped folder and is titled zymo_mockspecies.fasta.

With this method, I only get taxonomic assignments of the ASVs till the Family level. ASVs matching 5-6 strains out of 8 across all 5 error models used to infer the ASVs.

Absolutely nothing at the genus and species levels. I hope this is not due to a database formatting issue.

This was the most shocking as all the exact species references are provided as reference sequences.

3) IDTAXA from the Decipher package using SILVA SSU r138 (modified) database (http://www2.decipher.codes/Downloads.html).

With this method I get taxonomic assignments till the genus level but only for ASVs matching 2-3 bacterial strains out of 8. once again nothing at the species level.

Puzzled by all the above I extracted the FASTA sequences of the ASVs inferred and merged using the traditional DADA2 error model (t150.nonchimeric.mock.T.fasta) and mapped it to the Zymo reference sequences on Geneious. Based on the mapping, I do get 100% pairwise identities of 6 out of the 8 bacterial strains in the mock community. I am also including this file showing the mapping results in Geneious (ASvs.mappedto.ZymoRef.geneious). Not sure if you can open this folder. image

I also attempted the "Evaluating DADA2’s accuracy on the mock community" in your tutorial https://benjjneb.github.io/dada2/tutorial_1_8.html. But ran into an error with the list argument even after unlisting the sequences."Error in sum(sapply(names(unqs.mock.nt.T), function(x) any(grepl(x, mock.ref)))) : invalid 'type' (list) of argument". I am also attaching the R markdown file to this (zymo.mock.t150.Rmd)

If the ASV sequences can be mapped at 100% pairwise identity to the Zymo references, then I should be able to get taxonomic assignment to the species level for 6 of the bacterial strains using naive Bayesian classifier method and the reference sequences provided by Zymo right?

Noot sure if if it is something in my scripts that is preventing the other 2 bacterial strains from being detected or a problem with the mock community itself? Do let me know if you find anything weird in the way I have processed the data.

I await your help on this matter and looking forward to hear from you.

trim.150.zip

chrissy005 commented 1 year ago

In addition to my questions above, has anyone using Novaseq sequencing found "the most appropriate error model"? Or rather what is the current workaround for this kind of data in the community?

benjjneb commented 1 year ago

The naive Bayesian classifier method (as implemented in assignTaxonomy) is not really designed to make species-level identifications to short-read 16S data. In most cases, one can't make unambiguous species-level IDs from V4 16S against a general bacterial database, as there isn't enough information to distinguish between related species. So the results you describe in (1) sound exactly as expected.

We do provide the assignSpecies/addSpecies method the try to do species ID on short-read 16S by exact matching. This requires a different format of the reference database. Is this something you have tried to do?

For a mock community, because you have a very strong prior knowledge about what should be in there, you could use the reference 16S sequences and and assignSpecies to get species IDs for the mock community members. That said, the zymo.mockspecies.fasta file you created is not in the correct format for that (nor is it in the correct format for assignTaxonomy). IN particular, the id lines need to be formatted in the correct way for use with those functions. See information here: https://benjjneb.github.io/dada2/training.html#formatting-custom-databases

chrissy005 commented 1 year ago

I did try both assigntaxonomy and addSpecies consecutively (on the same object), just as it is indicated in the tutorial. as follows: taxa.nb.T <- assignTaxonomy(seqtab.nochim.T, "silva_nr99_v138.1_train_set.fa", tryRC = TRUE, multithread=TRUE) taxa.nb.T <- addSpecies(taxa.nb.T, "silva_species_assignment_v138.1.fa")

Should I store the results of the addSpecies function to another object? If I understand correctly the addSpecies would add the Species column right?

I will try the assignSpecies function as you suggested with the reference 16S sequences.

I think the format got changed as I used the file directly from the program. However on the server where I run the R codes the zymo.mockspecies .fasta are in the following formats which I formatted based on the link you shared:

assignTaxonomy used references with Id lines as follows:

Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus1; TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCTG Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus2; TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus3; TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC

addSpecies used references with Id lines as follows:

B-354 Bacillus subtilis1 TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC B-354 Bacillus subtilis2 TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC B-354 Bacillus subtilis3 TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC

benjjneb commented 1 year ago

The addition of the numbers on the terminal taxonomic levels (genus or species) is not an appropriate format for using assignTaxonomy or assignSpecies. When formatted in this way, the algorithms interpret every one of these number instances of e.g. B. subtilis as a different species. This will result in an NA assignment much of the time, as the query will be matched to multiple entries of B. subtilis X/Y/Z, which will then be interpreted as ambiguous assignment.

chrissy005 commented 1 year ago

Oh right. So how should the id lines be formatted for the different rDNA copy references of the same species/sub-species? Could I append an alphabet instead as shown below?

assignTaxonomy used references with Id lines as follows:

Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillusa; TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCTG Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillusb; TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC

addSpecies used references with Id lines as follows:

B-354 Bacillus subtilisa TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC B-354 Bacillus subtilisb TTTATCGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACAGATGGGAGCTTGCTCCC

benjjneb commented 1 year ago

You don't append anything to the species name (or genus name).

The species/genus is the species/genus, no matter which reference strain or which allele in a reference strain the sequence came from.

chrissy005 commented 1 year ago

So how does one identify as to which reference sequence the obtained ASV had matched to. Is this purely by altering the ‘accession no’ then? e.g. B-254 to B-254.1. Two different reference sequences cannot have the same ID line right?Apologies for going back and forth on this.Regards ChristalineOn Jun 15, 2023, at 8:16 AM, Benjamin Callahan @.***> wrote: You don't append anything to the species name (or genus name). The species/genus is the species/genus, no matter which reference strain or which allele in a reference strain the sequence came from.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

benjjneb commented 1 year ago

assignSpecies and assignTaxonomy don't return the specific reference sequences that were matched, and aren't intended to. They attempt to assign taxonomy based on the collection of reference sequences in the reference database.

Two different reference sequences cannot have the same ID line right?

Yes they can. And especially in assignTaxonomy there should be multiple entries with the same ID line, if there are multiple distinct sequences that belong to the same terminal taxonomic level.

chrissy005 commented 1 year ago

Thank you very much for the clarification. Will attempt taxonomic assignment after cleaning up the Ids.Regards ChristalineOn Jun 15, 2023, at 9:01 AM, Benjamin Callahan @.***> wrote: assignSpecies and assignTaxonomy don't return the specific reference sequences that were matched, and aren't intended to. They attempt to assign taxonomy based on the collection of reference sequences in the reference database.

Two different reference sequences cannot have the same ID line right?

Yes they can. And especially in assignTaxonomy there should be multiple entries with the same ID line, if there are multiple distinct sequences that belong to the same terminal taxonomic level.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>