benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

Doubtful taxonomy assignment #1732

Closed Varun9599 closed 1 year ago

Varun9599 commented 1 year ago

Hello! I have been using SILVA v. 138 for taxonomic assignment. Since past few runs, I have been observing ASVs named "Lactobacillus hominis" prominently in the data. Although when I BLAST the corresponding sequences, it shows totally different bacteria. Is this a normal phenomenon? What steps should I take to ensure accurate taxonomic assignment?

benjjneb commented 1 year ago

Can you provide an example sequence that is getting this assignment?

Varun9599 commented 1 year ago

Thanks for the reply! This is one of such sequences in my ASV table: CCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTTTGACAGTTTCCAAAGCGAACTATGGTTGAGCCACAGCCTTTAACTTCAGACTTATCAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGT

salix-d commented 1 year ago

I tried and can't reproduce your results:

dada2::assignTaxonomy("CCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTTTGACAGTTTCCAAAGCGAACTATGGTTGAGCCACAGCCTTTAACTTCAGACTTATCAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGT", 
"~/SILVA138.1/silva_nr99_v138.1_train_set.fa.gz", 
tryRC = TRUE)
 Kingdom        Phylum      Class              Order              Family           Genus  
"Bacteria"  "Firmicutes"  "Bacilli"  "Lactobacillales"  "Streptococcaceae"  "Streptococcus"                                        

and for species

dada2::assignSpecies("CCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTTTGACAGTTTCCAAAGCGAACTATGGTTGAGCCACAGCCTTTAACTTCAGACTTATCAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGT", 
"~/SILVA138.1/silva_species_assignment_v138.1.fa.gz", 
tryRC = TRUE)                                                                                                                                                                                                                                                             Genus   Species
     NA          NA 
Varun9599 commented 1 year ago

@salix-d Initially I was using the SILVA train set titled "silva_nr99_v138.1_wSpecies_train_set.fa". This SILVA train set is annotating a lot of species, such as the one I mentioned. When I repeated the taxa assignment with the regular train set, I got the same taxa assignment as you!

salix-d commented 1 year ago

@Varun9599 Aaah!

Still a bit weird to get such a difference 🤔, but if I remember correctly, using the train set without species and then assigning species is what's suggested.

Another thing you could look at is the boot results. I suspect the boot result for the genera using the train set without species is stronger than the one using the train set with species.

Since the set with species does assign some species that sometimes have a stronger boot result, I made myself a function to merge the two result matrices while keeping the best assignment.

salix-d commented 1 year ago

I tried using the same train set and it gives me

dada2::assignTaxonomy("CCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTTTGACAGTTTCCAAAGCGAACTATGGTTGAGCCACAGCCTTTAACTTCAGACTTATCAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGT",
                      "~/SILVA138.1/silva_nr99_v138.1_wSpecies_train_set.fa.gz", 
                      tryRC = TRUE)
    Kingdom     Phylum   Class           Order           Family         Genus    Species 
1: Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus salivarius

Where you using tryRC = TRUE?

benjjneb commented 1 year ago

Confirming that I see the same results as reported by both of you.

Assignment to Lactobacillus hominis when tryRC=FALSE (the default), and assignment to Streptococcus salivarius when tryRC=TRUE.

So the issue here is that the sequence is in the reverse complement orientation to the reference database, the default operation of assignTaxonomy is to not check the reverse copmlement orientation (that is, tryRC=FALSE by default) and the weakness of the naive Bayesian classifier method to over-classification when suitable outgroup sequences aren't present.

So, for @Varun9599 the fix is easy, just add tryRC=TRUE to your assignTaxonomy calls. The question for me is whether that should just be the default, at the cost of 2x the computation time.

Varun9599 commented 1 year ago

Thank you @salix-d and @benjjneb! I was using the default parameters. tryRC=TRUE is generating better results! Just for clarification, is it preferable to use the train set without species and add species subsequently to avoid the inaccurate annotations?

benjjneb commented 1 year ago

Just for clarification, is it preferable to use the train set without species and add species subsequently to avoid the inaccurate annotations?

Our recommendations for short-read 16S is assignTaxonomy without species, then addSpecies to augment to the species level. For long-read 16S we recommend assignTaxonomy with species.

This is because the naive Bayesian classifier (which is the method implemented in assignTaxonomy) can be overly permissive in assigning to the species-level with short-read data, thus the different method (exact matching) for that taxonomic level as implemented in assignSpecies/addSpecies.

Varun9599 commented 1 year ago

@benjjneb Thank you!