benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

assigntaxonomy confidence? #976

Closed phorve closed 4 years ago

phorve commented 4 years ago

Hey all,

If this has already been discussed, feel free to point me in that direction! I looked through and couldn't find anything but it's possible I missed it!

In discussing species- and genus-level assignment with a colleague, we started trying to figure out the minimum 16S sequence length you would need to be confident in the assignment made at a given rank (i.e., kingdom, phylum, etc.). For example, if you have poor-quality reads, use only the forward reads, and have to trim 16S V3/V4 down to 125 nt, could you still be confident in a genus-level assignment? We've been analyzing some older data in preparation for a dada2 run-through and found some data that matches this description, but we don't know whether we can trust it. At what point can we trust it? Is there guidance on where these cutoffs should be set?

benjjneb commented 4 years ago

assignTaxonomy just implements the naive Bayesian classifier method; it is not an original method. The paper describing that method is here, and it includes quite a bit of accuracy benchmarking, including at different read lengths and at different taxonomic levels: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy

Depending on how deep you want to get into these weeds, some additional good papers looking at taxonomic assignment in 16S data are:

IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences
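
To make the idea of "confidence" in the naive Bayesian classifier concrete, here is a minimal Python sketch of a k-mer classifier with bootstrap support. It is a toy illustration, not the dada2 or RDP implementation: the function names (`train`, `classify`, `bootstrap_confidence`), the two made-up reference sequences, and the simple add-half smoothing are all assumptions chosen for brevity. The 8-mer word size and the resampling of roughly one eighth of the words follow the general scheme described in the naive Bayesian classifier paper.

```python
import math
import random
from collections import defaultdict

K = 8  # word (k-mer) size; assumption following the classifier paper


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references):
    """Build a toy model from {genus: [training sequences]}.

    For each genus, store how many of its sequences contain each word,
    together with the number of training sequences.
    """
    model = {}
    for genus, seqs in references.items():
        counts = defaultdict(int)
        for s in seqs:
            for w in kmers(s):
                counts[w] += 1
        model[genus] = (counts, len(seqs))
    return model


def classify(words, model):
    """Return the genus with the highest summed log-probability."""
    best, best_score = None, -math.inf
    for genus in sorted(model):  # sorted so ties resolve deterministically
        counts, n = model[genus]
        # Simplified add-half smoothing (an assumption, not the paper's
        # exact word-specific prior) so unseen words never give log(0).
        score = sum(math.log((counts.get(w, 0) + 0.5) / (n + 1)) for w in words)
        if score > best_score:
            best, best_score = genus, score
    return best


def bootstrap_confidence(seq, model, n_boot=100, seed=0):
    """Classify seq, then report the percentage of bootstrap replicates
    (each using about 1/K of the words, drawn with replacement) that
    agree with the full-sequence assignment."""
    words = sorted(kmers(seq))
    if not words:
        raise ValueError(f"sequence must be at least {K} nt long")
    rng = random.Random(seed)
    full_call = classify(words, model)
    n_sub = max(1, len(words) // K)
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(words) for _ in range(n_sub)]
        if classify(sample, model) == full_call:
            hits += 1
    return full_call, 100.0 * hits / n_boot


if __name__ == "__main__":
    # Hypothetical reference sequences, for illustration only.
    refs = {
        "GenusA": ["ACGTACGTTTGACCA" * 5],
        "GenusB": ["TTGGCCAATCGGATC" * 5],
    }
    model = train(refs)
    query = refs["GenusA"][0][:40]  # a short, trimmed "read"
    print(bootstrap_confidence(query, model))
```

The point of the sketch is that a shorter read contributes fewer words, so each bootstrap replicate rests on less evidence and agreement between replicates tends to drop when closely related taxa share most of their words. In dada2, `assignTaxonomy` exposes the threshold on this bootstrap value through its `minBoot` argument (default 50) and returns the per-rank values when called with `outputBootstraps=TRUE`, which is one practical way to check how much support trimmed reads actually receive.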