Closed wolass closed 7 years ago
You've hit on a fairly deep problem, and not one that I have an easy answer for, although I do want to point out the tax_glom function in the phyloseq R package (which interfaces easily with dada2 output). tax_glom will lump together all variants that share a taxonomic assignment at the specified level.
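To make the idea of agglomeration concrete, here is a minimal Python sketch of summing the counts of variants that share a taxonomic assignment at a chosen rank. It is not the phyloseq implementation: the function name `agglomerate_by_rank`, the dictionary layout, and the toy counts are all assumptions made for illustration. Dropping variants that are unassigned at the chosen rank is meant to mirror what tax_glom does by default, as I understand it.

```python
RANKS = ["Genus", "Species"]  # hypothetical, shortened rank list for the example


def agglomerate_by_rank(counts, taxonomy, rank):
    """Sum per-sample counts of variants sharing the same lineage down to `rank`.

    counts:   {variant_id: [count_in_sample_0, count_in_sample_1, ...]}
    taxonomy: {variant_id: {"Genus": str or None, "Species": str or None}}
    """
    keep = RANKS[: RANKS.index(rank) + 1]
    merged = {}
    for variant, row in counts.items():
        lineage = tuple(taxonomy[variant].get(r) for r in keep)
        if lineage[-1] is None:
            continue  # unassigned at this rank: dropped in this sketch
        if lineage in merged:
            merged[lineage] = [a + b for a, b in zip(merged[lineage], row)]
        else:
            merged[lineage] = list(row)
    return merged


# Toy data (made up): four sequence variants across three samples.
counts = {"sv1": [10, 0, 3], "sv2": [5, 2, 0], "sv3": [0, 7, 1], "sv4": [1, 1, 1]}
taxonomy = {
    "sv1": {"Genus": "Staphylococcus", "Species": "aureus"},
    "sv2": {"Genus": "Staphylococcus", "Species": "aureus"},
    "sv3": {"Genus": "Staphylococcus", "Species": "epidermidis"},
    "sv4": {"Genus": "Staphylococcus", "Species": None},  # no species-level match
}

by_species = agglomerate_by_rank(counts, taxonomy, "Species")
by_genus = agglomerate_by_rank(counts, taxonomy, "Genus")
```

In this toy example the two S. aureus variants collapse into one row, while the variant without a species assignment only contributes at the genus level. That is the limitation described next.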
However, that doesn't fix the problem you're seeing of incomplete identification at the species level. And, in some sense, there is no total answer to that problem because: (1) 16S gene regions often don't contain enough information to unambiguously assign to species level, and (2) the reference databases we use are incomplete.
The assignSpecies method is a conservative approach focused on leveraging the accuracy of DADA2: it requires an exact match to a classified reference 16S sequence to assign to the species level (note also the allowMultiple option). Classifying non-matching sequences is quite difficult, as the substitution rates in the 16S gene and the completeness of the reference databases both vary greatly among different bacterial clades.
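As a rough illustration of exact-match species assignment, here is a small Python sketch. It is not the dada2 code: the function `assign_species_exact`, the `allow_multiple` flag (loosely modelled on the allowMultiple option), and the toy reference sequences are assumptions. It also assumes that "exact match" means the query occurs verbatim inside a reference sequence.

```python
def assign_species_exact(query, references, allow_multiple=False):
    """Return species whose reference sequence contains `query` exactly.

    references: list of (species_name, reference_sequence) pairs.
    With allow_multiple=False, an ambiguous query (more than one species hit)
    gets no species assignment at all.
    """
    hits = sorted({name for name, ref_seq in references if query in ref_seq})
    if len(hits) == 1 or (allow_multiple and hits):
        return hits
    return []


# Made-up toy references; real 16S references are far longer.
refs = [
    ("Staphylococcus aureus", "AAACCCGGGTTT"),
    ("Staphylococcus epidermidis", "AAACCCGGGTTA"),
    ("Staphylococcus capitis", "TTACCCGGGTTA"),
]

unique_hit = assign_species_exact("CCCGGGTTT", refs)
ambiguous_strict = assign_species_exact("ACCCGGGTTA", refs)
ambiguous_multi = assign_species_exact("ACCCGGGTTA", refs, allow_multiple=True)
no_hit = assign_species_exact("CCCGGGAAA", refs, allow_multiple=True)
```

The second query matches two references exactly, so the strict mode declines to name a species. A query with even one mismatch to every reference stays unassigned in either mode, which is why many real variants end up without a species name.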
If interested, here are a couple of recent discussions of different approaches to species assignment from 16S data:
SPINGO: a rapid species-classifier for microbial amplicon sequences
Validating taxonomy classifiers
As for me, like yourself, I'm currently doing assignment by hand for particular taxonomic groups of interest. assignSpecies helps, but using broader databases (e.g. BLAST against nt) and domain knowledge adds significantly more information than is available to the general-purpose assignment methods.
Closing as a broader issue that we are interested in but don't currently have plans to solve.
I agree with everything @benjjneb said, including the motivation for closure of this issue.
I want to add a note for other users who might be similarly confused by legacy notions about OTUs from earlier work in the field. The OP is not alone in this confusion when they state:
" The problem is that... there are sometimes too many.
I get multiple "OTUs" of the same bacteria species.
You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.
What I think is needed is a better automatic clustering of similar "OTUs" that takes into the account the information about species. "
The theme of this problem statement, restated, is: Does DADA2 return sequence features that are species?
The answer to this is:
No.
DADA2 is not an OTU method. It is not attempting to cluster similar sequences into taxonomically-motivated groups. Instead, it infers what exact sequences are present in your PCR-amplified DNA sample.
Because of how sequence variation behaves relative to species distinctions, the following things are all plausible and common
These are not mistakes, but real and expected biological phenomena. Some of the limitations are derived from known limitations of our reference databases and the algorithms used to annotate these sequences (but probably not a mistake by DADA2). The biggest hurdle in what the OP describes is simply the mistake of applying the old "OTU" way of describing the problem -- which approximately equates species and OTU -- to the exact sequence features returned by DADA2.
On the other hand, the OP seems interested in distinguishing different strains from skin. DADA2 is a very good choice for this problem and the data available, for reasons we've described and demonstrated in several places elsewhere. The post-DADA2 interpretation described here needs to incorporate the possibilities afforded by DADA2, which are relatively new and separate from what is often described for OTUs.
Hope that helps. Thanks for the post.
Dear @joey711 and @benjjneb, thank you for taking the time to give such a thorough response. @joey711 - I fully agree. Although I'm aware that these are not species or even strains, your extensive response helped me and will be a guide for others who start their journey with dada2.
I have used both pipelines. With QIIME (with UCLUST), the sequences of S. aureus and S. epidermidis (which possess similar sequences) were clustered together, so I was wondering why I was seeing so little S. epidermidis in my data (it is known to be relatively abundant on skin).
After using dada2, everything became clear, and the sample types in my data are now much more clearly defined by different and consistent DSV composition.
I'd like to use this last paragraph to reflect on what I have seen on this repository: (especially in this thread: #62 )
Thanks @wolass, very nice of you to say. This is a concern that those in computational and informatics disciplines with biological applications have been struggling with for a while, and in some cases they have had pretty good success by associating a peer-reviewed publication with a server or repo. For instance, the content of the phyloseq article in PLoS ONE does not justify its number of views or citations, and I think everyone understands that those citations/views actually reflect the utility of the software package, and the work that continued well after the article was published. We agree on the deficiencies in that approach, but it does help bridge the "gap" while things evolve.
I like the dada2 approach and I think it greatly enhances how we evaluate NGS data.
I understand how the dada2 algorithm makes unsupervised "OTUs"/sequence variants or whatever we call them. The problem is that... there are sometimes too many.
I was investigating the skin microbiome. After performing the dada2 approach (according to doi: 10.12688/f1000research.8986.2) I get multiple "OTUs" of the same bacteria species. I only have 45 samples, and when a variant is seen in fewer than 3 samples, it is discarded by the pipeline because it does not meet the "PREVALENCE THRESHOLD".
Please look at the figure below.
This is the tree after assigning species-level taxonomy. Some of the variants aligned to a reference sequence and some did not.
You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.
And that is great if we want to conduct STRAIN LEVEL research. But what if we want to just focus on the species level? We would discard most of these and I think that's a waste. So let's try to make them count in our analysis!
But... if I agglomerated these together using distances (even a small one), I would merge S. aureus and S. epidermidis together...
Differences between S. aureus and S. epidermidis are crucial for the analysis, as the former is a pathogen and the latter a member of the normal microbiota, and one prevents the other from expanding. So I need this classification.
I ended up clustering them together by hand. But there are probably other highly similar "OTUs" out there.
What I think is needed is a better automatic clustering of similar "OTUs" that takes into the account the information about species. That is, it would cluster similar sequences together (according to a distance) but prevent clustering when a species-level assignment from a previous step already exists and conflicts.
Is this possible to do?
Or maybe we could assume the SPECIES NAME according to clustering based on distance, as I did by hand in the picture above? Specifying a species name would also preserve the information that these are indeed the same species but probably different strains, and can add value to the analysis.
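The clustering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something dada2 or phyloseq provides: the function names, the use of Hamming distance on equal-length sequences, the single-linkage merging, and the toy sequences are all hypothetical choices. Clusters are merged when two members are within `max_dist`, unless the two clusters already carry different species labels. Unlabelled members then inherit the label of their cluster.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


def constrained_clusters(seqs, labels, max_dist):
    """Single-linkage clustering that never merges conflicting species labels.

    seqs:   {variant_id: sequence}
    labels: {variant_id: species name}; missing or None means unassigned
    Returns {variant_id: (cluster_representative, inferred_species_or_None)}.
    """
    parent = {i: i for i in seqs}
    species = {i: labels.get(i) for i in seqs}  # label carried by each cluster root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Visit pairs from closest to farthest; ties are broken by variant id.
    pairs = sorted(
        (hamming(seqs[a], seqs[b]), a, b) for a, b in combinations(sorted(seqs), 2)
    )
    for dist, a, b in pairs:
        if dist > max_dist:
            break
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        la, lb = species[ra], species[rb]
        if la is not None and lb is not None and la != lb:
            continue  # both clusters are named and the names differ: keep apart
        parent[rb] = ra
        species[ra] = la if la is not None else lb
    return {i: (find(i), species[find(i)]) for i in seqs}


# Made-up toy variants: sv1 and sv3 differ by one base but have different names.
seqs = {
    "sv1": "ACGTACGTAC",
    "sv2": "ACGTACGTAT",
    "sv3": "ACGTACGTGC",
    "sv4": "ACGTACGCGC",
    "sv5": "TTTTTTTTTT",
}
labels = {"sv1": "S. aureus", "sv3": "S. epidermidis"}

result = constrained_clusters(seqs, labels, max_dist=1)
```

Here sv2 joins the S. aureus cluster and sv4 joins the S. epidermidis cluster, while the two named variants stay separate even though they are only one mismatch apart. One caveat of this sketch: an unlabelled variant that is equally close to two differently named clusters is assigned to whichever pair is visited first, so the inferred names should be treated as hypotheses and not as assignments.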
Thanks for all the work you share!