benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Problem with multiple clusters of the same bacteria - too high resolution. #158

Closed wolass closed 7 years ago

wolass commented 7 years ago

I like the dada2 approach and I think it greatly enhances how we evaluate NGS data.

I understand how the dada2 algorithm makes unsupervised "OTUs"/sequence variants or whatever we call them. The problem is that... there are sometimes too many.

I was investigating the skin microbiome. After running the dada2 workflow (according to doi: 10.12688/f1000research.8986.2) I get multiple "OTUs" of the same bacterial species. I only have 45 samples, and when a variant appears in fewer than 3 samples it is discarded by the pipeline because it does not meet the "PREVALENCE THRESHOLD".

Please look at the figure below.

[Figure: fig2, clustered part]

This is the tree after assigning species-level taxonomy. Some of the variants matched a reference sequence and some did not.

You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.

And that is great if we want to conduct STRAIN LEVEL research. But what if we just want to focus on the species level? We would discard most of these and I think that's a waste. So let's try to make them count in our analysis!

But... if I agglomerated these using distances (even a small one), I would merge S. aureus and S. epidermidis together...

The difference between S. aureus and S. epidermidis is crucial for the analysis, as the former is a pathogen and the latter is part of the normal microbiota, and one prevents the other from expanding. So I need this classification.

I ended up clustering them together by hand. But there are probably other highly similar "OTUs" out there.

What I think is needed is a better automatic clustering of similar "OTUs" that takes the species information into account: one that clusters similar variants together (according to a distance) but prevents clustering if a species-level assignment already exists from a previous step.

Is this possible to do?

Or maybe we could assume a SPECIES NAME from the distance-based clustering, the way I did it by hand in the picture above? Specifying a species name would also preserve the information that these are indeed the same species but probably different strains, which can add value to the analysis.

Thanks for all the work you share!

benjjneb commented 7 years ago

You've hit on a fairly deep problem and not one that I have an easy answer for, although I do want to point out the tax_glom function in the phyloseq R package (which interfaces easily w/ dada2 output). tax_glom will lump together all variants that share a taxonomic assignment at the specified level.
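To make the idea of lumping variants by shared taxonomy concrete, here is a minimal Python sketch. It is not the phyloseq implementation or its API: the function name `glom_by_rank`, the dictionary layout, the rank list, and the choice to drop variants with no label at the requested rank are all assumptions made for this illustration.

```python
# Conceptual sketch only: sums per-sample counts of variants that share a
# lineage down to a chosen rank. All names here are hypothetical.
RANKS = ["Genus", "Species"]

def glom_by_rank(counts, taxonomy, rank):
    """Merge variants whose lineage is identical down to `rank`.

    counts:   {variant_id: [count in sample 1, count in sample 2, ...]}
    taxonomy: {variant_id: {"Genus": str or None, "Species": str or None}}
    Variants with no label at `rank` are dropped (a choice of this sketch).
    """
    depth = RANKS.index(rank) + 1
    merged = {}
    for variant, row in counts.items():
        lineage = tuple(taxonomy[variant].get(r) for r in RANKS[:depth])
        if lineage[-1] is None:
            continue  # unassigned at the requested rank
        previous = merged.get(lineage, [0] * len(row))
        merged[lineage] = [a + b for a, b in zip(previous, row)]
    return merged

counts = {
    "ASV1": [10, 0, 3],
    "ASV2": [5, 2, 0],
    "ASV3": [0, 7, 1],
    "ASV4": [1, 1, 1],
}
taxonomy = {
    "ASV1": {"Genus": "Staphylococcus", "Species": "epidermidis"},
    "ASV2": {"Genus": "Staphylococcus", "Species": "epidermidis"},
    "ASV3": {"Genus": "Staphylococcus", "Species": "aureus"},
    "ASV4": {"Genus": "Staphylococcus", "Species": None},
}

by_species = glom_by_rank(counts, taxonomy, "Species")
by_genus = glom_by_rank(counts, taxonomy, "Genus")
```

In this toy data the two epidermidis variants are summed, the aureus variant stays separate, and the variant without a species label only contributes at the genus level, which is exactly the incomplete-identification gap discussed next.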

However, that doesn't fix the problem you're seeing of incomplete identification at the species level. And, in some sense, there is no total answer to that problem because: (1) 16S gene regions often don't contain enough information to unambiguously assign to species level, and (2) the reference databases we use are incomplete.

The assignSpecies method is a conservative approach focused on leveraging the accuracy of DADA2: it requires an exact match to a classified reference 16S sequence to assign to the species level (note also the allowMultiple option). Classifying non-matching sequences is quite difficult, as the substitution rates in the 16S gene and the completeness of the reference databases both vary greatly among different bacterial clades.
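The exact-matching idea can be sketched in a few lines of Python. This is an illustration of the concept, not the dada2 code: the function `assign_species_exact`, its `allow_multiple` flag (loosely modeled on the allowMultiple option mentioned above), the toy reference sequences, and the decision to treat a match as the query occurring verbatim inside a reference sequence are all assumptions of this sketch.

```python
# Conceptual sketch only: species assignment by exact sequence match.
def assign_species_exact(query, references, allow_multiple=False):
    """Return the species whose reference contains `query` verbatim.

    references: list of (genus, species, sequence) tuples.
    Returns a sorted list of "Genus species" strings, or None when there is
    no exact match, or when the match is ambiguous and allow_multiple is False.
    """
    hits = sorted({f"{g} {s}" for g, s, seq in references if query in seq})
    if not hits:
        return None  # a single mismatch is enough to leave it unassigned
    if len(hits) == 1 or allow_multiple:
        return hits
    return None      # several species share this exact sequence

# Toy references; real 16S references are far longer.
references = [
    ("Staphylococcus", "aureus", "ACGTTGCAAC"),
    ("Staphylococcus", "epidermidis", "ACGTTGCTAC"),
    ("Staphylococcus", "capitis", "ACGTTGCTAC"),
]

unique_hit = assign_species_exact("GTTGCAA", references)
ambiguous = assign_species_exact("GTTGCTA", references)
ambiguous_all = assign_species_exact("GTTGCTA", references, allow_multiple=True)
no_hit = assign_species_exact("GTTGCGA", references)
```

The last query differs from every reference by one base and therefore stays unassigned, which is the conservative behavior described above and the reason some variants in a tree carry a species name while close neighbors do not.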

If interested, a couple recent discussions of different approaches to species assignment from 16S data:

SPINGO: a rapid species-classifier for microbial amplicon sequences

Validating taxonomy classifiers

As for me, like you, I'm currently doing assignment by hand for particular taxonomic groups of interest. assignSpecies helps, but using broader databases (e.g. BLAST against nt) and domain knowledge adds significantly more information than is available to the general-purpose assignment methods.

benjjneb commented 7 years ago

Closing as a broader issue that we are interested in but don't currently have plans on solving.

joey711 commented 7 years ago

I agree with everything @benjjneb said, including the motivation for closure of this issue.

I want to add a note for other users who might be similarly confused by legacy notions about OTUs from earlier work in the field. The OP is not alone in this confusion when they state:

" The problem is that... there are sometimes too many.

I get multiple "OTUs" of the same bacteria species.

You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.

What I think is needed is a better automatic clustering of similar "OTUs" that takes into the account the information about species. "

Problem, restated

The theme of this problem statement, restated, is: Does DADA2 return sequence features that are species?

The answer to this is:

No.

Explanation

DADA2 is not an OTU method. It is not attempting to cluster similar sequences into taxonomically-motivated groups. Instead, it infers what exact sequences are present in your PCR-amplified DNA sample.

Because of how sequence variation behaves relative to species distinctions the following things are all plausible and common

These are not mistakes, but real and expected biological phenomena. Some of the limitations are derived from known limitations of our reference databases and the algorithms used to annotate these sequences (but probably not a mistake by DADA2). The biggest hurdle in what the OP describes is simply the mistake of applying the old "OTU" way of describing the problem -- which approximately equates species and OTU -- to the exact sequence features returned by DADA2.

On the other hand, the OP seems interested in distinguishing different strains from skin. DADA2 is a very good choice for this problem and the data available, for reasons we've described and demonstrated in several places elsewhere. The post-DADA2 interpretation described here needs to incorporate the possibilities afforded by DADA2, which are relatively new and separate from what is often described for OTUs.

Hope that helps. Thanks for the post.

wolass commented 7 years ago

Dear @joey711 and @benjjneb, thank you for taking the time to give such a thorough response. @joey711 - I fully agree. Although I was aware that these are not species or even strains, your extensive response helped me and will be a guide for others who start their journey with dada2.

I have used both pipelines. With QIIME (with UCLUST), the sequences of S. aureus and S. epidermidis (which are very similar) were clustered together, which is why I was wondering why I was seeing so little S. epidermidis in my data (it is known to be relatively abundant on skin).

After using dada2 everything became clear, and the sample types in my data are now much more clearly defined by different and consistent DSV composition.
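The merging effect described above can be reproduced with a toy example. The Python sketch below is not UCLUST's algorithm: it is a simple greedy centroid clustering using Hamming identity on equal-length synthetic sequences, and the function names, the sequence labels, and the 97% threshold are assumptions chosen for illustration.

```python
# Conceptual sketch only: greedy centroid clustering at an identity threshold.
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """seqs: list of (name, sequence), assumed sorted by decreasing abundance.

    Each sequence joins the first centroid it matches at >= threshold,
    otherwise it becomes the centroid of a new cluster.
    """
    clusters = []  # list of (centroid_sequence, [member names])
    for name, seq in seqs:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]

# Synthetic 100-nt sequences, not real 16S data.
base = "ACGT" * 25
one_mismatch = base[:50] + "A" + base[51:]   # 99% identical to base
distant = "T" * 10 + base[10:]               # 92% identical to base

toy = [
    ("species_A_like", base),
    ("species_B_like", one_mismatch),
    ("other", distant),
]

otu_clusters = greedy_cluster(toy, threshold=0.97)
exact_variants = greedy_cluster(toy, threshold=1.0)
```

At the 97% threshold the two sequences that differ by a single base collapse into one cluster, so the less abundant one effectively disappears into the other. Requiring exact identity keeps all three apart, which mirrors the difference seen between the two pipelines.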

I'd like to use this last paragraph to reflect on what I have seen on this repository: (especially in this thread: #62 )

  1. The use of modern collaboration-facilitating tools (github) boosts science by enabling THE MOST RAPID exchange of ideas among researchers of various fields (I'm a dermatologist). This contrasts with the now accepted, but flawed, forms of disseminating results (articles in peer-reviewed journals).
  2. Vivid discussion on questionable and new topics (therefore relying on opinions or experience of the researchers and not actual scientific facts) can be conducted in a respectful and supportive manner without the loss of scientific quality.
  3. The growing trend towards open, reproducible research will soon propel it to become the state-of-the-art practice, and therefore CODING along with version control will be the next Latin for ALL scientists, independently of their field.
  4. The amount of work you (maintainers) put into responding to questions / issues is tremendous, and I'm afraid that you are not currently being properly credited for it in academia. This has to change, as evaluation of scientific achievements based on published research only is incomplete and fails to capture the impact a scientist's work has on society.

joey711 commented 7 years ago

Thanks @wolass, very nice of you to say. This is a concern that those in computational and informatics disciplines with biological applications have been struggling with for a while, and in some cases they have had pretty good success by associating a peer-reviewed publication with a server or repo. For instance, the content of the phyloseq article in PLoS ONE does not justify its number of views or citations, and I think everyone understands that those citations/views actually reflect the utility of the software package, and the work that continued well after the article was published. We agree on the deficiencies in that approach, but it does help bridge the "gap" while things evolve.