Problem with multiple clusters of the same bacteria - too high resolution.

I like the dada2 approach and I think it greatly enhances how we evaluate NGS data.

I understand how the dada2 algorithm makes unsupervised "OTUs"/sequence variants or whatever we call them. The problem is that... there are sometimes too many.

I was investigating the skin microbiome. After running the dada2 workflow (following doi: 10.12688/f1000research.8986.2) I get multiple "OTUs" of the same bacteria species. I only have 45 samples, and when a variant appears in fewer than 3 samples it is discarded by the pipeline because it does not meet the "PREVALENCE THRESHOLD".

Please look at the figure below.

[Figure: fig2clustered part]

This is the tree after assigning species-level taxonomy. Some of the variants matched a reference sequence and some did not.

You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.

And that is great if we want to conduct STRAIN LEVEL research. But what if we just want to focus on the species level? We would discard most of these, and I think that's a waste. So let's try to make them count in our analysis!

But... if I agglomerated these using distances (even a small one), I would merge S. aureus and S. epidermidis together...

The difference between S. aureus and S. epidermidis is crucial for the analysis, as the former is a pathogen and the latter is part of the normal microbiota, and one prevents the other from expanding. So I need this classification.

I ended up clustering them together by hand. But there are probably other highly similar "OTUs" out there.

What I think is needed is a better automatic clustering of similar "OTUs" that takes into account the information about species. That is, it would cluster similar sequences together (according to a distance) but prevent clustering where a species-level assignment already exists from a previous step.

Is this possible to do?

Or maybe we could assume the SPECIES NAME from the distance-based clustering, as I did by hand in the picture above? Specifying a species name would also preserve the information that these are indeed the same species but probably different strains, which can add value to the analysis.
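To make the idea concrete, here is a minimal Python sketch of species-constrained clustering. It is only an illustration under loud assumptions: the variant IDs, sequences, and labels are made up, the function names are hypothetical, and the distance is a plain Hamming distance on equal-length sequences (real amplicons would need an alignment or a proper distance first). It is not part of dada2 or phyloseq.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def constrained_clusters(seqs, species, max_dist):
    """Single-linkage clustering that never merges two different species labels.

    seqs:    {variant_id: sequence} (equal lengths assumed in this sketch)
    species: {variant_id: species label, or None if unassigned}
    Returns  {variant_id: (cluster_representative, inferred_label)}.
    """
    parent = {v: v for v in seqs}   # union-find forest
    label = dict(species)           # label carried by each cluster root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Visit pairs from closest to farthest (ties broken by variant ID).
    pairs = sorted(
        (hamming(seqs[a], seqs[b]), a, b)
        for a, b in combinations(sorted(seqs), 2)
    )
    for dist, a, b in pairs:
        if dist > max_dist:
            break
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        la, lb = label.get(ra), label.get(rb)
        if la is not None and lb is not None and la != lb:
            continue  # conflicting species assignments: keep them apart
        parent[rb] = ra
        label[ra] = la if la is not None else lb  # propagate the known label

    return {v: (find(v), label.get(find(v))) for v in seqs}

# Hypothetical toy data: two assigned variants and two unassigned ones.
seqs = {
    "sv1": "ACGTACGTAC",
    "sv2": "ACGTACGTAA",
    "sv3": "ACGTACGGTT",
    "sv4": "ACGTACGGTA",
}
species = {
    "sv1": "Staphylococcus aureus",
    "sv2": None,
    "sv3": "Staphylococcus epidermidis",
    "sv4": None,
}
result = constrained_clusters(seqs, species, max_dist=2)
for variant, (rep, name) in sorted(result.items()):
    print(variant, rep, name)
```

In this toy example sv2 and sv4 are only 2 mismatches apart, so unconstrained single linkage at that threshold would chain all four variants into one cluster; the label check keeps the two species separate and passes each name on to its unassigned neighbour. One caveat of this greedy approach: an unassigned variant that is equally close to two labelled clusters ends up with whichever pair is visited first, so the inferred names are guesses, not classifications.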

Thanks for all the work you share!

You've hit on a fairly deep problem and not one that I have an easy answer for, although I do want to point out the tax_glom function in the phyloseq R package (which interfaces easily w/ dada2 output). tax_glom will lump together all variants that share a taxonomic assignment at the specified level.
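For anyone unsure what this kind of taxonomic agglomeration amounts to, here is a minimal Python sketch of the idea. It is not phyloseq's implementation (in R you would call tax_glom on a phyloseq object); the function name, variant IDs, counts, and labels are all made up for illustration, and dropping unassigned variants is a choice made in this sketch.

```python
def glom_by_species(counts, species):
    """Sum per-sample counts of all variants that share a species label.

    counts:  {variant_id: [count in sample 0, count in sample 1, ...]}
    species: {variant_id: species label, or None if unassigned}
    Variants without a label are dropped in this sketch.
    """
    merged = {}
    for variant, per_sample in counts.items():
        name = species.get(variant)
        if name is None:
            continue  # nothing to agglomerate on
        if name not in merged:
            merged[name] = list(per_sample)
        else:
            merged[name] = [a + b for a, b in zip(merged[name], per_sample)]
    return merged

# Hypothetical count table: four variants across three samples.
counts = {
    "sv1": [10, 0, 3],
    "sv2": [2, 5, 0],
    "sv3": [0, 7, 1],
    "sv4": [4, 4, 4],
}
species = {
    "sv1": "Staphylococcus aureus",
    "sv2": "Staphylococcus aureus",
    "sv3": "Staphylococcus epidermidis",
    "sv4": None,  # no species-level assignment
}
merged = glom_by_species(counts, species)
print(merged)
```

Note what happens to sv4 here: because it has no species label, its counts do not end up in any species-level group.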

However, that doesn't fix the problem you're seeing of incomplete identification at the species level. And, in some sense, there is no total answer to that problem because: (1) 16S gene regions often don't contain enough information to unambiguously assign to species level, and (2) the reference databases we use are incomplete.

The assignSpecies method is a conservative approach focused on leveraging the accuracy of DADA2: it requires an exact match to a classified reference 16S sequence to assign to the species level (note also the allowMultiple option). Classifying non-matching sequences is quite difficult, as the substitution rates in the 16S gene and the completeness of the reference databases both vary greatly among different bacterial clades.
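As a rough illustration of exact-match assignment, here is a small Python sketch. It is not the dada2 code: the function name, the toy reference sequences, and the choice to treat "exact match" as the query occurring verbatim inside a reference are assumptions of the sketch, and the allow_multiple flag only mimics the spirit of the allowMultiple option.

```python
def assign_species_exact(query, references, allow_multiple=False):
    """Return a species name only if the query matches a reference exactly.

    references: list of (species_name, reference_sequence) pairs.
    A match here means the query occurs verbatim inside the reference
    (an assumption of this sketch). Returns None when nothing matches,
    or when several species match and allow_multiple is False.
    """
    hits = sorted({name for name, ref in references if query in ref})
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0]
    return "/".join(hits) if allow_multiple else None

# Hypothetical reference set; the sequences are toy strings, not real 16S.
references = [
    ("Staphylococcus aureus", "ACGTACGTTT"),
    ("Staphylococcus epidermidis", "ACGTACGTAA"),
    ("Staphylococcus capitis", "GGACGTACGTAAGG"),
]

unique_hit = assign_species_exact("ACGTACGTTT", references)
ambiguous = assign_species_exact("ACGTACGTAA", references)
ambiguous_all = assign_species_exact("ACGTACGTAA", references, allow_multiple=True)
one_mismatch = assign_species_exact("ACGTACGTTA", references)
print(unique_hit, ambiguous, ambiguous_all, one_mismatch)
```

The last query differs from a reference by a single base and gets no assignment at all, which is the conservative behaviour described above: a near match is never promoted to a species call.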

If interested, a couple recent discussions of different approaches to species assignment from 16S data:

SPINGO: a rapid species-classifier for microbial amplicon sequences
Validating taxonomy classifiers

As for me, like yourself I'm currently doing assignment by hand for particular taxonomic groups of interest. assignSpecies helps, but using broader databases (e.g. BLAST against nt) and domain knowledge adds significantly more information than is available to the general purpose assignment methods.

Closing as a broader issue that we are interested in but don't currently have plans on solving.

I agree with everything @benjjneb said, including the motivation for closure of this issue.

I want to add a note for other users who might be similarly confused by legacy notions about OTUs from earlier work in the field. The OP is not alone in this confusion when they state:

" The problem is that... there are sometimes too many.

I get multiple "OTUs" of the same bacteria species.

You can clearly see that some of these bacteria have HIGHLY similar sequences. Therefore they are probably THE SAME SPECIES.

What I think is needed is a better automatic clustering of similar "OTUs" that takes into account the information about species. "

Problem, restated

The theme of this problem statement, restated, is: Does DADA2 return sequence features that are species?

The answer to this is:

No.

Explanation

DADA2 is not an OTU method. It is not attempting to cluster similar sequences into taxonomically-motivated groups. Instead, it infers what exact sequences are present in your PCR-amplified DNA sample.

Because of how sequence variation behaves relative to species distinctions, the following things are all plausible and common:

(1) The exact same genome can have different sequences, because of multiple copies of the target gene. DADA2 can detect these, and I've verified this in real data from known genomes, and moreover that the count ratios match what you'd expect.
(2) Two strains from the same species can have the same sequence in the region you've amplified. No algorithm, DADA2 included, can distinguish the strains in this case. However, you'd still have a shot at classifying the species correctly. Note how the species classification is referenced as a procedure that is separate from - and happens after - DADA2. That is because it is. The dada2 R package provides some classification functions for convenience, but the DADA2 algorithm is not a taxonomic classification algorithm, and should not be confused with one.
(3) Two strains from the same species can have different sequences in the region amplified. DADA2 can often tell these apart, even if they differ by 1 letter. These will likely be called the same species by your taxonomic classification step, if species is called at all.
(4) Two different species can have the same sequence in the region you've amplified. This becomes less likely with increasing amplicon sequence length and increasing target region variability, but there are plenty of examples where two strains that are called as being different species actually have the same sequence within a particular variable region. This is either a mistake in the taxonomy assigned to each strain, or simply an evolutionary edge case that we have to expect from time to time, because that small region of the 16S did not accumulate changes since their MRCA.
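The four cases above can be seen side by side in a toy Python example. Everything in it is hypothetical: the strain names, the species names, and the short strings standing in for the amplified region of each gene copy.

```python
from collections import defaultdict

# Hypothetical strains: (species, amplicon sequence of each gene copy).
strains = {
    "strain_A": ("Species x", ["ACGTAA", "ACGTAT"]),  # case 1: copies differ within one genome
    "strain_B": ("Species x", ["ACGTAA"]),            # case 2: same species and sequence as strain_A
    "strain_C": ("Species x", ["ACGTCC"]),            # case 3: same species, different sequence
    "strain_D": ("Species y", ["ACGTCC"]),            # case 4: different species, same sequence as strain_C
}

species_by_seq = defaultdict(set)   # exact sequence -> species that carry it
strains_by_seq = defaultdict(set)   # exact sequence -> strains that carry it
seqs_by_species = defaultdict(set)  # species -> exact sequences observed

for strain, (sp, copies) in strains.items():
    for seq in copies:
        species_by_seq[seq].add(sp)
        strains_by_seq[seq].add(strain)
        seqs_by_species[sp].add(seq)

for seq in sorted(species_by_seq):
    print(seq, sorted(strains_by_seq[seq]), sorted(species_by_seq[seq]))
```

An exact-sequence method can only ever see the keys of these dictionaries: three sequences in total, one of which is shared by two strains of a single species and another by two different species.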

These are not mistakes, but real and expected biological phenomena. Some of the limitations are derived from known limitations of our reference databases and the algorithms used to annotate these sequences (but probably not a mistake by DADA2). The biggest hurdle in what the OP describes is simply the mistake of applying the old "OTU" way of describing the problem, which approximately equates species and OTU, to the exact sequence features returned by DADA2.

On the other hand, the OP seems interested in distinguishing different strains from skin. DADA2 is a very good choice for this problem and the data available, for reasons we've described and demonstrated in several places elsewhere. The post-DADA2 interpretation described here needs to incorporate the possibilities afforded by DADA2, which are relatively new and separate from what is often described for OTUs.

Hope that helps. Thanks for the post.

Dear @joey711 and @benjjneb, thank you for taking the time to give such a thorough response. @joey711 - I fully agree. Although I'm aware that these are not species or even strains, your extensive response helped me and will be a guide for others who start their journey with dada2.

I have used both pipelines. With QIIME (with UCLUST), the sequences of S. aureus and S. epidermidis (which are similar) were clustered together, which is why I was wondering why I was seeing so little S. epidermidis in my data (it is known to be relatively abundant on skin).

After using dada2, everything became clear, and the sample types in my data are now much more clearly defined by different and consistent DSV composition.

I'd like to use this last paragraph to reflect on what I have seen in this repository (especially in thread #62):

The use of modern collaboration-facilitating tools (github) boosts science by enabling THE MOST RAPID exchange of ideas among researchers from various fields (I'm a dermatologist). This contrasts with the currently accepted but flawed forms of disseminating results (articles in peer-reviewed journals).
Lively discussion of debatable and new topics (which therefore relies on the opinions or experience of the researchers and not on established scientific facts) can be conducted in a respectful and supportive manner without loss of scientific quality.
The growing trend towards open, reproducible research will soon make it the state-of-the-art practice, and therefore CODING along with version control will be the next Latin for ALL scientists, regardless of their field.
The amount of work you (maintainers) put into responding to questions / issues is tremendous, and I'm afraid that you are not currently being properly credited for it in academia. This has to change, as evaluating scientific achievements based on published research alone is incomplete and fails to capture the impact a scientist's work has on society.

Thanks @wolass, very nice of you to say. This is a concern that those in computational and informatics disciplines with biological applications have been struggling with for a while, and in some cases have had pretty good success by associating a peer-reviewed publication with a server or repo. For instance, the content of the phyloseq article in PLoS ONE does not justify its number of views or citations, and I think everyone understands that those citations/views actually reflect the utility of the software package, and the work that continued well after the article was published. We agree on the deficiencies in that approach, but it does help bridge the "gap" while things evolve.

benjjneb / dada2

Problem with multiple clusters of the same bacteria - too high resolution. #158
