What to call a dada2 sequence?

jeffkimbrel commented 8 years ago

As I explain to other researchers the merits of DADA2, I am always at a loss as to what to call the final sequences from the DADA2 results. I usually just say "OTU", because it is easy to understand, but then quickly clarify they aren't actually OTUs or the results of clustering. I also call them "sequences", but that is too vague of a term. I also sometimes use "taxa", but that won't work universally if people use DADA2 on something other than 16S type data. "Amplicons" also doesn't seem to fit.

Is there any sort of consensus about what to call the individual sequences?

spholmes commented 8 years ago

Can we call them strains?

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/benjjneb/dada2/issues/62

Susan Holmes Professor, Statistics and BioX John Henry Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

benjjneb commented 8 years ago

This is a good question, and you aren't the first to wonder. I've started calling them "sequence variants" or "ribosomal sequence variants (RSVs)" in the 16S context.

I don't think it's appropriate to call them strains, as there is not a 1-1 map between even full-length 16S sequences and bacterial strains, much less outside the 16S context. I think "sequence variant" concisely conveys what it is that DADA2 outputs, and makes sense for whatever genetic locus is being amplified.

spholmes commented 8 years ago

Thanks Ben. Indeed, I agree with "sequence variants", as it can then be connected to other literature such as the deepSNV package (SNV = Single Nucleotide Variant), although they also use the term "subclonal variant".

Susan

ggloor commented 8 years ago

By analogy to OTUs, what about Identical Sequence Units (ISUs)?

ngarud commented 8 years ago

Hello,

I have a related question -- in your example online, the taxa output is as follows:

     [,1]       [,2]            [,3]          [,4]
[1,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
[2,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
[3,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
[4,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
[5,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
[6,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
     [,5]                 [,6]
[1,] "Porphyromonadaceae" NA
[2,] "Porphyromonadaceae" NA
[3,] "Porphyromonadaceae" NA
[4,] "Porphyromonadaceae" "Barnesiella"
[5,] "Bacteroidaceae"     "Bacteroides"
[6,] "Porphyromonadaceae" "Barnesiella"

Would you say that there are two strains (or, sequence variants) of Barnesiella and potentially 3 strains of Porphyromonadaceae in the above output?

Also, can DADA2 be re-purposed for distinguishing strains/sequence variants at any random gene in the metagenome?

Thanks! Nandita

benjjneb commented 8 years ago

Yes, I would say from the above that there are 2 sequence variants of Barnesiella and 3 of Porphyromonadaceae in the sequenced community. Those may be (even are likely to be) different strains, but some bacteria have multiple distinguishable 16S sequences, which is something to be aware of. Mikhail Tikhonov's work has dealt with this issue in the context of high-resolution amplicon bioinformatics, if that's of particular interest (e.g. http://www.nature.com/ismej/journal/v9/n1/abs/ismej2014117a.html).
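For readers who want to reproduce that count, here is a minimal Python sketch. It assumes each row of the taxonomy table quoted above is one sequence variant; the helper name `deepest_label` and the use of `None` in place of R's `NA` are illustrative choices, not part of dada2.

```python
from collections import Counter

# One tuple per sequence variant, copied from the table quoted above
# (kingdom, phylum, class, order, family, genus); None stands in for NA.
PREFIX = ("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales")
taxa = [
    PREFIX + ("Porphyromonadaceae", None),
    PREFIX + ("Porphyromonadaceae", None),
    PREFIX + ("Porphyromonadaceae", None),
    PREFIX + ("Porphyromonadaceae", "Barnesiella"),
    PREFIX + ("Bacteroidaceae", "Bacteroides"),
    PREFIX + ("Porphyromonadaceae", "Barnesiella"),
]


def deepest_label(row):
    """Return the genus if assigned, otherwise the family with a flag."""
    family, genus = row[4], row[5]
    return genus if genus is not None else f"{family} (genus unassigned)"


# Count how many distinct sequence variants fall under each label.
variants_per_taxon = Counter(deepest_label(row) for row in taxa)
for label, count in variants_per_taxon.items():
    print(f"{label}: {count} sequence variant(s)")
```

The counts describe sequence variants only; as noted above, they do not by themselves establish how many strains are present.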

On your second question: Yes, the dada2 method is reference free and can be used on any* genetic locus. However, as currently implemented it is intended for amplicons, and using it on shotgun metagenomics sequences is not supported, although it is possible to do so on specific short genetic regions extracted from shotgun data.

* For technical implementation reasons, gene regions with extreme length variation are not well handled by dada2. By which I mean ITS. This will hopefully be remedied in the future.

Edit: The dada2 package as of version 1.4 handles variable length amplicons, and works (really well) on ITS data.

gregcaporaso commented 8 years ago

Just came across this question, and I've been thinking about the same thing (generally, for "dereplicated" rather than clustered sequence data).

In my opinion, these are still OTUs. OTU doesn't imply clustering, it's just been used to describe the results of clustering for a while now. I think it's better to push the community toward considering these as higher resolution (i.e., 100%) OTUs, rather than trying to use new terminology.

naarkhoo commented 8 years ago

Since you have benchmarked DADA2 against other methods, why don't you use the same terminology as others are using (OTU)?

benjjneb commented 8 years ago

The reason why different terminology might make sense is that there is a fundamental conceptual difference between what dada2 is doing and what OTU picking methods do.

dada2 is inferring the real sequence variants in a sample.

OTU pickers are clustering sequencing reads within some similarity threshold.

There are obvious analogies between the two: both methods turn sample-wise amplicon sequencing into a feature table that can be analyzed in much the same way downstream. But the conceptual differences have meaningful consequences when looking closer. For example, if dada2's error model says that read i is more likely to be produced by sequence j than sequence k, it will be assigned to sequence j even though it might be closer in Hamming distance to sequence k.
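A toy illustration of that last point, in Python. This is only a sketch under stated assumptions: the error rates are made up (transitions common, transversions rare), errors are treated as independent per base, and the score is simply abundance times the probability of producing the read. dada2's real error model is learned from the data and uses quality scores, so none of the names or numbers below come from the package.

```python
import math

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_rate(src, obs):
    """Hypothetical rates: transitions far more common than transversions."""
    return 1e-2 if (src, obs) in TRANSITIONS else 1e-5


def base_probability(src, obs):
    """Probability of observing `obs` when the true base is `src`."""
    if src != obs:
        return substitution_rate(src, obs)
    return 1.0 - sum(substitution_rate(src, o) for o in BASES if o != src)


def hamming(a, b):
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def log_score(read, source, abundance):
    """log(abundance * P(read | source)) under independent per-base errors."""
    return math.log(abundance) + sum(
        math.log(base_probability(s, r)) for s, r in zip(source, read)
    )


# Two candidate source sequences with their abundances (made-up values).
candidates = {"j": ("AAAAACCCCC", 1000), "k": ("GAAAAGCCCT", 50)}
read = "GAAAACCCCT"

best = max(candidates, key=lambda name: log_score(read, *candidates[name]))
for name, (seq, abundance) in candidates.items():
    print(name, "hamming =", hamming(read, seq),
          "log score =", round(log_score(read, seq, abundance), 2))
print("assigned to:", best)
```

The read is one mismatch from k but two from j. Because both mismatches against j are common error types and j is much more abundant, the model still scores j as the more likely source, which a pure distance threshold would not do.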

joey711 commented 8 years ago

@gregcaporaso @naarkhoo I'm leaning towards @benjjneb on this one.

The output from DADA2 is neither Operational nor Taxonomic. The conventional assumptions implied by the term OTU, both originally/formally and in common scientific use, simply do not apply. I suppose the key points are:

- DADA2 output sequences are intrinsically labeled, because they are the observed real amplicon sequence fragment, rather than a not-biologically-motivated cluster of sequences that might not belong together. OTUs only achieve this comparability property via a reference database, or by (re)clustering sequences from separate experiments.
- In many cases for 16S amplicons you have intragenomic variants that can/should be separate entries/features, and are so in DADA2 output. This is quite different from the common notion of OTUs -- where at best you've got a collection of sequences from the same species that behave ecologically similarly, and at worst... you don't. That makes DADA2 output appropriate for certain precision applications where OTU clustering simply is not. This distinction has some important translational implications for various clinical/diagnostic applications, etc.
- 100% OTUs would communicate to most users that the sequences were simply dereplicated (and therefore still contain a lot of false, sequencing-error-derived features) and/or merely processed through some agglomerative clustering algorithm with a near-zero inclusion radius. I find these very understandable interpretations of the term 100% OTU to be quite misleading for output from DADA2, and also categorically different from a technical point of view. Case in point, I can ask several common OTU clustering algorithms to construct OTUs with a 0% similarity radius, but the output in that case would be nearly useless, and not at all comparable to what DADA2 outputs.
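To make the dereplication point concrete, here is a small Python sketch with made-up reads. It assumes two true sequences plus five reads that each carry a single sequencing error; `dereplicate` is a hypothetical helper and not a function from any of the tools discussed here.

```python
from collections import Counter

# Hypothetical ground truth: two real sequence variants in the sample.
TRUE_SEQUENCES = {"ACGTACGTAC", "ACGTACGTTC"}

# Made-up reads: mostly exact copies, plus five reads with one error each.
reads = (
    ["ACGTACGTAC"] * 60
    + ["ACGTACGTTC"] * 35
    + ["ACGTACGAAC", "ACCTACGTAC", "ACGTACGTTT", "TCGTACGTTC", "ACGTAGGTAC"]
)


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundances."""
    return Counter(reads)


unique = dereplicate(reads)
spurious = [seq for seq in unique if seq not in TRUE_SEQUENCES]

print(len(unique), "unique sequences after dereplication")
print(len(spurious), "of them are error-derived singletons")
```

Dereplication alone keeps every error-containing read as its own feature, which is why a dereplicated "100% OTU" table is not the same thing as denoised output.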

Since this is entirely a semantic discussion, it is relevant that the overwhelming usage of the term OTU has become synonymous with a subset of clustering algorithms, and even a specific similarity radius. I for one am excited to go back to talking about sequence variants, strains, and species -- rather than some arbitrary rough approximation of these much older and broader biological notions.

My two cents.

lkursell commented 8 years ago

I hedge with "sequence features"

joey711 commented 8 years ago

@lkursell that's not really a hedge, since the suggestion was to still call these OTUs.

Sequence features != OTUs

gregcaporaso commented 8 years ago

> 100% OTUs would communicate to most users that the sequences were simply dereplicated

I sort of agree with that. Note that in any pipeline that I'm aware of, there is some quality control applied, whether it be the old 454 denoising approaches, the approach we presented in Bokulich et al. (2013) (which we all now agree is overly permissive), or DADA2. So I think when people hear, for example, 97% OTU, they assume that some sort of quality control has been applied. However, I do think that for DADA2 it would be good to use a name that implies that this type of pipeline has been applied. Denoise is the term that comes to mind for this.

I like sequence variant too, but in my view that's essentially synonymous with 100% OTU (though sequence variant has a nicer ring to it), so I feel like that has that same issue - it doesn't imply that DADA2's noise correction approaches have been applied. What about something like denoised sequence variants (or denoised 100% OTUs, but I seem to be on the losing end of that vote)?

> I can ask several common OTU clustering algorithms to construct OTUs with a 0% similarity radius, but the output in that case would be nearly useless

I think useless is a bit extreme. I've observed some nice results using vsearch to generate "100% OTUs" from QIIME-quality-filtered data with very clear clinical implications (paper is under review right now, and unfortunately I haven't been able to post a pre-print, but the bioinformatics protocol is here). Better quality filtering would undoubtedly improve the results, but based on some results I've generated these are more informative than 97% OTUs generated from the same data. I'm re-running this analysis with DADA2 now, so that might make it into the final paper.

Thanks for having this discussion publicly! I also just want to mention that I'm really excited about DADA2. I've been starting to use it in some of the analyses I'm involved with, and I've been impressed with the results so far. I'd really like to chat with you guys about working together to develop a QIIME 2 plugin around this (some QIIME 2 info here). I'll follow up on that separately.

joey711 commented 8 years ago

Thanks @gregcaporaso ! A plugin sounds very interesting. Glad to hear you're starting to use it and find clinically-relevant benefit. Been my experience as well thus far.

I agree my phrasing of useless for 100% OTUs is an oversimplification, since many pipelines (like UPARSE) throw in chimera filtering. On the other hand, I have the impression that the average user finds a 100% OTU table very challenging as a starting point because they're getting buried in 10-1000X spurious features, and even someone pretty savvy at defining post-clustering filters is only going to get so far relative to tools designed for denoising, like DADA2, MED, etc.

On the flip side, and I probably should have included this in my list above, I can envision settings in which the investigator performs OTU clustering on the DADA2 output. This is a perfectly reasonable thing to do if OTUs are good enough for your particular biological question (I've done this myself in some special cases w/ UPARSE after DADA2). Aside from an anecdotal use-case, I think this helps clarify the distinction, lest we arrive at some embarrassing semantic hell in which we're talking about double OTU clustering or 97% clustering of the 100% OTU clusters... or some worse incarnation I don't want to imagine.
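As a rough picture of what "OTU clustering on the DADA2 output" means, here is a Python sketch of abundance-ordered greedy clustering at 97% identity. It is a simplification under stated assumptions: the sequences are equal-length and already aligned, identity is the fraction of matching positions, and the sequences and abundances are made up. It is not UPARSE or any other published algorithm.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, aligned input."""
    if len(a) != len(b):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(abundances, threshold=0.97):
    """Map each centroid to its members, visiting variants by abundance."""
    clusters = {}
    for seq in sorted(abundances, key=abundances.get, reverse=True):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid is close enough: start a new OTU.
            clusters[seq] = [seq]
    return clusters


def mutate(seq, positions):
    """Substitute the base at each given position (illustrative helper)."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


base = "ACGT" * 25                      # 100 nt denoised variant
near = mutate(base, [10])               # 99% identical to base
far = mutate(base, range(0, 100, 10))   # 90% identical to base

denoised = {base: 500, near: 120, far: 80}  # variant -> abundance
otus = greedy_cluster(denoised)
print(len(denoised), "sequence variants grouped into", len(otus), "OTUs")
```

The input here is the set of denoised variants, and the OTUs are a second, deliberately coarser grouping on top of it, which is the distinction the paragraph above is drawing.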

Thanks for posting the protocols and your additional insight and positive comments. I think I like your suggestion of denoised sequence variant. On first thought it seems to convey the key details we want, and confers that all-important 3-letter acronym O:-) -- long live the DSV? ...

Hmm, some disambiguation may be needed from this other file format DSV...

dadahan commented 7 years ago

I'm coming across this post rather late, but in recent work I brooded over this for too long and opted to call them ISeVs, for inferred sequence variants. It rolls off the tongue nicely, won't get confused with our common understanding of OTUs, and conveys that "DADA2 infers sample sequences exactly".

joey711 commented 7 years ago

Thanks @dylandahan I'm glad we both agree that calling these OTUs is inappropriate. I'm a little confused how ISeVs rolls off the tongue (maybe a phonetic spelling would help clarify?). Why not ISVs? Or DSVs?

Perhaps we should have a crowdsource vote, and offer to post the winning title in the package documentation and future pubs?

dadahan commented 7 years ago

Hi @joey711 , thanks for the response. Well, I started with ISVs and that evolved into ISeVs and I imagined it colloquially pronounced as, "I-sevs". DSVs is great, too!

I agree that a crowdsource vote is the way to go, and I'd be glad to update a submitted pub with the winning title.

benjjneb commented 7 years ago

Relevant to this thread, Robert Edgar just released a preprint on his method to infer exact sequences, and in it he introduced a new terminology: "ZOTU", for zero-difference OTUs.

joey711 commented 7 years ago

Still not clear to me that minting a new acronym is appropriate at all. Definitely not one that builds on top of "OTUs", since these are not a taxonomic approximation, and not operationally defined...

benjjneb commented 7 years ago

I have to say, between "[ribosomal/denoised/NA] sequence variant", "oligotype", "[Z/100%] OTU", and other usages, the terminology here is a bit of a disaster at the moment.

I agree with the reasons expressed upthread on why 100%/Z OTUs is a misleading terminology, and have adopted the "* sequence variant" terminology in all of my current applied projects. But I think more input would be valuable, especially from key people like @meren and Robert Edgar (emailing him this thread).

gregcaporaso commented 7 years ago

When we taught using q2-dada2 in Iceland last week the students seemed to get the term denoised sequence variant. I'm still in favor of that.

joey711 commented 7 years ago

I agree. Consensus is important. Ideally we'd converge on a single agreed-upon term to represent the set of unique denoised amplicon sequences. I would also posit that this term should be as clear and intuitive as possible, or alternatives will continue to be proposed and defined over and over again. My preference would also be that it is not an acronym, since this tends to obfuscate the meaning with an added layer of complexity, and we're not the first sub-discipline to ever think about what to call a set of similar but unique sequences.

@gregcaporaso I like that phrase, too. Might be my top choice as well.

benjjneb commented 7 years ago

@gregcaporaso @joey711 I also think denoised sequence variant is an effective term.

Especially after experiencing multiple collaborators make the logical, but wrong, conclusion that "100%" or "exact" OTUs were a fancy way to say dereplicated reads.

rcedgar commented 7 years ago

(From Robert Edgar) My 5c++: Consistency is good, obfuscating jargon is bad. I use "denoising" even though I don't like the term, I think the simpler and more self-explanatory "error-correction" would be better. There are many good points in this thread. There is no perfect solution; I think ZOTU should be the default choice because it's the first published (or at least preprinted) proposal and is as good / bad as anything else suggested here. For sure, I would need a compelling reason to change it at this late stage, which seems unlikely. I had a long debate with myself not unlike this discussion, and I felt that ZOTUs was a winner. It implies that they are related to conventional NGS OTUs, but different. It suggests that you can do all the usual things with them (alpha and beta diversity, PCoA, taxonomy prediction etc.), which may not be obvious to the typical biologist. Plus, it's pronounceable, a bit quirky and might encourage busy biologists to find out more, while something like "denoised sequence variant" is off-puttingly technical -- it embeds the already obscure term "denoising".

spholmes commented 7 years ago

I believe that we should stick with the standard vocabulary used in many other fields of genomics and use "sequence variant". This is used in genetic contexts: in Bioconductor alone there are many packages that use this vocabulary (deepSNV, RareVariantVis, VariantAnnotation, VariantFiltering, ...), and as we move to multi-omics settings, it is advantageous to have a general cross-boundary understanding. It is also used in virology and many other areas of microbiology. I do not think it is worth separating ourselves from these other fields. In our recent Bioconductor workflow paper (https://f1000research.com/articles/5-1492/v1, published in June 2016) we use "Ribosomal Sequence Variants".

In the DADA2 Nature Methods paper (published in April, 2016) http://www.nature.com/nmeth/journal/v13/n7/full/nmeth.3869.html we just use the word "variants".

Whether one qualifies "sequence variant" with "denoised" or "strain" or another term may also depend on context, as in the case of HIV, cancer, etc.

Susan

meren commented 7 years ago

I don't see any microbial ecologist* here. I think you should consider including them.

That aside, my 2 cents agrees with the notion that going away from the term OTU may be beneficial for historical reasons to the field (rather than technical reasons at this point).

Denoised Sequence Variant sounds good to me, but Denoised Amplicon Sequence sounds even better, because the application of all these methods discussed here will be only relevant to amplicons. Denoised sequence variant suggests more than what it should. 'Sequences', 'noise', and 'variants' are not limited to marker gene amplicons to infer community structure.

2 cents.

Best,

* I mean, we all have become a little bit of microbial ecologists by this time, but I felt that it would be better to have some people who are coming from a microbiology background exclusively. I wanted to clarify this as I realized it might sound offensive to some.

gblanchard4 commented 7 years ago

I am not a huge fan of using any form of OTU.

Internally we have been calling them dadas for Dereplicated And Denoised Amplicon Sequences

joey711 commented 7 years ago

To quickly address Robert Edgar's ( @compsc1 ) false statement: "ZOTU should be the default choice because it's the first published (or at least preprinted) proposal"

It is demonstrably not the first, and therefore by that same logic, not the default. As many of us have been discussing on this thread (including @spholmes's response above), there are many proposed terms already in the published literature of our field, long preceding Edgar's very recent pre-print. This includes the DADA2 article that is referenced extensively by that same preprint. "Oligotype" is a term I've been hearing for a few years now to refer to this same thing, whether or not the methods used at the time were satisfactory. More importantly, if we're going to bother discussing it at all, our criteria should be to agree on the most informative, descriptive, least-confusing term that we're all willing to use; precedence be damned.

Along this line, I agree with @meren's proposed "Denoised Amplicon Sequence" as a good option, and that we want to avoid the historical legacy of "OTUs" as a feature, not a bug.

As for @gblanchard4's "dadas", while I'm very tickled by that term, my immediate concern is that it undermines itself because it sounds too much like DADA the algorithm. A good term for the sequences will be more general/abstract than the algorithm(s) used.

And finally, I agree that @compsc1's suggestion of "error-correction" might be slightly better than "denoising", though it isn't clear to me that the latter is all that confusing or technically-deep.

rcedgar commented 7 years ago

To Paul M's "false statement": we all make mistakes, but I prefer to believe that we're all working in good faith to do the best possible science here. There may be some confusion about exactly what we are talking about: (a) the true biological sequences in a sample or (b) the predicted sequence variants obtained by the highly biased and imperfect protocol of amplicon sequencing + denoising, which I call ZOTUs. Typical ZOTUs will have some uncorrected sequencing and PCR errors, will fail to detect some low-abundance valid variants, and will suffer other problems due to contaminants, cross-talk, primer mismatches etc. As far as I'm aware, previous papers have not used a separate term for (b) as opposed to (a) but of course I could easily have overlooked something. If the distinction between predictions and biological ground truth is not useful then a separate term for dadas/DSVs/ZOTUs is not needed and I would agree with Susan H. However, in my experience, many biologists are woefully ignorant of the limitations of 16S so I lean towards using a term that warns "prediction, caveat emptor".

lkursell commented 7 years ago

A few observations from this discussion:

(1) All of these new methods (dada2, oligotyping, unoise2) produce sequences, and the sequences themselves are reported as the features. Because of this, any association with OTUs should be removed, because these are not "operational" (which is vague), or "taxonomic" (no taxonomies of any kind are being indicated by the sequences), or "units" (because sequences aren't being compared to one another so much as compared to what their true biological sequence might have been).

(2) I think the case for using "error-corrected" over "denoised" is strong, because it removes any mention of a method from the resulting features, and "error-corrected" is far more universal. This reasoning could lead to comments from others such as "the user might not know what is happening", but honestly users can't blindly use programs; as @compsc1 said, "caveat emptor".

(3) I don't think that adding the term "variant" to sequence provides anything. A variant of what? The true sequence? The corrected sequence? A variant of the 16S amplicon?

(4) As a result, I think "error-corrected sequences" is about as general and descriptive as you can get. If an individual package would like to "brand" their particular set of "sequences + methods" as something else, then we cannot remove that option, but they would fall under the umbrella term "error-corrected sequences".

wdwvt1 commented 7 years ago

I feel like this thread lacks historical perspective; my understanding is that 'operational taxonomic units' arose in macroecology and were employed in bacterial ecology starting in the 1960s (see here and here). An OTU was a grouping of organisms with some level of connectedness, and that connectedness was originally measured on a whole host of phenotypic traits (substrate specificity, geographic range, ability to interbreed, etc.)

In this sense, OTU is a contextual term depending on similarity function/goal of the ecological analysis.

For instance, if I were interested in tracking microbial transport between members of a household, I would be interested in the exact sequences each had. The perfect unit of differentiation for this kind of ecological analysis would be something that allowed me to say unequivocally which microbe came from which household member. Here then, a difference of a single base pair in 16S would be meaningful, so I'd want my OTUs to differentiate between those things.

On the other hand, if I were interested in the butyrate production capacity of the household microbiota, the operational difference would be butyrate production capacity. In that case, the OTUs should reflect that difference; if two microbes have the same ability to produce butyrate, but have 3nt differences in their 16S, I'd want them in the same OTU.

In short, I am trying to say that phylogeny has become synonymous with taxonomy, and we should move away from that. OTUs used to be defined based on the relevant ecological characteristic. I feel like we should encourage movement back in that direction.

As such, the output sequences of DADA2 are OTUs for a very specific type of analysis (e.g. when 1nt differences are meaningful) but are otherwise not. We should try to retain OTU for grouping into the relevant unit of analysis, but not confuse the output denoised/error-corrected sequence with that relevant grouping.

elsherbini commented 7 years ago

Personally I prefer the word inferred over denoised or error-corrected. The inference is done based on a statistical model, with certain assumptions, and real sequence variants in a sample that are too low in abundance and too close to another sequence could be lost, while errors such as (n>2)-meras or early-round PCR issues could still be around. The word inferred correctly signals to a reader that these sequences have had some "stuff" done to them.

I like sequence variant for talking about a particular sequence. You could say "There were 3600 unique inferred sequences in the sample," but when talking about a particular one you could say "The most abundant inferred sequence variant accounted for 30% of the reads." It's a variant of the locus you amplified.

joey711 commented 7 years ago

@compsc1 please don't mistake my firm language beyond its literal meaning. I have a nearly-allergic reaction to overly-strong assertions of precedence, particularly when they're used to bolster a point beyond its merits. That's as much my fault as yours. I do assume you're working in good faith, and your comments and criticism are more than welcome here.

Along these lines and your later point, I would propose that your distinction between prediction and biological ground truth is not necessary, but a good point to bring up here. Going out of our way to maintain distinct nomenclature is more confusing. I do agree that the distinction between measurement and truth is often misunderstood, but it shouldn't be a foreign concept among scientists. The fact that our measurement (method + noise-correction) is going to have mistakes -- but a truth actually exists -- is made more clear when we refer to the truth and the measurement using the same word(s), in my opinion.

I tend to agree with @elsherbini that inferred sequences makes this point at least as well.

@wdwvt1 Thanks for that historical comment regarding OTUs. I think several of us agree that this is quite separate from OTUs. The notion of "truth" is also relevant here. What's a "true" OTU? The answer to that, to the extent you can define it, is very different than "What are the true sequences in this sample?". Of the many reasons this is a useful distinction, it is an objective basis on which to evaluate our methods -- and any OTU-based inference downstream is likely to benefit from improving our inference of the true sequences.

wdwvt1 commented 7 years ago

@joey711 - absolutely agree about better inference from better input data.

As far as the 'truth' of an OTU is concerned: I am arguing that people are deriding the term because they are ignoring the historical context. I don't want to throw the baby out with the bathwater; we should definitely not call DADA2 features OTUs by default, but that is not because an OTU as a concept is bad.

There isn't a 'truth' value to an OTU (except insofar as it maps from the realistic separation of biological units in the context of the question you are asking). It's a way to operationally group measured units into analyzable units. If this is your argument, I apologize for the additional post, I just wanted to really make sure I made this argument because I feel like realizing what OTUs are and are not for is far more likely to be a basis of confusion/bad science than the 3-letter acronym people use.

colinbrislawn commented 7 years ago

I notice that many of the terms discussed mirror the method used to make them.

Method                               What we call the result
MED (minimum entropy decomposition)  Oligotype, 'MED nodes'
dada, dada2                          dadas, DSV
unoise, unoise2                      ZOTU
swarm                                swarms?
greedy heuristic clustering          OTU
mothur's neighbor joining method     OTU
deblur                               DSV? deblurred reads?

The term for the result communicates the goal of the author along with the method they employed to pursue this goal.

While we don't have consensus on method, do we have a consensus on the goal?

Method    Stated goal
UNOISE2   provide the maximum possible biological resolution given the data
dada2     accurately reconstruct amplicon-sequenced communities at the highest resolution
swarm     properly delineates large OTUs (high recall), and can distinguish OTUs with as little as two differences between their centers (high precision).
MED       sensitive discrimination of closely related organisms ... without relying on extensive computational heuristics and user supervision

So... at least our goals are similar 😉

benjjneb commented 7 years ago

I think the goal is crucial. It is because the goal changed between OTU methods and the new methods that I don't like OTU-variant terms.

The developers of these other methods can correct me if I'm wrong, but I think DADA2, MED and UNOISE2 all have the same goal: report to the user the sequences in their sample.

Nothing operational. No arbitrary unit. The DNA sequences in the sample resolved to their limit.

As @lkursell and @elsherbini pointed out above, it is important to note that we are doing our best to reconstruct the real sequences through an inference or denoising or error-correction method, not reporting an oracle truth. Hence I like terms like inferred sequence or denoised sequence variant. I find those sorts of descriptors both understandable for those unfamiliar with methods in the field, and a more accurate description of the output of these methods for those assuming an OTU approach.

colinbrislawn commented 7 years ago

Hence I like terms like inferred sequence or denoised sequence variant.

Me too. I think 'DSV' rolls off the tongue at least as well as 'OTU,' and gives us the opportunity to discuss exactly how we are 'denoising' and detecting these 'sequence variants.'

While this discussion helps explore what we hope to find in the post-OTU world, I know that language is not prescriptive and you can't make people use your words.

I wonder if @pschloss and @mothur-westcott have plans to replace OTUs?

I wonder if @frederic-mahe would consider the centroid of a swarm to be a DSV? Are 'swarms' just 'denoised sequence variantS'?

Thank you all for participating in this discussion!

frederic-mahe commented 7 years ago

Thanks @colinbrislawn for including me in that interesting discussion.

You are right, we develop different approaches, but we aim for the same goal: get as much useful information as possible from metabarcoding data.

When I am asked to present swarm to biologists, I use the acronym "OTU" to describe swarm's output, as most people are familiar with it, but I don't really like it. If I have the time to go into details, I try to explain that swarm's job is to deal with microvariants, which are sequences at a short edit distance from a more abundant sequence. Discarding macrovariants (weird sequences, chimeras, contaminants, etc.) is performed later by other tools, as it requires additional data such as sequence occurrences throughout samples, chimera detection, taxonomic assignment, quality filtering, etc. In the Unix way, swarm is designed to do only one thing.
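
To make the "microvariant" definition above concrete, here is a minimal sketch in Python. It is not swarm's actual algorithm: it assumes an edit-distance threshold of one, and the function names (`within_one_edit`, `microvariants`) and the toy counts are hypothetical, chosen only for illustration.

```python
from collections import Counter


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Same length: the rest must match after skipping one substituted base.
        return a[i + 1:] == b[i + 1:]
    # b is one base longer: the rest must match after skipping that extra base.
    return a[i:] == b[i + 1:]


def microvariants(counts):
    """Map each sequence to the more abundant sequences it lies one edit away from."""
    flagged = {}
    for seq, n in counts.items():
        parents = [p for p, m in counts.items() if m > n and within_one_edit(seq, p)]
        if parents:
            flagged[seq] = parents
    return flagged


# Toy abundance table (illustrative values only).
counts = Counter({"ACGTACGT": 100, "ACGTACGA": 3, "ACGTCGT": 2, "TTTTACGT": 40})
flagged = microvariants(counts)
```

In this toy table the two rare sequences are flagged as microvariants of the most abundant one, while `TTTTACGT` is left alone because it is several edits away from anything more abundant.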

So, if I may add to the confusion, I see swarm's output as "clouds of microvariants". But if I had to define the final output of my pipeline, after micro- and macrovariant filtering, I think I would be happy with "denoised amplicons".

Personally I prefer the word inferred over denoised or error-corrected. The inference is done based on a statistical model, with certain assumptions, and real sequence variants in a sample that are too low in abundance and too close to another sequence could be lost, while errors such as (n>2)-meras or early-round PCR issues could still be around. The word inferred correctly signals to a reader that these sequences have had some "stuff" done to them.

@elsherbini, I disagree with using the term "inferred" in that context. In my opinion "inferred" suggests that sequences can be artificial, like consensus sequences. This is not the case: the different clustering/denoising processes pick certain sequences among a larger set of observed sequences. The key thing here is that all sequences are observations, we are merely deciding to consider only some observations as valid.

joey711 commented 7 years ago

Thanks @frederic-mahe for joining the discussion.

Just to make sure we maintain clarity in the thread, while I certainly agree that we share the general goal of (restated only slightly) "getting useful information from amplicon sequencing data" -- the goal that @benjjneb and myself and a few others defined earlier in the thread was much more specific than that, and clearly defined with an objective measure of performance. Perhaps most importantly for the discussion about nomenclature for methods that perform exact sequence inference, the goals of DADA2, MED, and UNOISE2 are fundamentally different from the goal of OTU clustering methods.

Please correct me if I'm wrong, but swarm is an OTU-clustering method, and you are not claiming that swarm attempts to infer exact sequences from amplicon sequence data. Therefore, I don't imagine that you feel the need for a term other than OTU to describe its output, since OTUs have always implied a sequence cluster since the term was first defined.

Your final statement / response to @elsherbini seems strange to me. Every measurement we ever make contains noise. We make inferences (whether explicit or implicit) from these noisy observations the moment we attempt to associate meaning. The term infer in this technical context (let's say, a verb "to make a well-defined inference about a parameter, value, or phenomenon") is a very general one, and very relevant to the problem described in this package and this thread: To infer -- from a set of sequences we know to be full of errors -- the set of true sequences and their abundance. This is not to say at all that the sequences are artificial. They are the data, the observations. The data would be artificial only if it were simulated or fabricated, and no one was suggesting that to be the case; rather, the suggestion was that the sequence data contains errors because our molecular processes and sequencing technology are not perfect. We don't have to pretend that we expect every sequencing read to be a perfectly accurate sequence, and I don't think that is what you were implying; but it does follow, given our precisely stated goal above, that @elsherbini, @benjjneb, and others are right to suggest that the language we use should reflect the fact that we're inferring the true sequences from imperfect measurements.

joey711 commented 7 years ago

Just stumbled across @meren 's page describing "oligotyping" and related articles. Thought it made sense to link here:

http://merenlab.org/software/oligotyping/

One of those papers goes back to 2012. For those concerned about precedence on terminology, @meren mentioned above he liked DSV (Denoised Sequence Variant) or DAS (Denoised Amplicon Sequence).

I'm fine with either. Why don't we take a vote on the different terminology that has been proposed? The term we'd be voting on would be what to call an amplicon sequence in your data that you believe to be real, irrespective of the algorithm that was used or its respective accuracy.

I'll link a Google Poll here if so. I tried some git-poll thingy but it was limited to thumbs-up/thumbs-down.

spholmes commented 7 years ago

I'm happy to participate, we've been using RSV in papers, happy to change to DSV or other as decided.

joey711 commented 7 years ago

Here is the poll. Please spread around, but perhaps also encourage people to review the discussion on this thread before voting:

https://goo.gl/forms/qGHu7cvBsHzu660O2

meren commented 7 years ago

I realized during an exchange with Ben that in the 2011 paper (in which the oligotypes were first described), there is this sentence right in the abstract:

We observed a high degree of low-level diversity among G. vaginalis sequences with a total of 46 unique sequence variants (oligotypes), and also found strong correlations of these oligotypes between sexual partners.

I guess I support 'SVs' a little more than other options, and although I didn't realize it before, I in fact supported it way earlier than I appreciated.

I still have the feeling that the term DSV says just a tiny bit more than what it should say. Because we are not necessarily 'denoising' the sequence variant itself. I mean, we are not changing the actual sequence content of the sequence variant, we keep it as it comes out of the sequencer, but we expand its reach by also counting the ones that diverge from it in erroneous ways when we explain its abundance in the data. For this reason, I feel DSV does not communicate the nature of the data as well as it could. On the other hand, 'ASV' (Amplicon Sequence Variant) as a term would carry the minimum amount of assumptions and the maximum amount of accuracy, while also reflecting the amplicon nature of the data (I apologize for going from DAS to ASV, I hope I am not causing more confusion than help).

Finally, I am not sure if voting can capture these subtle discussions. The history teaches us that when people vote without spending enough time to think things through carefully, the outcomes can sometimes be against their long term benefits.

Best,

colinbrislawn commented 7 years ago

I'm pretty sure @meren's work on oligotyping is the first / oldest in comp bio to move away from OTUs. It never caught on (I feel it's conceptually complicated compared to clustering). The follow-up method, MED, has been used more.

@joey711 Looking for old papers? Check out this 1999 paper which uses 'point mutations in five loci' to separate taxa with a brand new method they called 'oligotyping'

Old methods die hard. Getting people to use new methods is a heavy lift.

meren commented 7 years ago

Just to add to the history and the evolution of things, that old paper which uses "point mutations in five loci" is very similar to what Carl Woese did in 1985 to partition bacteria into their phyla:

Woese's 'oligonucleotide signatures' were characteristic enough to distinguish major bacterial phyla.

colinbrislawn commented 7 years ago

Because we are not necessarily 'denoising' the sequence variant itself. I mean, we are not changing the actual sequence content of the sequence variant, we keep it as it comes out of the sequencer, but we expand its reach by also counting the ones that diverge from it in erroneous ways when we explain its abundance in the data.

This is an excellent distinction! So each read is reported verbatim, while the representative group has been denoised. Each single sequence variant represents a group of denoised reads, just like an OTU centroid.
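
The distinction can be sketched in a few lines of Python: the reported sequence is kept exactly as observed, and only its count grows by absorbing reads that look like errors of it. This is a deliberately naive illustration, not how DADA2, MED, or UNOISE2 work (they rely on error models or entropy, not a fixed mismatch threshold). The function names, the `max_dist` parameter, and the toy counts are all assumptions made for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def pool_counts(counts, max_dist=1):
    """Greedily pool reads into more abundant, near-identical sequences.

    Sequences are visited from most to least abundant. A sequence within
    max_dist mismatches of an already accepted variant adds its reads to
    that variant's count; otherwise it becomes a new variant. The variant
    sequences themselves are never edited.
    """
    variants = {}
    for seq, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        for rep in variants:
            if len(rep) == len(seq) and hamming(rep, seq) <= max_dist:
                variants[rep] += n  # expand the variant's abundance only
                break
        else:
            variants[seq] = n  # no close, more abundant variant: keep as-is
    return variants


# Toy read counts (illustrative values only).
reads = {"ACGT": 50, "ACGA": 2, "ACCT": 1, "TTGT": 20}
pooled = pool_counts(reads)
```

Note that a fixed threshold like this would also absorb a genuine rare variant that happens to sit one mismatch from an abundant one, which is the kind of loss raised earlier in the thread.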

Finally, I am not sure if voting can capture these subtle discussions. The history teaches us that when people vote without spending enough time to think things through carefully, the outcomes can sometimes be against their long term benefits.

Agreed. Voting often overlooks the implications and can easily become a proxy for identity (people vote for their own group). Plus, we can't enforce the result; Robert Edgar is going to keep saying ZOTU.

joey711 commented 7 years ago

@colinbrislawn thanks for the old paper! Here's my interpretation: "oligotype" and "OTU" both are terms that imply a certain biological/taxonomic meaning about the feature (e.g. the "type" portion of oligotype, and prior usage). Whereas, we are looking here for agreement about the technical term that refers to the exact-sequence features, irrespective of downstream interpretation. A good example is the application of amplicon sequencing in non-taxonomic contexts, like synthetic DNA templates, rare cancer or HIV variants, etc. Without checking the literature more formally, I suspect that across DNA-science fields, "sequence variants" easily outperforms "oligotype" in usage and recognizability. If I'm wrong, then I would probably switch my support to oligotype.

I agree with you and @meren that voting is not necessarily a great approach, but I was hoping it might reveal some beginnings of consensus that would be useful. In a sense, though, published articles referring to this type of sequence feature / meaning are a more appropriate poll (but a lot more work for me to aggregate that data).

meren commented 7 years ago

Without checking the literature more formally, I suspect that across DNA-science fields, "sequence variants" easily outperforms "oligotype" in usage and recognizability.

I fully agree.

I would say oligotyping is only one of the methods that give access to amplicon sequence variants. The method differs from other methods since it relies on Shannon entropy calculations as a computational extension of Woese's 1985 approach blah blah. But the product does not need to be called an "oligotype". The more I think about this thanks to this discussion, the more I feel comfortable calling oligotypes amplicon sequence variants.
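
For readers unfamiliar with the Shannon entropy idea mentioned here, the following is a minimal Python sketch of per-position entropy over aligned reads. It only illustrates the calculation and is not the oligotyping pipeline's implementation; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column) -> float:
    """Shannon entropy (in bits) of the bases observed at one alignment position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(aligned_reads):
    """Entropy at every position of a set of equal-length, aligned reads."""
    return [column_entropy(col) for col in zip(*aligned_reads)]


# Toy alignment: the reads differ only at the third position.
reads = ["ACGT", "ACGT", "ACTT", "ACTT"]
profile = entropy_profile(reads)
```

Positions where all reads agree have zero entropy, while the third position, split evenly between two bases, has an entropy of one bit. High-entropy positions are the ones that carry information for separating closely related sequences.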

benjjneb commented 7 years ago

Perhaps of interest to this discussion, we have posted a new preprint "Exact sequence variants should replace operational taxonomic units in marker gene data analysis" on the basis of reproducibility and comprehensiveness: http://biorxiv.org/content/early/2017/03/07/113597

Comments and criticisms welcome! We haven't submitted to a journal just yet, so now is the time to let us know what I got wrong or forgot to acknowledge.

On the nomenclature issue, we went with plain-old "sequence variants (SVs)" in this preprint. I believe there are multiple right answers here, and that none of those right answers has "OTU" in it, so we went with the plainest terminology for now.

tseemann commented 7 years ago

I'm very late into this conversation, but I wanted to ask how polyploid bacteria fit into this framework?

For example, Neisseria species often have 2-5 copies of the chromosome within each cell (*). Each copy is (a little bit) different, and different allele proportions affect its antibacterial resistance etc, shown in 23S (and 16S I believe). The chromosomes recombine within the cell etc, used for antigenic variation.

How does this wash out in OTUs, ZOTUs, dadas etc?

I apologise if this question is a bit naive.

(*) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1470461/
