benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
464 stars 142 forks source link

Singletons #92

Closed MartonSzoboszlay closed 8 years ago

MartonSzoboszlay commented 8 years ago

Hello, this may be a rather naive question, but I don't really understand how singletons are handled by DADA2. The DADA2 manuscript states that singletons can't form their own partitions, still the final output of the DADA2 pipeline (default settings following the tutorial) from one of my datasets contained one sequence variant with only a single sequence across all samples. How is this possible? The DADA2 manuscript draws a parallel between UPARSE and DADA2 in regard of how singletons are handled. In the UPARSE pipeline singletons are removed before forming OTUs, so they won't seed new OTUs, but then allowed to bin to OTUs if they show enough sequence similarity. Those singletons however, that can't join any OTU are deleted. In DADA2 singletons can't form their own partitions, but what happens with those that don't fit well into any of the partitions? Are they removed, or are they left in the partition that is the best, yet sub-optimal fit? Thanks for the help!

MartonSzoboszlay commented 8 years ago

An other interesting feature of the DADA2 output I don't understand is that there are no sequence variants that would have a single sequence in one sample. Why doesn't the DADA2 pipeline allow the formation of such sequence variants? In other words, in the sequence table, there are no cells with '1' (except from that one singleton I found in the data set mentioned above).

benjjneb commented 8 years ago

Hi Marton, To answer your first question, singletons cannot appear in the raw output of the dada(...) function, but can appear after merging, if for example the reverse read of one member of a doubleton was misassigned due to low quality. This is relatively rare, but possible.

Your second question is an important point that, unfortunately in retrospect, was not adequately covered in our manuscript. The issue here is whether or not you pool your samples for sample inference, in which case per-sample singletons are called but not whole-study singletons, or if you process them independently, in which case no per-sample singletons are called.

There is some discussion of this on our website:

De novo OTU methods must pool samples before processing them, as without pooling the labels between samples are not consistent and cannot be compared, i.e. OTU1 in sample 1 and OTU1 sample 2 won’t be the same. DADA2 resolves sequences exactly, and because exact sequences are consistent labels samples can be processed independenlty (and this is the default behavior).

Independent sample processing has two major advantages: Computation time is linear in the number of samples, and memory requirements are flat with the number of samples. However, pooling allows information to be shared across samples, which makes it easier to resolve rare variants that were seen just once or twice in one sample but many times across samples. Pooled sample inference is also supported by calling dada(..., pool=TRUE).

See more here: http://benjjneb.github.io/dada2/pool.html

Happy to answer further questions on this, as the documentation needs some improvement on this issue.

MartonSzoboszlay commented 8 years ago

Hi Ben, thank you very much for the explanation! From diverse communities I think we can expect quite a few true singletons in the dataset unless sequencing was extremely deep. Do I understand it right that because DADA2 is unable to resolve these true singletons, they will be left binned into other sequence variants, most likely the rare ones, like doubletons? If this is so, can we trust the rare sequence variants in the DADA2 output, or would it be a better practice to delete them, cause most of them are likely messy? Pooling may get around this problem for the 'sample-wise' true singletons, but not for the 'dataset-wise' true singletons. Considering that we often can't afford to work with a large number of replicates, dataset-wise true singletons may also not be very rare.

benjjneb commented 8 years ago

Filtering based on overall frequency is always a valid option, no matter what bioinformatic method you are using, because the rarest variants are the hardest to call.

DADA2 does not call singletons because of how difficult it is to robustly distinguish between real singletons and singleton errors. As you've pointed out that will leave real singleton reads clustered in with another real sequence variant. If forward and reverse reads of that singleton are clustered in the same way it will result in an extra count for that other variant. This effect is typically minor, although worth being aware of if you are interested in precisely quantifying the rarest variants. (If this is of particular interest to you, the dada(...) diagnostics can be used to address this issue, although it will require some non-trivial R coding to implement).

On this topic, I would caution against some of the conventional wisdom on the numerical importance of rare variants. Some microbial communities are quite diverse, but the high FP rate of standard OTU methods has given a broadly misleading impression as to the numbers of rare variants. See for example the thousands of OTUs output by standard methods applied to mock communities with tens of strains (http://dx.doi.org/10.1128/mSystems.00003-15). Those thousands of spurious OTUs are almost all miscalled rare variants, and miscalled rare variants are just as prevalent when sequencing real communities as they are in these mocks.

This is starting to lead me into rant territory on the terribly misleading estimates of richness that are constantly published from amplicon data, so I'll stop here!

MartonSzoboszlay commented 8 years ago

I guess I forgot about how clustering the forward and reverse reads separately influence all this. So, with paired-end data the real singletons don't bother a lot. I agree. Richness metrics that are useful in other fields of ecology often don't work well for NGS datasets, and inference from rare OTUs come with big uncertainty. Still, it is good to know how they are handled, so thanks for explaining!