benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Rarefaction curves and richness estimation from amplicon data #978

Closed manterd closed 4 years ago

manterd commented 4 years ago

I am aware of the many caveats of trying to estimate OTU richness with molecular microbial data; however, I am curious whether the error correction of dada2 removes so many of the low-abundance reads that rarefaction (or any technique for assessing alpha diversity) becomes impossible.

The attached rarefaction curves are for a soil sample and a Zymo mock community analyzed using the dada2 pipeline (Zymo ~ 12 OTUs): [image: Mocks_dada2 rarefaction]

The same two samples analyzed using the Mothur pipeline (Zymo ~ 198 OTUs): [image: Mocks_mothur rarefaction]

As has been shown before, the dada2 pipeline better predicts the richness in the mock community and shows more than an order of magnitude less richness than Mothur. Regardless of the sequencing depth of a soil sample (5,000 to 100,000 reads), I see a similar pattern in which the rarefaction curves reach an asymptote. All of this raises interesting questions regarding diversity estimates and how we should even treat the rare taxa in our libraries (e.g., McMurdie and Holmes 2014).

Might as well pick on all approaches here too. Even the breakaway and breakaway_nof1 statistics appear to greatly inflate OTU richness regardless of the analytical pipeline.

Anyone else see a similar problem in their data or care to discuss? I am happy to share additional data and thoughts.

Dan Manter USDA-ARS daniel.manter@usda.gov

benjjneb commented 4 years ago

I'm happy for other people to weigh in on this.

Briefly, my thoughts on richness estimation from amplicon sequencing data: don't. There is still no method that I think actually captures all the sources of error that exist in identifying rare novel taxa from amplicon data, and those rare taxa are what drive richness estimates.

That said, alpha-diversity metrics like Shannon, Simpson, etc. are mostly fine, because they are not driven primarily by extremely rare taxa that appear in just 1-3 reads. Even there, the actual values are not reliable, due to biases in how well different taxa are detected by sequencing methods, but comparisons between samples within a study will have some amount of correlation with the truth. That is, a statement like "this set of samples has higher measured alpha-diversity than this other set of samples" will correspond to the truth more often than not.
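A minimal sketch of why rare taxa drive richness but barely move Shannon or Simpson. The counts are made up for illustration (10 abundant taxa at 1000 reads each, with or without 50 singletons), and the helper functions are hypothetical, not part of dada2 or any diversity package.

```python
import math

def richness(counts):
    """Observed richness: number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical sample: illustrative numbers only, not real data.
without_singletons = [1000] * 10
with_singletons = without_singletons + [1] * 50

for label, counts in [("without singletons", without_singletons),
                      ("with singletons", with_singletons)]:
    print(f"{label}: richness={richness(counts)}, "
          f"shannon={shannon(counts):.4f}, simpson={simpson(counts):.4f}")
```

In this toy case the 50 singletons raise observed richness sixfold, while Shannon and Simpson shift only slightly, because each singleton contributes a proportion of roughly 1 in 10,000 to those sums.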

Andreas-Bio commented 4 years ago

I am comparing samples of different sequencing depth and I need to show they have all been sampled sufficiently. Is it even possible to do a rarefaction curve in dada2? I don't think so, because the singletons are all kicked out by default. If a sample has a lot of singletons, that is a strong indicator that more sampling would reveal more information, no? But if I take the same sample after dada2, it looks perfectly fine because all the singletons are gone: my rarefaction curve will be very flat, while before dada2 it would be rising almost linearly.

benjjneb commented 4 years ago

Rarefaction curves are essentially a way to try to estimate richness (i.e. the total number of taxa in the sampled community). An absence of singletons does violate the assumptions behind rarefaction curve extrapolation methods, but so does the presence of uncorrected errors and mis-annotated "novel" taxa (which, for example, can entirely explain a "linear" rarefaction curve prior to dada2).
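A minimal sketch of how spurious singletons alone can produce a near-linear rarefaction curve. It uses the standard expected-richness formula for subsampling without replacement; the counts are made up (10 real taxa at 1000 reads each, plus 200 hypothetical error-derived singletons), and `expected_richness` is an illustrative helper, not a dada2 function.

```python
from math import comb

def expected_richness(counts, depth):
    """Expected number of taxa seen when drawing `depth` reads without replacement.

    For a taxon with c reads out of N total, the chance of missing it entirely
    is C(N - c, depth) / C(N, depth).
    """
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    denom = comb(total, depth)
    return sum(1 - comb(total - c, depth) / denom for c in counts if c > 0)

# Hypothetical sample: illustrative numbers only, not real data.
real_taxa = [1000] * 10
with_errors = real_taxa + [1] * 200  # assumed uncorrected sequencing errors

for depth in (1020, 2040, 5100, 10000):
    clean = expected_richness(real_taxa, depth)
    noisy = expected_richness(with_errors, depth)
    print(f"depth={depth}: clean={clean:.1f}, with errors={noisy:.1f}")
```

Each singleton is recovered with probability depth / total, so the error-laden curve keeps climbing in proportion to depth, while the curve for the 10 real taxa is already flat at the shallowest depth shown. A rising curve on its own therefore does not distinguish undersampled real diversity from uncorrected errors.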