benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 Running Isolate and Environmental Samples Together #851

Closed: tucker4 closed this issue 5 years ago

tucker4 commented 5 years ago

Recently our lab conducted an experiment where we isolated strains of marine bacteria from seawater inoculum. We sequenced these cultures, a sub-sample of the original seawater inoculum, and additional environmental seawater samples from a time-series of the same location on a single MiSeq amplicon run using the 16S V4-V5 primers.

I used DADA2 to denoise our sequences and conduct preliminary analyses. One concern we have had in this process is that we are using DADA2 to evaluate sequencing error across very different types of samples: from very low diversity (1-10 ASVs expected, but with high read abundance) to high diversity (thousands to tens of thousands of ASVs expected, with variable read abundance). It looks like the Extreme Mock Community examined in the DADA2 paper could be a somewhat close, although conservative, example of our dataset. Do you foresee DADA2 having any trouble with our dataset, and should we remove the culture samples from the dataset before running the denoising step? Or are there particular parameters (e.g. OMEGA_A) you would recommend we change to help DADA2 handle this variation in expected ASV diversity, and if so, how should we change them?

Any guidance would be greatly appreciated! Thank you!

benjjneb commented 5 years ago

The most important question is: Were all samples processed in identical fashion? In particular, were the PCR, library preparation and sequencing performed identically across all samples?

If the answer to that is yes, you should be fine to process the samples together even if they have very different underlying community distribution. Fundamentally, DADA2 is modeling the error process and as long as the error process is consistent it should work well across different sample types.

The one practical consideration we have noticed is that estimation of the error model is easier on low-medium diversity samples than on very high diversity samples, so that may be something to consider (i.e. running learnErrors on the lower diversity samples).
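
A minimal sketch of that suggestion in R, assuming the filtered forward reads are already listed in a character vector `filtFs` and that a logical vector `is_isolate` marks the low-diversity culture samples. Both names are hypothetical and not from this thread. `learnErrors`, `plotErrors` and `dada` are functions of the dada2 package; the `multithread = TRUE` setting is an illustrative choice.

```r
library(dada2)

# Assumed inputs (hypothetical names, not from this thread):
#   filtFs     - character vector of filtered forward-read FASTQ paths, one per sample
#   is_isolate - logical vector, TRUE for the low-diversity isolate samples

# Learn the error model from the low-diversity samples only.
errF <- learnErrors(filtFs[is_isolate], multithread = TRUE)

# Visual check of the fitted error rates against the observed ones.
plotErrors(errF, nominalQ = TRUE)

# Denoise every sample (isolates and environmental) with that shared error model.
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
```

The reverse reads would be handled the same way. Sharing one error model across sample types like this relies on the condition above: identical PCR, library preparation and sequencing for all samples.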

tucker4 commented 5 years ago

Thank you for your helpful response, I will look into learnErrors.

For the most part, the samples were prepared in an identical fashion. The only difference was that the isolates were normalized into one pool (at 240 ng per sample) and the environmental samples were normalized into another pool (also at 240 ng per sample), and then we combined the two pools so that the environmental samples made up a higher fraction of the final library than the isolates. The goal of this was to increase the number of sequences for environmental samples, given their high diversity, and decrease the number of sequences for the isolates, given their low diversity. Does that make sense? Is this something that could have an influence on denoising?

Thank you very much for your input! Sarah

benjjneb commented 5 years ago

> The only difference was that the isolates were normalized into one pool (at 240 ng per sample) and the environmental samples were normalized into another pool (also at 240 ng per sample), and then we combined the two pools so that the environmental samples made up a higher fraction of the final library than the isolates. The goal of this was to increase the number of sequences for environmental samples, given their high diversity, and decrease the number of sequences for the isolates, given their low diversity.

Nope, you should be fine. The "noise" is added by the PCR and sequencing steps, not the normalization step.