benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Increased Spurious ASVs in Deep Sequencing of Low Diversity Samples? #546

Closed: dcat4 closed this issue 6 years ago

dcat4 commented 6 years ago

Hi there,

We've been analyzing a batch of 18S mock communities with DADA2. We assembled the mock communities by mixing 22 unique, cloned, full-length 18S amplicons (Sanger sequenced before mixing) in known proportions. We then amplified a smaller hypervariable region with a standard dual-indexing PCR and clean-up protocol and sequenced on an Illumina MiSeq. Each mock community was amplified and sequenced in technical triplicate.

After processing with DADA2, we recovered 22 ASVs that exactly matched the full-length Sanger sequences of our 22 target amplicons in all mock communities and almost all technical replicates. However, across all the mock communities we also found 78 ASVs that were > 97% similar to at least one of the 22 target ASVs. All of these "non-target" ASVs were at extremely low read abundance (< 200 reads) and low relative read abundance in each sample (< 1e-4), and most were found in only one technical replicate of a single mock community. Since we used cloned amplicons rather than genomic DNA, we wouldn't expect much variation in the sequence identity of the 22 target ASVs.

My question is then, is it reasonable to suspect we're detecting so many of these spurious/non-target ASVs because we're deeply sequencing (36,000 to >200,000 reads per technical replicate passed through DADA2 for these samples) nearly 30 replicates of the same 22 ASVs on the same MiSeq run? My understanding is that ASV inference is partially dependent on absolute read abundance, and I'd expect the odds of producing a PCR/sequencing artifact at high read abundance increase pretty quickly when you sequence the same 22 ASVs that many times. If that seems like the culprit, do you have any recommendations for parameter values we could change (maybe Omega_A?) to help mitigate this problem?

Any advice would be much appreciated!

benjjneb commented 6 years ago

Most commonly inflation in detected variants over the expected number in mocks is due to contamination, sometimes also unrecognized intragenomic variation. From what you describe in your first paragraph it seems you have at least some evidence that this is not the case, but I would still keep this as a strong possibility.

Another possibility is that these are undetected chimeras.

My question is then, is it reasonable to suspect we're detecting so many of these spurious/non-target ASVs because we're deeply sequencing (36,000 to >200,000 reads per technical replicate passed through DADA2 for these samples) nearly 30 replicates of the same 22 ASVs on the same MiSeq run?

This is reasonable. The phenomenon here is that false positives due to model misspecification are more likely as sequencing depth increases. One thing I have seen is that a single sample might contribute most (or all) of such FPs, perhaps because the PCR step for that sample was wonky or otherwise different.

do you have any recommendations for parameter values we could change (maybe Omega_A?) to help mitigate this problem?

First, I would interrogate the dada$clustering data.frame entries corresponding to the FPs. This has a fair amount of diagnostic information that will be very useful in identifying potential issues. Second, I would systematically determine the number of technical replicates in which each non-target ASV appeared. Are they all in just a few?
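Both diagnostics can be scripted. A minimal R sketch, assuming `dada_list` is a list of dada-class objects (one per sample), `seqtab` is the merged sequence table from `makeSequenceTable`, and `target_seqs` is a character vector of the 22 expected sequences; all three names are placeholders, not objects from this thread:

```r
# Hypothetical sketch: dada_list, seqtab, and target_seqs are placeholder
# names for your own per-sample dada objects, merged sequence table, and
# expected mock-community sequences.

# 1. Pull the per-ASV diagnostic table for one sample. The clustering
#    data.frame includes columns such as abundance, birth_pval, and
#    birth_fold that describe why each new partition was formed.
clust <- dada_list[[1]]$clustering
fp_rows <- clust[!(clust$sequence %in% target_seqs), ]
fp_rows[, c("abundance", "birth_pval", "birth_fold")]

# 2. Count how many technical replicates each non-target ASV appears in.
is_target <- colnames(seqtab) %in% target_seqs
prevalence <- colSums(seqtab[, !is_target, drop = FALSE] > 0)
table(prevalence)  # are most spurious ASVs confined to a single replicate?
```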

There are several approaches that can then be taken to mitigate such FPs. Increasing the stringency of OMEGA_A is one. The MIN_FOLD approach is another (see ?setDadaOpt). A simple minimum abundance screen can work. Finally, I might consider inflating the fitted error model, as this is a way to prevent model-misspecification based FPs at any read depth (e.g. err2 <- inflateErr(err, 2)).
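The four mitigation options above might be applied as follows; this is an illustrative sketch, and the specific threshold values (1e-60, 2, 200) are arbitrary examples, not recommendations:

```r
# Hypothetical sketch; all threshold values are illustrative only.
library(dada2)

# Stricter p-value threshold for forming new partitions (default 1e-40):
setDadaOpt(OMEGA_A = 1e-60)

# Require candidate ASVs to exceed their error-model expected abundance
# by a minimum fold factor (see ?setDadaOpt):
setDadaOpt(MIN_FOLD = 2)

# Inflate the fitted error model to guard against misspecification-driven
# FPs at any read depth; err comes from learnErrors():
err2 <- inflateErr(err, 2)
dd <- dada(derep, err = err2)

# Simple post-hoc minimum-abundance screen on the sequence table:
seqtab_screened <- seqtab[, colSums(seqtab) >= 200, drop = FALSE]
```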

dcat4 commented 6 years ago

Thanks so much for the quick and detailed response. I really appreciate the work you all put into making dada2 easy to use and troubleshoot.

It appears you're correct about a single wonky PCR: a single technical replicate was responsible for 52 of the FPs. We'll keep your other recommendations in mind as we sort out what's going on with the remaining FPs/contaminants. Thanks again!

serine commented 5 years ago

Comment has been moved into a separate issue: #836

benjjneb commented 5 years ago

@serine Thanks for the detailed post.

Could you repost this as its own issue, though? I think it deserves it, and it will keep me from losing track of it, as might happen when it's added onto an old closed thread.