benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

DNA concentrations and two-step PCR #84

Closed. smonteux closed this issue 3 years ago.

smonteux commented 4 years ago

Hi and thanks for a great package,

I am wondering which DNA concentrations should ideally be used in the frequency-based approach. Using qPCR data seems fairly similar to using the concentrations of the initial DNA extracts, but either could be quite distinct from the concentration of the PCR template actually used (because of varying dilution factors between samples to reduce inhibition) or from post-PCR concentrations. I assume that comparing these different concentrations could help distinguish contaminants introduced during sampling/extraction from those introduced by e.g. PCR reagents. So I thought that, ideally, OTUs identified using any of those DNA concentrations as input should be removed, in case different OTUs are identified with different concentration measurements. Have you perhaps looked into such differences since the package was released?
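(If measurements from more than one stage are available, one way to check this empirically would be to run the frequency method separately with each concentration vector and compare the calls. A minimal sketch in R, assuming hypothetical objects `seqtab` (samples-by-features count matrix) and per-sample concentration vectors `conc_extract` and `conc_qpcr` measured at two different stages; nothing here is from decontam's documentation beyond `isContaminant()` itself.)

```r
library(decontam)

## Hypothetical inputs: `seqtab` is a samples-x-features count matrix,
## `conc_extract` and `conc_qpcr` are per-sample DNA concentrations
## measured at two different stages of library preparation.
contam_extract <- isContaminant(seqtab, conc = conc_extract, method = "frequency")
contam_qpcr    <- isContaminant(seqtab, conc = conc_qpcr, method = "frequency")

## Cross-tabulate the calls and take the conservative union of flagged OTUs
table(extract = contam_extract$contaminant, qpcr = contam_qpcr$contaminant)
flagged_either <- contam_extract$contaminant | contam_qpcr$contaminant
```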

I was also wondering how to proceed with two-step PCR. DNA concentrations after the first PCR may be relevant, but since the amplicons are normalized before the indexing PCR, those concentrations would be quite distinct from what ends up being sequenced. The concentrations after the indexing PCR might be more relevant, but since they should be roughly identical across samples, they may not be very informative about contaminants. Have you by any chance explored that topic?

Thanks for the input!

benjjneb commented 4 years ago

At this point, the relative efficacy for decontam of DNA concentration measurements taken at different stages (e.g. qPCR vs. DNA concentrations measured prior to equimolar pooling of samples in the sequencing library) has not been rigorously evaluated. Less rigorously, our observation is that concentration measurements at different stages all seem to work, at least for the most abundant and most problematic contaminants, provided they are taken before any attempt to normalize DNA across samples by sample-specific dilution/enrichment.

We think the reason for this is two-fold. First, although far from perfect, the relationship between total input DNA and DNA at various stages of the library construction process does seem to be preserved to some degree, prior to normalization for equal sampling depths. See here for more on that: https://www.biorxiv.org/content/10.1101/2020.02.03.932301v2 Second, the concentration-based signature of major reagent contaminants can be so strong that even large deviations from linearity in the concentration measurements are tolerated by our classification procedure.
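(That concentration-based signature can also be inspected visually for individual features. A minimal sketch, assuming placeholder objects `seqtab` and `conc` as above; `plot_frequency()` is the decontam plotting helper, used here only to illustrate the expected inverse relationship.)

```r
library(decontam)

## Placeholders: `seqtab` is a samples-x-features count matrix,
## `conc` is the per-sample DNA concentration vector.
contam <- isContaminant(seqtab, conc = conc, method = "frequency")

## Plot frequency vs. concentration for the most confidently flagged feature;
## reagent contaminants should roughly track 1/concentration.
top_hit <- rownames(contam)[which.min(contam$p)]
plot_frequency(seqtab, top_hit, conc = conc)
```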

Hope that helps! You are asking excellent questions.

smonteux commented 3 years ago

Thanks Ben, that's indeed helpful! I haven't tried the frequency method yet, as the dataset I have now is a bit of a mix of Qubit and Nanodrop measurements, but I did a quick comparison of the prevalence method at a couple of different thresholds against the method I had been using until now, which was basically picking OTUs making up more than 5% of the reads of any control sample.

(attached figure: contaminants_prevalence)

I had only 4 negative controls and nothing was present in more than 2 of them. The prevalence-based method picked up some additional likely contaminants towards the bottom, but not the top 2 OTUs, which occur in many samples and are more likely cross-contamination. Since they all amount to very few reads and I can't check with the frequency-based method, I'll just be conservative and throw them all away, but I thought I'd share the results anyway; they might be useful to someone.
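(For reference, a rough sketch of this kind of comparison in R, with `seqtab` (samples-by-features count matrix) and `is.neg` (logical vector marking negative controls) as hypothetical placeholders; the 5%-of-any-control rule is the ad hoc filter described above, not a decontam function.)

```r
library(decontam)

## Placeholders: `seqtab` is a samples-x-features count matrix,
## `is.neg` is a logical vector marking the negative-control samples.
prev_default <- isContaminant(seqtab, neg = is.neg, method = "prevalence", threshold = 0.1)
prev_strict  <- isContaminant(seqtab, neg = is.neg, method = "prevalence", threshold = 0.5)

## Ad hoc rule: flag features exceeding 5% relative abundance in any control
ctrl <- seqtab[is.neg, , drop = FALSE]
ctrl_relab <- sweep(ctrl, 1, rowSums(ctrl), "/")
manual_5pct <- apply(ctrl_relab, 2, max) > 0.05

## Compare the two sets of calls
table(prevalence = prev_strict$contaminant, manual = manual_5pct)
```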

Anyway, thanks again for your insights and for the package, it's very user-friendly!