Which specimens to include in analysis

tellafiela commented 5 years ago

We have used the decontam package to identify contaminants in our nasopharyngeal and induced sputum dataset (V4, Illumina MiSeq). Since these are generally low-biomass specimens, we have three questions:

1) We only included specimens with higher biomass (specimens that did not cluster with negative controls) to represent biological specimens. Is this acceptable?

2) We have measured 16S rRNA gene concentrations (ng/ul) and 16S rRNA gene copy numbers (copies/ul) for both biological specimens and negative controls, from template prior to amplification. Which would you recommend using as the input value?

3) Since this is a validation dataset, we included duplicates or triplicates of each of the biological specimens (template amplified and sequenced more than once). Some biological specimens had higher biomass than others, but all of these were considered "biological representatives" as they did not cluster with negative controls. Would the inclusion of duplicates/triplicates interfere with or bias our results from the decontam package in any way? Should we rather include only one representative of each biological specimen?

Thanks in advance!

benjjneb commented 5 years ago

We only included specimens with higher biomass (specimens that did not cluster with negative controls) to represent biological specimens. Is this acceptable?

Are you using the "prevalence" method? If so, that is OK. If you are using the "frequency" method, you should keep the full range of samples.
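A rough illustration of why the frequency method wants the full biomass range: it relies on a contaminant's frequency falling as total DNA concentration rises, so it has to tell a contaminant-like trend (log frequency dropping with slope -1 against log concentration) apart from a flat one. The Python sketch below is not decontam's code. The concentrations, frequencies, noise factors, and the function name `model_rss` are all made up for illustration.

```python
import math

def model_rss(freqs, concs):
    """Residual sums of squares for two competing log-log models."""
    log_f = [math.log(f) for f in freqs]
    log_c = [math.log(c) for c in concs]
    n = len(log_f)
    # Contaminant-like model: log(freq) = intercept - log(conc), slope fixed at -1.
    intercept = sum(f + c for f, c in zip(log_f, log_c)) / n
    rss_contam = sum((f - (intercept - c)) ** 2 for f, c in zip(log_f, log_c))
    # Flat model: log(freq) = constant, i.e. frequency independent of concentration.
    mean_f = sum(log_f) / n
    rss_flat = sum((f - mean_f) ** 2 for f in log_f)
    return rss_contam, rss_flat

# Hypothetical data: one taxon whose frequency is roughly 0.05 / concentration.
concs = [0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
noise = [1.1, 0.9, 1.05, 0.95, 1.1, 0.9, 1.0]
freqs = [0.05 / c * e for c, e in zip(concs, noise)]

# All samples, low and high biomass together.
rss_contam_full, rss_flat_full = model_rss(freqs, concs)

# Only the high-biomass samples (concentration >= 10).
high = [(f, c) for f, c in zip(freqs, concs) if c >= 10.0]
rss_contam_high, rss_flat_high = model_rss(
    [f for f, _ in high], [c for _, c in high]
)

print("full range:   contaminant RSS", rss_contam_full, "flat RSS", rss_flat_full)
print("high biomass: contaminant RSS", rss_contam_high, "flat RSS", rss_flat_high)
```

With the low-biomass samples dropped, the flat model's misfit shrinks a lot, so there is less signal separating the two explanations. That is the sense in which the frequency method benefits from keeping the full range.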

We have measured 16S rRNA gene concentrations (ng/ul) and 16S rRNA gene copy numbers (copies/ul) for both biological specimens and negative controls from template prior to amplification. Which would you recommend to use as input value?

Please see this comment on an earlier issue: https://github.com/benjjneb/decontam/issues/33#issuecomment-436658460

Does that answer your question?

Since this is a validation dataset, we included duplicates or triplicates of each of the biological specimens (template amplified and sequenced more than once). Some biological specimens had higher biomass than others, but all of these were considered "biological representatives" as they did not cluster with negative controls. Would the inclusion of duplicates/triplicates interfere with or bias our results from the decontam package in any way? Should we rather include only one representative of each biological specimen?

When considering lab/reagent contaminants, I think including those duplicate/triplicate samples is perfectly fine, as they will still have reasonably independent "contamination processes", which is what matters in this case.

tellafiela commented 5 years ago

We only included specimens with higher biomass (specimens that did not cluster with negative controls) to represent biological specimens. Is this acceptable?

Are you using the "prevalence" method? If so, that is OK. If you are using the "frequency" method, you should keep the full range of samples.

We are using a combination of the prevalence and frequency methods.

We have measured 16S rRNA gene concentrations (ng/ul) and 16S rRNA gene copy numbers (copies/ul) for both biological specimens and negative controls from template prior to amplification. Which would you recommend to use as input value?

Please see this comment on an earlier issue: #33 (comment)

Does that answer your question?

I was actually wondering whether ng/ul or 16S copies/ul would be the better measure of input DNA for TEMPLATE. We quantify template using a 16S qPCR. Regarding the template versus post-amplification concentrations (post-amplification concentrations are measured using a dsDNA dye called Quantifluor): would you suggest using either of these, then (based on comment #33)?

Since this is a validation dataset, we included duplicates or triplicates of each of the biological specimens (template amplified and sequenced more than once). Some biological specimens had higher biomass than others, but all of these were considered "biological representatives" as they did not cluster with negative controls. Would the inclusion of duplicates/triplicates interfere with or bias our results from the decontam package in any way? Should we rather include only one representative of each biological specimen?

When considering lab/reagent contaminants, I think including those duplicate/triplicate samples is perfectly fine, as they will still have reasonably independent "contamination processes", which is what matters in this case.

So each sample gets evaluated independently and then based on these findings the "contaminants" are identified from the entire dataset - is this correct? We are using the prevalence and frequency method combined.

benjjneb commented 5 years ago

We are using a combination of the prevalence and frequency methods.

I would keep all the samples, then.
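For context on why the frequency requirement carries over: as far as I recall from the decontam documentation, the "combined" method merges the frequency and prevalence probabilities with Fisher's method, so the frequency evidence still feeds into every call. The Python sketch below shows Fisher's method for two p-values in general. It is not decontam's code, the function name `fisher_combine_two` is hypothetical, and the input values are invented.

```python
import math

def fisher_combine_two(p_freq, p_prev):
    """Combine two independent p-values with Fisher's method.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution
    with 4 degrees of freedom under the null, whose upper tail has the
    closed form exp(-x/2) * (1 + x/2). That simplifies to the expression
    returned below.
    """
    product = p_freq * p_prev
    if product == 0.0:
        return 0.0  # avoid log(0); the combined p-value is 0 in the limit
    return product * (1.0 - math.log(product))

# Hypothetical scores for one taxon from the two lines of evidence.
combined_both_small = fisher_combine_two(0.01, 0.02)
# Strong prevalence evidence, uninformative frequency evidence.
combined_one_weak = fisher_combine_two(0.9, 0.02)

print(combined_both_small, combined_one_weak)
```

The second case shows the point: if the frequency score is uninformative because the concentration range was truncated, the combined value is weaker than it would be with both lines of evidence agreeing.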

I was actually wondering whether ng/ul or 16S copies/ul would be the better measure of input DNA for TEMPLATE. We quantify template using a 16S qPCR. Regarding the template versus post-amplification concentrations (post-amplification concentrations are measured using a dsDNA dye called Quantifluor): would you suggest using either of these, then (based on comment #33)?

I think both will work well. I don't think there is any important difference between DNA concentration and 16S concentration in this case, since either way you are sequencing 16S DNA in the end.

So each sample gets evaluated independently and then based on these findings the "contaminants" are identified from the entire dataset - is this correct? We are using the prevalence and frequency method combined.

I don't think so. Contaminants are identified based on patterns over the whole dataset. My comment was that, implicit in that identification, there is an assumption that contaminants are introduced relatively independently among samples. Replicate biological samples don't interfere with the independence of contaminant introduction, so they are not a concern for decontam.
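To make "patterns over the whole dataset" concrete for the prevalence side: each taxon gets one score from its presence/absence across all negative controls and all true samples, rather than a verdict per sample. The Python sketch below uses a hand-rolled one-sided Fisher's exact test as a stand-in. decontam's own scoring may differ in detail, and the function name, sample counts, and taxa here are hypothetical. Replicates simply count as additional samples.

```python
from math import comb

def fisher_one_sided(neg_present, neg_total, samp_present, samp_total):
    """Probability of seeing at least `neg_present` positive controls
    if presence were unrelated to control status (hypergeometric tail)."""
    total = neg_total + samp_total
    present = neg_present + samp_present
    upper = min(neg_total, present)
    tail = 0
    for k in range(neg_present, upper + 1):
        # k positives among the controls, the rest among the true samples.
        tail += comb(present, k) * comb(total - present, neg_total - k)
    return tail / comb(total, neg_total)

# Hypothetical dataset: 5 negative controls and 15 true samples
# (the 15 may include duplicate/triplicate runs of the same specimen).
# Taxon A: in 5 of 5 controls but only 2 of 15 samples.
p_taxon_a = fisher_one_sided(5, 5, 2, 15)
# Taxon B: in 1 of 5 controls and 14 of 15 samples.
p_taxon_b = fisher_one_sided(1, 5, 14, 15)

print("taxon A:", p_taxon_a)  # small value: enriched in controls
print("taxon B:", p_taxon_b)  # large value: not enriched in controls
```

Each score depends on the counts across every control and sample at once, which is why no single sample is "evaluated independently".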

benjjneb / decontam

Which specimens to include in analysis #46
