benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Selection of negative controls and biological specimens #31

Open tellafiela opened 5 years ago

tellafiela commented 5 years ago

I am exploring the use of the decontam package to remove contaminants from our dataset. We are analysing nasopharyngeal specimens with a wide range of concentrations (median 16S rRNA gene Cq values ranging from 19 to 34, corresponding to median 16S rRNA gene concentrations of 1.6 to 0.00002 ng/ul).

We also included negative controls spiked with DNA from a known organism, which was serially diluted (Cq values ranging from 17 to 28; concentrations ranging from 8.2 to 0.00056 ng/ul).

In addition, we included negative controls without DNA spike-in (Cq values ~34; concentration ~0.00012 ng/ul), which would generally be referred to as no-template controls.

Could you kindly let me know which approach would be best given this information:

1) Should we include all samples (Cq values ranging between 19 and 34) as "biological samples" when using decontam, or only a subset (for example, samples with Cq values <26) to better represent "biological profiles", as samples with high Cq values may only represent contaminants?
2) If yes to 1), how do we select a cutoff? Would the "expect library sizes" below be a good approach? (image.png)
3) Would you recommend using the negative controls without spike-in DNA (n=41) or the negative controls with spike-in DNA at the lowest concentration (n=7) as "negative controls" when using the package?
4) Would it be useful to "validate" our results obtained from 1) and 2) by specifying the negative controls with spike-in DNA (entire range of dilutions) as "biological samples" and the negative controls without spike-in DNA as "negative controls" using decontam?

benjjneb commented 5 years ago
  1. Should we include all samples (Cq values ranging between 19 and 34) as "biological samples" when using decontam or only a subset (for example, samples with Cq values <26) to better represent "biological profiles" as samples with high Cq values may only represent contaminants?

The full range should be OK. The model selection approach used in decontam is relatively robust to some deviation from its assumptions (e.g. that the sample DNA concentration S is greater than the contaminant DNA concentration C) over a portion of the data. However, I would make sure that at least most of the data being used is in the S > C regime. Also, note that you need to use the estimated concentrations as input to the frequency method, not the Cq values: Cq is related to the logarithm of the concentration, and decreases as concentration increases, so it is not proportional to it.
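To illustrate why Cq cannot stand in for concentration: under the usual qPCR standard-curve model, Cq is linear in the logarithm of the concentration, so one extra cycle corresponds to roughly half the template. Here is a minimal Python sketch of the conversion, assuming a standard curve of the form Cq = slope * log10(conc) + intercept. The function name and the default slope and intercept are hypothetical placeholders, not values from this dataset; in practice they would come from your own dilution series.

```python
def cq_to_concentration(cq, slope=-3.3219, intercept=20.0):
    """Invert the standard curve Cq = slope * log10(conc) + intercept.

    slope and intercept are placeholder values (a slope near -3.32
    corresponds to roughly 100% amplification efficiency); substitute
    the values fitted from your own standard curve.
    """
    return 10 ** ((cq - intercept) / slope)


# One extra cycle means about half the template, so Cq differences are
# fold-changes in concentration, not proportional changes.
ratio = cq_to_concentration(25.0) / cq_to_concentration(26.0)
```

The concentrations obtained this way, rather than the raw Cq values, are what would be supplied as the `conc` argument of `isContaminant` for the frequency method.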

  2. If yes, to 1) - how do we select a cutoff? Would the "expect library sizes" below be a good approach?

You could set a cutoff where the concentrations of real samples start overlapping the concentrations of true negative controls, as that suggests S ~ C at that point.
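One way to turn this into a concrete rule is sketched below in Python. It takes the highest concentration observed among the true negative controls as the cutoff and separates the real samples into those above it and those at or below it. Using the maximum is an assumption made for illustration (a high percentile would be less sensitive to a single outlying control), and the function names and concentrations are made up.

```python
def concentration_cutoff(negative_control_concs):
    """Highest concentration seen in the true negative controls.

    Assumption for illustration: samples at or below this value are
    treated as being in the S ~ C regime.
    """
    return max(negative_control_concs)


def split_by_cutoff(sample_concs, cutoff):
    """Separate samples into those above and those at or below the cutoff."""
    above = {name: c for name, c in sample_concs.items() if c > cutoff}
    below = {name: c for name, c in sample_concs.items() if c <= cutoff}
    return above, below


# Made-up concentrations in ng/ul, for illustration only.
negatives = [0.00008, 0.00012, 0.00015]
samples = {"s1": 1.6, "s2": 0.02, "s3": 0.0001, "s4": 0.00002}

cutoff = concentration_cutoff(negatives)
above, below = split_by_cutoff(samples, cutoff)
```

Samples in `below` overlap the negative controls and are the ones to consider excluding if they make up a large share of the data.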

  3. Would you recommend using the negative controls without spike-in DNA (n=41) OR negative-controls with spike-in DNA (at the lowest concentration) n=7 as "negative controls" when using the package?

I would just use the controls without spike-in DNA, especially given that you have a good number of them (41).

  4. Would it be useful to "validate" our results obtained from 1) and 2) by specifying the negative controls with spike-in DNA (entire range of dilutions) as "biological samples" and negative controls without spike-in DNA as "negative controls" using decontam?

I'm not sure I understand this, but I think you might be suggesting the following: identify contaminants using the frequency method over the real samples and the prevalence method over the real samples and negative controls (both excluding the spike-in controls), and then evaluate those contaminant identifications on the spike-in controls. If so, that seems like a good strategy to me, and it should give you a sense of the fraction of contaminant reads that are being removed. Also, if a significant fraction of contaminant reads fail to be removed, it may indicate substantial cross-contamination in your experiment, which decontam does not effectively remove.
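As a rough Python sketch of that evaluation: in a spike-in control, every read that does not belong to the spiked organism can be treated as a contaminant read, so the fraction of those reads falling in taxa flagged as contaminants estimates how much contamination is being removed. The function name, taxon labels and counts below are hypothetical, and the sketch assumes the spike-in organism maps to a single known taxon.

```python
def contaminant_read_fraction_removed(counts, spike_in_taxon, flagged):
    """Fraction of non-spike-in reads that belong to flagged taxa.

    counts: dict mapping taxon -> read count, pooled over the spike-in
        controls.
    spike_in_taxon: taxon of the spiked organism (assumed known).
    flagged: set of taxa identified as contaminants.
    Returns None when there are no non-spike-in reads to evaluate.
    """
    contaminant_reads = {t: n for t, n in counts.items() if t != spike_in_taxon}
    total = sum(contaminant_reads.values())
    if total == 0:
        return None
    removed = sum(n for t, n in contaminant_reads.items() if t in flagged)
    return removed / total


# Made-up read counts, for illustration only.
counts = {"spike": 9000, "taxA": 600, "taxB": 300, "taxC": 100}
flagged = {"taxA", "taxB"}
fraction = contaminant_read_fraction_removed(counts, "spike", flagged)
```

A low value here would be the signal mentioned above that something other than reagent contamination, such as cross-contamination between samples, may be contributing reads.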