benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

How to use controls in multiple small batches over one experiment #132

Open AlexaBennett opened 1 year ago

AlexaBennett commented 1 year ago

I have a two-part question about small batches, where processing 5-6 controls per batch is not financially feasible. Context follows below.

  1. Is the recommendation of 5-6 controls a relatively fixed value, or does it scale with the sample size used in the simulations? I.e., do 30 samples require 5-6 controls or 2-3?

  2. Following from this, can one use the prevalence method for an entire experiment if each batch contains only 1-2 controls, but the experiment as a whole contains 10-20 process controls?

Now for the critical context. I am multiplexing samples with a panel of amplicon targets on each sequencing run. Due to compositional and financial limits, each batch contains one process control. The process control includes the containers and solution for collection, filtration, extraction, amplification, and sequencing. Over the entire experiment, I expect to have approximately 15 to 20 of these small batches, and thus 15 to 20 controls for each amplicon target. Should I still use a post-hoc method for contextualizing the results regarding possible contaminants, as recommended in issue #38? Or, could I implement decontam (per amplicon target) with the caveat that it is an imperfect representation of the entire dataset?

benjjneb commented 1 year ago

> Is the recommendation of 5-6 controls a relatively fixed value, or does it scale with the sample size used in the simulations? I.e., do 30 samples require 5-6 controls or 2-3?

I would consider 3 the absolute minimum for decontam prevalence to be worthwhile, and 5 a recommended minimum. This holds for small study sizes. Larger studies should use more controls, especially if sampling low-biomass environments and over multiple batches.
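
To give a rough feel for why so few controls carry little weight, here is a minimal Python sketch. It is not decontam's prevalence score: it swaps in a plain one-sided Fisher exact test as a stand-in, and the scenario (30 true samples, a contaminant seen in every control and in half of the samples) and the name `best_case_p` are assumptions made up for illustration.

```python
from math import comb

def best_case_p(n_controls, n_samples=30, sample_prevalence=0.5):
    """One-sided Fisher exact p-value for a hypothetical taxon that is
    present in every control and in `sample_prevalence` of true samples."""
    present_samples = round(n_samples * sample_prevalence)
    total = n_samples + n_controls          # all libraries
    present = present_samples + n_controls  # libraries containing the taxon
    # Chance that all controls land among the 'present' libraries
    # if presence were unrelated to control status.
    return comb(present, n_controls) / comb(total, n_controls)

# How the best achievable p-value shrinks as controls are added.
for n in range(1, 7):
    print(n, round(best_case_p(n), 3))
```

In this toy scenario even a contaminant that shows up in every single control cannot be separated from chance with one or two controls, and the p-value only falls below 0.05 at five controls. The exact numbers depend entirely on the assumed scenario and test.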

> Following from this, can one use the prevalence method for an entire experiment if each batch contains only 1-2 controls, but the experiment as a whole contains 10-20 process controls?

Yes, just do decontamination on the whole study, ignoring the batches.

> Or, could I implement decontam (per amplicon target) with the caveat that it is an imperfect representation of the entire dataset?

Nothing is perfect. I would probably just ignore the batches, use decontam prevalence on the whole study, and then also consider post hoc inspection of any taxa that pop up in my analysis.
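
As a sketch of what "ignore the batches" means in practice, the Python below pools presence/absence across all batches before comparing controls with true samples. In decontam itself this would be a call to `isContaminant` with `method="prevalence"` and no batch specified; the code here is only an illustration, not the package's implementation. It uses a one-sided Fisher exact test as a stand-in for the prevalence score, and the toy data, the 0.05 cutoff, and the names `fisher_greater` and `pooled_scores` are all hypothetical.

```python
from math import comb

def fisher_greater(ctrl_pos, ctrl_n, samp_pos, samp_n):
    """One-sided Fisher exact p-value: is the taxon more prevalent in controls?"""
    total, present = ctrl_n + samp_n, ctrl_pos + samp_pos
    upper = min(ctrl_n, present)
    # Sum hypergeometric terms for tables at least as extreme as observed.
    tail = sum(comb(present, k) * comb(total - present, ctrl_n - k)
               for k in range(ctrl_pos, upper + 1))
    return tail / comb(total, ctrl_n)

def pooled_scores(libraries):
    """Score each taxon using every library in the study; batch labels are ignored."""
    controls = [taxa for _, is_ctrl, taxa in libraries if is_ctrl]
    samples = [taxa for _, is_ctrl, taxa in libraries if not is_ctrl]
    all_taxa = set().union(*controls, *samples)
    return {
        t: fisher_greater(sum(t in c for c in controls), len(controls),
                          sum(t in s for s in samples), len(samples))
        for t in sorted(all_taxa)
    }

# Hypothetical toy study: 4 batches, each with 1 process control and 3 samples.
# Entries are (batch label, is_control, taxa detected in that library).
libraries = []
for b in range(4):
    first_sample = {"Bacteroides", "Ralstonia"} if b < 2 else {"Bacteroides"}
    libraries.append((f"batch{b}", True, {"Ralstonia"}))
    libraries.append((f"batch{b}", False, first_sample))
    libraries.append((f"batch{b}", False, {"Bacteroides"}))
    libraries.append((f"batch{b}", False, {"Bacteroides"}))

scores = pooled_scores(libraries)
# Arbitrary illustrative cutoff; flagged taxa are candidates for post hoc inspection.
flagged = [taxon for taxon, p in scores.items() if p < 0.05]
print(scores)
print(flagged)
```

With one control per batch, no single batch could support such a comparison on its own; pooling the four controls is what makes the test possible at all. Any taxon flagged this way would still deserve the post hoc look described above.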

AlexaBennett commented 1 year ago

> I would consider 3 the absolute minimum for decontam prevalence to be worthwhile, and 5 a recommended minimum. This holds for small study sizes. Larger studies should use more controls, especially if sampling low-biomass environments and over multiple batches.

I will keep that in mind for future projects where we will have larger batches.

> I would probably just ignore the batches, use decontam prevalence on the whole study, and then also consider post hoc inspection of any taxa that pop up in my analysis.

Perfect, this is the approach I was going to take and it is reassuring to have your input.

My sincerest thanks for the prompt reply!