benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Best way to use controls for multiple batches? #135

Open DrLCode opened 1 year ago

DrLCode commented 1 year ago

Hi everyone,

I'm planning on using Decontam (prevalence approach) to identify and filter possible contaminants from skin microbiome sequencing data and was wondering if anyone would be able to provide some input as to the best way to utilise my controls for this. For context I have ~120 animal samples, a negative control collected each day of sample collection (N=9), a negative extraction control from each extraction batch (N=13) and a PCR/sequencing control from each sequencing batch (N=2).

I had originally planned to subset my data manually into the appropriate batches, run decontam, prune and reassemble before repeating for the next set of controls (i.e. decontam the 2 sequencing batches separately, then decontam the 13 extraction batches separately and finally decontam the 9 collection days separately). From reading the available information and looking through this forum, it seems that plan would be ill-advised, since there would be only one control in each analysis.

These samples are from multiple animals with different characteristics, collected on separate occasions, so I'm apprehensive about running decontam on all samples and all controls together: a legitimate contaminant that appeared in a single control and in all the samples that control directly applies to could be lost amongst the rest.

Would splitting my samples into subsets that include all 3 controls associated with each sample (creating lots of very small subsets) be an appropriate approach?

Any comments are much appreciated!

benjjneb commented 1 year ago

It is not advised to split into many small subsets.

The prevalence method relies on multiple negative control samples to have the statistical power to effectively discriminate between contaminants and non-contaminants. Although the animals might be different, if you are using the same measurement protocol throughout, the contaminants being introduced should be consistent, and that is what is important to the decontam method.
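To illustrate the statistical-power point, here is a minimal sketch in Python. It is not decontam's code or its exact score; it assumes a one-sided Fisher exact test on a presence/absence table as a simplified stand-in for a prevalence-style comparison. The function name and the counts (about 9 samples per extraction batch, and the detection counts in the pooled case) are made up for illustration.

```python
from math import comb


def fisher_one_sided(neg_present, neg_total, samp_present, samp_total):
    """One-sided Fisher exact p-value for enrichment in negative controls.

    Under the hypergeometric null (detections spread at random over all
    libraries), returns the probability that at least `neg_present` of
    the feature's detections fall among the negative controls.
    """
    total = neg_total + samp_total
    present = neg_present + samp_present
    upper = min(neg_total, present)
    tail = sum(
        comb(neg_total, k) * comb(samp_total, present - k)
        for k in range(neg_present, upper + 1)
    )
    return tail / comb(total, present)


# Hypothetical small subset: 1 control and 9 samples.
# Best case for detection: feature seen in the control and in no sample.
p_single = fisher_one_sided(neg_present=1, neg_total=1,
                            samp_present=0, samp_total=9)

# Hypothetical pooled analysis: 9 controls and 120 samples.
# Feature seen in 6 of 9 controls and in 6 of 120 samples.
p_pooled = fisher_one_sided(neg_present=6, neg_total=9,
                            samp_present=6, samp_total=120)

print(f"single-control subset: p = {p_single:.3g}")
print(f"pooled controls:       p = {p_pooled:.3g}")
```

With one control and n samples, the smallest p-value this test can ever return is 1/(n+1), however clear-cut the contaminant is. Pooling the controls removes that floor, which is why the pooled case can give a p-value orders of magnitude smaller even with imperfect separation.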

Furthermore, the most effective negative controls to use with decontam (or other contaminant ID methods) are those that went through as much of your measurement methodology as possible, ideally a sterile sampling instrument that was exposed to the sampling environment but not used to actually perform sampling. That sounds like it might correspond to your "negative control collected each day of sample collection". Other types of negative controls introduced later are not as effective, as they can only inform about contaminants introduced after that step in the measurement process.