benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

When are assumptions for decontam violated? #120

Open pslai opened 1 year ago

pslai commented 1 year ago

I am running an experiment in which we test 5 different ways to deplete host DNA on three types of low microbial biomass respiratory samples (sputum, nasal swabs, bronchoalveolar lavage) prior to deep metagenomic sequencing. The host DNA fraction ranges from 16.2% for some host DNA depletion methods to 82% for untreated samples. When I run decontam, it flags as contaminants some microbial species that would be unusual as kit or environmental contaminants.

I'm wondering if the nature of our study design violates the assumptions decontam relies on to detect contaminants. Decontam assumes that (1) sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration, and (2) sequences from contaminating taxa are likely to have higher prevalence in control samples than in true samples. Since we are doing metagenomic sequencing on respiratory samples with high host DNA, I worry that assumption (1) doesn't hold. Some host depletion methods are very effective, and since we attempted to sequence all samples to the same depth, the host-depleted samples end up with a much deeper "effective" sequencing depth once host reads are removed from the sequencing data.
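To make assumption (1) concrete, here is a minimal Python sketch under idealized conditions. All numbers are made up for illustration and the helper `loglog_slope` is hypothetical, not part of decontam. It assumes a contaminant contributes a constant absolute amount of DNA to every sample, while a true taxon makes up a constant share of the sample. Under those assumptions the contaminant's frequency falls with a log-log slope of about -1 against total DNA concentration, and the true taxon's slope is about 0.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Hypothetical total DNA concentrations (arbitrary units) for five samples.
total_conc = [1.0, 2.0, 5.0, 10.0, 50.0]

# Idealized contaminant: constant absolute amount in every sample,
# so its share of the total shrinks as the total grows.
contaminant_amount = 0.05
contaminant_freq = [contaminant_amount / t for t in total_conc]

# Idealized true taxon: constant share of the sample regardless of total.
true_taxon_freq = [0.2 for _ in total_conc]

contaminant_slope = loglog_slope(total_conc, contaminant_freq)
true_taxon_slope = loglog_slope(total_conc, true_taxon_freq)

print(f"contaminant log-log slope: {contaminant_slope:.2f}")
print(f"true taxon log-log slope:  {true_taxon_slope:.2f}")
```

Real data are noisy, so the slopes will only approximate these values; the sketch just shows the pattern the frequency-based approach looks for.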

Any suggestions on whether we can use decontam at all, or whether we can tweak decontam parameters to address these issues? We could modify our sample DNA concentration to calculate "microbial" sample DNA concentration using our % host DNA calculations from either sequencing, or from 16S qPCR.

benjjneb commented 9 months ago

Great question and I apologize for missing this when it would have been timely.

Decontam assumes that the input DNA is a mixture of sample DNA and contaminant DNA (S + C), and that the DNA concentration measurements quantify the concentration of that mixture. As you've identified, there is a potential issue here: host DNA. In a "standard" workflow that isn't a problem, because the host DNA is just part of the sample DNA (albeit an unwanted part), so the decontam assumptions are OK. However, your study uses a host depletion step. That may be an issue if the DNA concentration measurements being used were taken prior to host depletion. In that case the quantified mixture is not S + C but S + C + H, where the H was removed before sequencing but after DNA quantitation. The prevalence method could also be affected if the negative controls don't go through the same host depletion process.
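A rough numeric sketch of the S + C + H problem, in Python. The amounts, the depletion efficiencies, and the function `contaminant_read_frequency` are all hypothetical. It assumes quantitation happens before depletion and that depletion removes only host DNA. Two samples with the same measured concentration then end up with very different contaminant frequencies in the sequenced material, which is the mismatch described above.

```python
def contaminant_read_frequency(sample, contaminant, host, depletion):
    """Contaminant share of the sequenced DNA.

    `depletion` is the fraction of host DNA (0 to 1) removed after
    quantitation; sample and contaminant DNA are assumed untouched.
    """
    if not 0.0 <= depletion <= 1.0:
        raise ValueError("depletion must be between 0 and 1")
    remaining_host = host * (1.0 - depletion)
    return contaminant / (sample + contaminant + remaining_host)


# Hypothetical DNA amounts (arbitrary units): S, C and H as in the text.
S, C, H = 1.0, 0.1, 8.9

# What a pre-depletion assay would report for both samples.
measured_conc = S + C + H

freq_untreated = contaminant_read_frequency(S, C, H, depletion=0.0)
freq_depleted = contaminant_read_frequency(S, C, H, depletion=0.9)

print(f"measured concentration (both samples): {measured_conc:.1f}")
print(f"contaminant frequency, untreated:      {freq_untreated:.4f}")
print(f"contaminant frequency, 90% depleted:   {freq_depleted:.4f}")
```

Because the measured concentration is identical while the frequency differs, a frequency-versus-concentration fit across samples with different depletion efficiencies would no longer follow the expected inverse relationship.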

> We could modify our sample DNA concentration to calculate "microbial" sample DNA concentration using our % host DNA calculations from either sequencing, or from 16S qPCR.

In theory, this sounds reasonable.
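For illustration, the adjustment could be as simple as scaling each sample's measured concentration by its non-host fraction. This Python sketch uses a made-up total concentration and a hypothetical helper, `microbial_concentration`; the two host fractions are the extremes quoted in the question. It assumes the host fraction describes the same material the concentration was measured on, which would need checking if quantitation and the host estimate come from different stages of the workflow.

```python
def microbial_concentration(total_conc, host_fraction):
    """Estimate non-host DNA concentration from a total concentration
    and the fraction of that DNA attributed to host (0 <= fraction < 1)."""
    if total_conc < 0:
        raise ValueError("total_conc must be non-negative")
    if not 0.0 <= host_fraction < 1.0:
        raise ValueError("host_fraction must be in [0, 1)")
    return total_conc * (1.0 - host_fraction)


# Hypothetical measured concentration (e.g. ng/uL); host fractions are the
# 16.2% and 82% extremes mentioned in the question.
total_conc = 10.0
for host_fraction in (0.162, 0.82):
    adjusted = microbial_concentration(total_conc, host_fraction)
    print(f"host fraction {host_fraction:.3f} -> adjusted conc {adjusted:.2f}")
```

The adjusted values would then stand in for the per-sample concentrations supplied to decontam's frequency-based method (the `conc` argument of `isContaminant`), in place of the raw measurements.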