benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Distribution of scores assigned by decontam #70

Closed by cdeanj 4 years ago

cdeanj commented 4 years ago

Hi Dr. Callahan,

I have sequenced negative controls and generated 16S qPCR values and used them as input to decontam in the following way:

contamdf.comb <- isContaminant(ps, method="combined", neg="is.neg", conc="CopyNumber")

Following this, I inspected the distribution of the composite scores assigned by the isContaminant function:

hist(contamdf.comb$p, 100)

[Image: histogram of contamdf.comb$p with 100 bins]

The scores appear to display a bimodal distribution with peaks at 0.8 and 1.0, indicating that most of the ASVs fall within this high score range and are likely not contaminants. Would I be justified in choosing a threshold between 0.1 and 0.6 to remove the putative contaminants? Just want to make sure I understand the purpose of this parameter.

Thanks! Chris

benjjneb commented 4 years ago

Yep, I concur with your analysis.

I'd probably pick a more stringent threshold (e.g. 0.6) or more relaxed threshold (e.g. 0.1) depending on whether it was more important to my study to maximally control contamination even at the risk of losing a few real taxa (stringent), or to remove the most egregious contaminants even at the risk of letting a few contaminants through (relaxed).
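To make the trade-off concrete, here is a minimal base-R sketch of how the two thresholds would partition a set of scores. The data frame `contamdf.demo`, its score values, and the helper `flag_contaminants` are hypothetical and invented for illustration. The sketch assumes that a feature is called a contaminant when its score falls below the threshold.

```r
# Hypothetical stand-in for the output of isContaminant(); only the score
# column `p` is mimicked, and the values are invented for illustration.
contamdf.demo <- data.frame(
  p = c(0.02, 0.08, 0.25, 0.45, 0.55, 0.82, 0.85, 0.88, 0.97, 0.99, 1.00)
)

# Hypothetical helper: flag a feature when its score is below the threshold
# (assumed to mirror how the threshold is applied to the scores).
flag_contaminants <- function(scores, threshold) {
  !is.na(scores) & scores < threshold
}

# Relaxed threshold: only the most egregious contaminants are flagged.
n_relaxed <- sum(flag_contaminants(contamdf.demo$p, threshold = 0.1))

# Stringent threshold: more features are flagged, at the risk of losing
# a few real taxa.
n_stringent <- sum(flag_contaminants(contamdf.demo$p, threshold = 0.6))

cat("flagged at 0.1:", n_relaxed, "\n")
cat("flagged at 0.6:", n_stringent, "\n")
```

With real output, `contamdf.comb$p` would take the place of the demo column. The threshold can also be passed directly through the `threshold` argument of `isContaminant`.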

cdeanj commented 4 years ago

Hi Dr. Callahan,

I have an additional question regarding the histogram of composite scores I presented a couple of days ago.

I take the two peaks to correspond to the composite scores generated by the contaminant and non-contaminant models: the far-right peak corresponds to ASVs better explained by the non-contaminant model, and the left peak to ASVs better explained by the contaminant model.

If this interpretation is correct, why are the composite scores surrounding the left peak so high? I would have assumed that they would have been lower, since lower scores indicate that the contaminant model is a better fit.

Thanks! Chris

benjjneb commented 4 years ago

> If this interpretation is correct, why are the composite scores surrounding the left peak so high? I would have assumed that they would have been lower, since lower scores indicate that the contaminant model is a better fit.

All scores over 0.5 indicate that the non-contaminant model was a better fit. What you see here are two score modes, both of which are better fit by the non-contaminant model. My first guess is that the "lower-score" mode of ~0.85 might be ASVs that appear in fewer samples, and for that reason the non-contaminant model can't be preferred as strongly, but it is still preferred.

When contamination is a major factor, there will also be a contaminant mode at scores below 0.5.
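One way to check the guess about the ~0.85 mode is to compare prevalence between the two score modes. The sketch below is base R on invented numbers. `contamdf.demo` is a hypothetical stand-in that mimics only the `p` and `prev` columns of the `isContaminant` output, and the 0.92 split point between the modes is an assumption chosen by eye, not something decontam defines.

```r
# Hypothetical stand-in for the output of isContaminant(): `p` is the score,
# `prev` the number of samples each ASV appears in. Values are invented.
contamdf.demo <- data.frame(
  p    = c(0.83, 0.85, 0.86, 0.88, 0.97, 0.98, 0.99, 1.00),
  prev = c(3, 4, 2, 5, 20, 35, 28, 41)
)

# Assign each ASV to a score mode. The 0.92 boundary is an assumed,
# eyeballed split between the two peaks of the histogram.
contamdf.demo$mode <- cut(
  contamdf.demo$p,
  breaks = c(0, 0.5, 0.92, 1),
  labels = c("below 0.5", "lower mode", "upper mode"),
  include.lowest = TRUE
)

# Median prevalence per mode; modes with no ASVs come back as NA.
prev_by_mode <- tapply(contamdf.demo$prev, contamdf.demo$mode, median)
print(prev_by_mode)
```

If the lower mode shows a clearly lower median prevalence in real data, that would be consistent with the explanation that those ASVs appear in fewer samples.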