benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Prevalence method in decontam not identifying contaminants with a single control sample #152

Open JayalalKJ opened 1 week ago

JayalalKJ commented 1 week ago

Hi, I have one control sample, and the prevalence method in Decontam is not effectively identifying contaminants. The p-values seem distributed evenly, and the isContaminant() function isn’t marking many sequences as contaminants even with an aggressive threshold. I need advice on how to proceed or any alternative approaches or changes to the following code.

What I attempted (e.g., using threshold = 0.3):

library(phyloseq)
library(decontam)
library(ggplot2)

# Flag the negative control samples
sample_data(physeq)$is.neg <- sample_data(physeq)$Sample_or_Control == "Control Sample"

# Identify contaminants using the prevalence method with an aggressive threshold
contamdf.prev <- isContaminant(physeq, method = "prevalence", neg = "is.neg", threshold = 0.3)
table(contamdf.prev$contaminant)

# Convert to presence/absence, then split into negative controls and true samples
ps.pa <- transform_sample_counts(physeq, function(abund) 1 * (abund > 0))
ps.pa.neg <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "Control Sample", ps.pa)
ps.pa.pos <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "True Sample", ps.pa)

# Create a data frame for visualization
df.pa <- data.frame(
  pa.pos = taxa_sums(ps.pa.pos),
  pa.neg = taxa_sums(ps.pa.neg),
  contaminant = contamdf.prev$contaminant
)

# Plot the prevalence of taxa in true samples vs. negative controls
ggplot(data = df.pa, aes(x = pa.neg, y = pa.pos, color = contaminant)) +
  geom_point() +
  xlab("Prevalence (Negative Controls)") +
  ylab("Prevalence (True Samples)")

# Prune contaminants
physeq_clean <- prune_taxa(!contamdf.prev$contaminant, physeq)

benjjneb commented 1 week ago

> I have one control sample

decontam-prevalence is not an appropriate method when you have only one negative control sample. It relies on repeated observation of contaminants across multiple negative controls. We recommend a minimum of 5; see our original paper for more on that: https://doi.org/10.1186/s40168-018-0605-2
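
To illustrate why a single negative control leaves so little to work with, here is a minimal base-R sketch. It runs a one-sided Fisher's exact test on made-up presence/absence counts as a stand-in for a prevalence comparison; it is not decontam's internal code. The helper `prev_pvalue` and all counts are hypothetical, chosen only for illustration.

```r
# Hypothetical helper (not part of decontam): one-sided Fisher's exact test
# asking whether a taxon is present more often in negative controls
# than in true samples.
prev_pvalue <- function(neg_present, n_neg, pos_present, n_pos) {
  tab <- matrix(
    c(neg_present, n_neg - neg_present,
      pos_present, n_pos - pos_present),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("negative", "true"), c("present", "absent"))
  )
  fisher.test(tab, alternative = "greater")$p.value
}

# Made-up scenario: a taxon seen in 2 of 20 true samples.
# Case 1: present in the only negative control.
p_one <- prev_pvalue(neg_present = 1, n_neg = 1, pos_present = 2, n_pos = 20)

# Case 2: present in all 5 negative controls.
p_five <- prev_pvalue(neg_present = 5, n_neg = 5, pos_present = 2, n_pos = 20)

print(c(one_control = p_one, five_controls = p_five))
```

With these illustrative counts, the single-control case gives a p-value of 3/21 (about 0.14), while the five-control case falls well below 0.001. Under this simplified test, one control cannot separate a contaminant from chance even in the most favorable case, which is consistent with the advice to use multiple negative controls.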