benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

A LOT of contamination in controls #109

Open sash2566 opened 2 years ago

sash2566 commented 2 years ago

Hi, I am very new to this, so I might have some very basic questions.

I have quite a lot of contamination in my controls and am therefore trying to use decontam (prevalence-based) to work out how to deal with it. This is what the library sizes look like: [image]

This is what the prevalence plot looks like: [image]

I used a threshold of 0.5: [image]

I then removed all ASVs that did not pass this threshold. However, when I look at the OTU table, I still see far more reads in the controls than in the samples for some ASVs. This sort of makes sense, since some controls have a much bigger read count than the samples for some ASVs, but then I don't know why those ASVs are still there if they are not that "prevalent" in the samples. What do you suggest I do? Should I delete all ASVs that are there in the controls from the OTU table, or is there another way to delete the contamination? I do not want to get rid of "real" data but I also do not want to be keeping contamination.

benjjneb commented 2 years ago

I have not worked with datasets in which the negative controls had substantially higher read counts on average than the real samples. I just want to say that up front, because there may be something going on here that I don't understand.

That said, the histogram of decontam scores you plotted is highly concordant with the "mixture" model of contaminants and non-contaminants that decontam is based on. There is a low-score mode of contaminants, and there is a >0.5 score mode of non-contaminants. Really, any threshold between 0.1 and 0.5 is justified by that histogram. Your preference for stringency in removing contaminants versus avoiding false-positive contaminant identifications is a valid way to choose within that range.

Should I delete all ASVs that are there in the controls from the otu table or is there another way to delete the contamination?

This approach becomes really problematic when cross-contamination between samples is present, which it usually is to varying degrees. The most abundant real ASV in the real samples is also the one most likely to be carried into the negative controls by cross-contamination. I would not recommend the blanket deletion of all ASVs detected in controls here.
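To make the risk concrete, here is a minimal Python sketch, not decontam itself, of what blanket deletion does to a toy count table. All ASV names and read counts are made up for illustration, and `blanket_delete` and `sample_reads` are hypothetical helpers. The scenario assumes a few reads of the dominant real ASV leaked into one control.

```python
# Hypothetical toy counts: ASV -> (reads in each real sample, reads in each negative control).
counts = {
    # Dominant real taxon; a few reads leaked into one control via cross-contamination.
    "ASV1_dominant_real": ([5000, 4200, 6100], [12, 0]),
    # Real taxon never seen in a control.
    "ASV2_real": ([300, 250, 410], [0, 0]),
    # Reagent contaminant: abundant in controls, rare in samples.
    "ASV3_reagent": ([5, 0, 8], [900, 1100]),
}


def blanket_delete(table):
    """Keep only ASVs with zero reads in every negative control."""
    return {asv: reads for asv, reads in table.items() if sum(reads[1]) == 0}


def sample_reads(table):
    """Total reads across the real samples only."""
    return sum(sum(samples) for samples, _controls in table.values())


kept = blanket_delete(counts)
lost_fraction = 1 - sample_reads(kept) / sample_reads(counts)

print("ASVs kept:", sorted(kept))
print(f"Sample reads removed: {lost_fraction:.1%}")
```

In this toy table the blanket rule does remove the reagent contaminant, but it also discards the dominant real ASV because of 12 stray reads in one control, so most of the real sample reads are lost.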

I do not want to get rid of "real" data but I also do not want to be keeping contamination.

There is no perfect decontamination, but from what I see here, removing those ASVs with decontam scores < THRESHOLD, with THRESHOLD between 0.1 and 0.5, looks like an effective approach.
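As a rough illustration of how the choice within that range plays out, the Python sketch below applies the "score < THRESHOLD" rule at both ends of the range. The scores are invented, `flag_contaminants` is a hypothetical helper, and treating a missing score as "not flagged" is an assumption of this sketch. In practice the scores come from decontam, where the cutoff is set with the `threshold` argument of `isContaminant`.

```python
# Hypothetical decontam-style scores per ASV (low score = more contaminant-like).
scores = {
    "ASV_a": 0.01,
    "ASV_b": 0.04,
    "ASV_c": 0.30,
    "ASV_d": 0.62,
    "ASV_e": 0.95,
    "ASV_f": None,  # no score available; this sketch assumes such ASVs are not flagged
}


def flag_contaminants(score_table, threshold):
    """Return the set of ASVs whose score is strictly below the threshold."""
    return {
        asv
        for asv, score in score_table.items()
        if score is not None and score < threshold
    }


# Compare the lenient and stringent ends of the 0.1 to 0.5 range.
for threshold in (0.1, 0.5):
    flagged = flag_contaminants(scores, threshold)
    print(f"threshold={threshold}: flagged {sorted(flagged)}")
```

Only ASVs whose scores fall between the two cutoffs change status, so the practical question is whether you would rather risk keeping those few as possible contaminants or risk discarding them as possible real taxa.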