benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Prevalence filtering after decontam? #15

Closed: sachasuca closed this issue 6 years ago

sachasuca commented 6 years ago

Thanks for creating another great R package!

In the preprint, you suggest combining decontam with low-prevalence filtering as a way to effectively remove contaminants. I assume you are suggesting low-prevalence filtering should occur after removing sequences identified as contaminants using decontam. Is my understanding correct? Is there any situation in which you could envision applying the low-prevalence filtering before using decontam?

Finally, from what I can tell, setting a low-prevalence filter (0.001%, 0.1%, 1%, etc.) is a bit arbitrary. Other than visualizing prevalence (as suggested in Fig. 3 of the Bioconductor Workflow for Microbiome Data Analysis), do you know of any ways to make this decision less arbitrary and to ensure it will not introduce bias into a downstream analysis such as differential abundance?

Thanks!

P.S. Please let me know if you prefer another forum to ask non-technical/non-code-related questions.

benjjneb commented 6 years ago

I assume you are suggesting low-prevalence filtering should occur after removing sequences identified as contaminants using decontam. Is my understanding correct?

To clarify, when we say a low-prevalence filter, we mean removing features (e.g. ASVs/OTUs/taxa) that appear in fewer than [CUTOFF] samples. For example, we often remove features that appear in only one sample, as they are generally irrelevant to later statistical analyses, and we can't determine whether such features are contaminants or not. Also, when we remove low-prevalence features, we remove them from the whole table, not on a per-sample basis.
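For concreteness, a minimal sketch of that kind of whole-table prevalence filter in R. Here `seqtab` is a hypothetical count matrix with samples as rows and features as columns, and the one-sample cutoff is just illustrative:

```r
# Hypothetical feature table: samples as rows, features (ASVs/OTUs) as columns
prev <- colSums(seqtab > 0)        # number of samples in which each feature appears
seqtab.filt <- seqtab[, prev > 1]  # drop features seen in only one sample, table-wide
```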

That can be done before or after decontam, as decontam also acts on a per-feature basis, so removing some features beforehand doesn't affect how the rest are classified.
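As a sketch of how the two steps compose (assuming `seqtab` as above and a hypothetical logical vector `is.neg` flagging the negative-control samples), either ordering gives the same result, since each feature is classified from its own counts alone:

```r
library(decontam)

# Classify contaminants by their prevalence in negative controls
contam <- isContaminant(seqtab, method = "prevalence", neg = is.neg)

# Remove contaminant features from the whole table
seqtab.clean <- seqtab[, !contam$contaminant]

# The prevalence filter shown above could equally be applied before this step.
```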

Low-prevalence filter thresholds are a bit arbitrary, but they won't bias downstream analyses of differential abundance. We are not suggesting setting low-frequency variants to zero on a per-sample basis; we are suggesting removing features that appear in very few samples from the table entirely.

Here is a paper that describes the basic rationale: Bourgon, Gentleman & Huber, "Independent filtering increases detection power for high-throughput experiments" (PNAS, 2010).

P.S. This is the perfect place for these questions. We use the GitHub issues forum as both an issue tracker and a broader user-support/discussion platform.