benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/
147 stars 25 forks source link

decontam removing ASVs of potential importance? #41

Open EmmaDietrich opened 5 years ago

EmmaDietrich commented 5 years ago

Hello, I have 6 negative controls across extraction batches for about 200 samples - I knew this might be too few negative controls for prevalence filtering in decontam to identify all potential contaminants, but I thought decontam would still be able to identify some of the major contaminants. The problem is, it seems some of the contaminants it's identifying are cross-contaminants of my negative controls from my samples of interest. I know this b/c they are ASVs which belong to obligately intracellular endosymbiotic bacteria (Ehrlichia, Wolbachia, and Spiroplasma, specifically) - these are bacteria that do not live outside of host cells, and different ASVs of these genera are also found in my specimens (I am looking at the bacterial communities of an invertebrate).

What I'm asking then, I guess, is would you recommend continuing with these cross-contaminant ASVs removed, given that they have relatively low prevalence in my samples? Or, can I justify adding them back to the dataset, while leaving out other bacteria removed by decontam? That doesn't feel right to me, but some of the ASVs removed by decontam are clear contaminants (e.g. Propionibacterium) that skew my results (some samples have low bacterial biomass so are affected more by contaminants), so I would like to keep them out of analyses. I have considered decreasing the prevalence threshold for identifying contaminants, but am unsure, then, what the threshold should be.

I think that probably the best option would be to follow the advice you give in this thread, which is to include all ASVs in analyses and address those that might be contaminants after running stats. But, curious if you all have any other ideas for getting around this issue.

Thanks for your help!

benjjneb commented 5 years ago

What I'm asking then, I guess, is would you recommend continuing with these cross-contaminant ASVs removed, given that they have relatively low prevalence in my samples? Or, can I justify adding them back to the dataset, while leaving out other bacteria removed by decontam?

Yes, that is absolutely justifiable. What you are doing is using expert knowledge about your specific system to interpret the results from a general purpose algorithm (decontam) that doesn't have any special knowledge about your experiment. Now, you should record and report what you do, but this is not some invalid thing you are suggesting.

I think that probably the best option would be to follow the advice you give in this thread, which is to include all ASVs in analyses and address those that might be contaminants after running stats.

This is also a useful strategy, particularly for interpreting taxa-level analyses such as differential abundance analysis (less useful for community-level analyses).

The problem is, it seems some of the contaminants it's identifying are cross-contaminants of my negative controls from my samples of interest. I know this b/c they are ASVs which belong to obligately intracellular endosymbiotic bacteria (Ehrlichia, Wolbachia, and Spiroplasma, specifically) - these are bacteria that do not live outside of host cells, and different ASVs of these genera are also found in my specimens (I am looking at the bacterial communities of an invertebrate).

Would you be willing to share this data with us to take a look at? decontam was designed in part to avoid calling cross-contaminants from real samples as contaminants, so I'm interested in taking a closer look at what sounds like a failure in that regards.

EmmaDietrich commented 5 years ago

I could share the data somehow - would you want files I use to make the phyloseq object? Or something else? I actually "combined" two projects to ask my question - but I can send you the files for one of the projects.

However, I'm thinking it's really just a matter of not having enough negative controls for my dataset (my problem, not a decontam problem). If, say, one Ehrlichia ASV got into a single negative control (n=6) through cross-contamination, but is prevalent in only twenty samples (n=200) b/c that bacteria isn't widespread throughout populations, then it's going to be "more prevalent" in the negative samples right (1/6 versus 1/10)? Or am I oversimplifying? That's why I was curious if you ever recommend changing the prevalence threshold from 0.5, to only identify ASVs that are present in, say 3/4 of negative controls.

I think the best way forward for me is to not remove any ASVs from statistical analysis, but instead to identify the potential contaminants using decontam, and use that information to inform qualitative analysis of results. I also may run analyses with and without potential contaminants removed to see if there are any large differences.

Thanks for your help!

benjjneb commented 5 years ago

I think the best way forward for me is to not remove any ASVs from statistical analysis, but instead to identify the potential contaminants using decontam, and use that information to inform qualitative analysis of results. I also may run analyses with and without potential contaminants removed to see if there are any large differences.

That sounds like a good plan to me!

I could share the data somehow - would you want files I use to make the phyloseq object? Or something else? I actually "combined" two projects to ask my question - but I can send you the files for one of the projects.

Just the phyloseq object itself would work, perhaps saved as an RDS object (saveRDS).

However, I'm thinking it's really just a matter of not having enough negative controls for my dataset (my problem, not a decontam problem). If, say, one Ehrlichia ASV got into a single negative control (n=6) through cross-contamination, but is prevalent in only twenty samples (n=200) b/c that bacteria isn't widespread throughout populations, then it's going to be "more prevalent" in the negative samples right (1/6 versus 1/10)? Or am I oversimplifying?

The prevalence method should be accounting for the higher noise in a lower number of controls and cautiously not calling that example ASV a contaminant. Which is why I'm curious about what exactly is going on in this case.

That's why I was curious if you ever recommend changing the prevalence threshold from 0.5, to only identify ASVs that are present in, say 3/4 of negative controls.

Our general recommendation is to plot a histogram of the scores (e.g. hist(contamdf$p, 100)) and strongly consider adjusting the classification threshold to capture the low-score mode as precisely as possible. I wonder, could you post the histogram of scores in your case? That might tell quite a bit on its own.

EmmaDietrich commented 5 years ago

Here is the code and histogram for one of my simpler datasets with 4 negative samples and 197 non-negative samples. The mode is ~0.5. This is after removing some things like any ASVs identified as chloroplast dna, but this is before filtering out ASVs with low prevalence (i.e. present in <1% of samples).

sample_data(phy_WC)$is.neg <- sample_data(phy_WC)$SpecimenType == "Negative" contamdf.prev_wc <- isContaminant(phy_WC, method="prevalence", neg="is.neg",threshold=0.5) hist(contamdf.prev_wc$p, 100, ylim = c(0,400), xlim = c(0,1))

histogram_for_decontam

At threshold = 0.5, I find that 74 ASVs are identified as contaminants and 8519 ASVs not as contaminants. One of the listed contaminants is an Ehrlichia sp. which doesn't live outside of cells, and is what made me concerned I might be doing something wrong. I've attached the RDS as a .zip, where the negative samples are "Negative" under the metadata category SpecimenType. All negative controls did not test positive for bacterial presence using PCR screens of the 16s rRNA gene, and had essentially non-existent amounts of DNA using Qubit to measure DNA concentrations, but library prep and sequencing were completed somewhere else.

for_decontam.zip .

benjjneb commented 5 years ago

That histogram is very useful. For me, when I look at that, I see a clear low score mode at scores from 0.0 to 0.1. Hence, typically I would set the threshold at 0.1 or maybe 0.2. A threshold of 0.5 seems aggressive given this distribution, and the modest number of negative controls you have.

However, carrying these scores through as a post hoc check on your final analysis as you previously suggested is also a good analysis choice.

EmmaDietrich commented 5 years ago

I see, I totally missed the "low-score" part of your last response. Changing the threshold to 0.1 identifies taxa that have been shown in other studies to be common contaminants, but doesn't remove taxa I think are in negatives b/c of cross-contamination, so that's great! I'm also glad to have a better understanding of how to choose the prevalence threshold. Thanks so much for all your help!

LinaMaMar commented 5 years ago

Hello, I still have doubts on how to make the interpretation of the histogram and set the value for the threshold. When the histogram shows a clear bimodality is easier but when there are more peaks present, I am not sure on how to proceed. Could you help me with this, please? This is what I got: Histogram of Physeq2

Thank you so much

benjjneb commented 5 years ago

@LinaMaMar

The question I would ask here is whether you value more highly making sure contaminants are removed from your data, or you only want to be sure to remove taxa that are definitely contaminants.

In this plot, I see a clear non-contaminant mode from ~0.7-1. Below that, there is a very low score mode at <0.1 which I would call definite contaminants, and then what seem likely to be contaminants up to about 0.4. I think setting your threshold at 0.1, 0.3, 0.4 or 0.5 would all be valid choices based on this plot. I would make my decision based on the question I posed above, choosing a higher threshold (more aggressive contaminant removal) if that is what I prioritized.

An alternative approach is to identify and inspect several taxa at those intermediate 0.2-0.4 scores. Do they seem out of place in the environment you are sampling? Do they overlap somewhat with previously observed contaminants? Then use that information to guide your threshold choice.

Hope that helps!

LinaMaMar commented 5 years ago

Thank you so much for your input, it was very useful. Yes, I am prioritizing the removal of contaminants so my approach is being more aggressive with the selection of the threshold. I checked the list of ASV removed at different thresholds and even at 0.4-0.5 most of the selected ASVs, are more likely to be contaminants, so it seems that 0.5 is a good choice. Thank you again for your explanation.