benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/
147 stars 25 forks source link

Decontam with extremely low biomass samples #50

Open liisaveerus opened 5 years ago

liisaveerus commented 5 years ago

Hi Benjamin,

First of all, I love the idea of statistically removing contaminants and in general, the promotion of including controls in microbial studies.

I am struggling a bit with a very low biomass data set. I have tissue and lavage samples from all reproductive organs of an avian host. The more internal the organ, the less bacterial diversity and abundance we expect, which is the main point we want to make with our study. Thus, I have collected several different negative controls to make sure that the species we detect are biologically present, not due to contamination. The controls include DNA extraction kit controls, PCR controls etc.

Unfortunately, I do not have precise bacterial DNA quantification results due to high host DNA contamination and must thus rely on prevalence data.

I find that with a more stringent threshold (=0.5), 211 ASVs out of 7886 ASVs are detected as contaminants. For comparison, if I excluded all ASVs that are present in control samples at least once, I would have 594 contaminant ASVs.

I am currently torn between how to clean my data set before data analysis. Should I remove all control-associated ASVs (means I will probably also remove some biologically relevant ASVs, but at least what remains is "definitely there") or rely on a set threshold (would probably do a better job at separating contaminants, but the worry remains that I will find false positives in the upper organs)? I am fairly certain that many upper organ samples have mostly contaminants present, which may falsely categorise some ASVs as truly biologically present, especially as half of my data set consists of these upper organ samples. Again, for comparison, if I were to remove all control-associated ASVs, quite a lot of my sequence reads would be lost, but all samples (even the upper organs) would still have reads and ASVs present in them.

Prevalence data

Thanks in advance!

benjjneb commented 5 years ago

Should I remove all control-associated ASVs (means I will probably also remove some biologically relevant ASVs, but at least what remains is "definitely there") or rely on a set threshold (would probably do a better job at separating contaminants, but the worry remains that I will find false positives in the upper organs)?

In general, I would be wary of removing all ASVs in your control samples because of the possibility of cross-contamination from true samples into negative controls. The rate at which this happens varies a lot, so it may be minor in your data, but the issue is that the most abundant true ASVs are the most likely to be cross-contaminated, and thus be lost due to blanket removal of all ASVs in neg controls.

Looking at the plot you've shown, I at least would be comfortable with setting a threshold close to the on you've chosen for the plot (0.5?). In that plot there are two clear "arms", representing ASVs found in roughly equal proportions in neg controls, and those found at much higher prevalence in positive controls. That is pretty well picked out by the threshold you've set, although given your priors about there being high contamination in this dataset, I might push the threshold just a bit higher (0.6?) to also flag those "red" points just above the x~y line.

But that's tweaking at this point, from what you've shown I think you are already doing a solid job on this.

liisaveerus commented 5 years ago

Thanks for the validation!

I also tried threshold =0.6 and that identified 869 ASVs, which confuses me a bit, as in total there are 594 ASVs that are present =>1 in negative controls. Why is the identified contaminant counter bigger than that?

I also tried other thresholds between 0.5 and 0.6. E.g. 0.57 identifies 581 contaminants and the "arms" are slightly clearer.

0 57

I am starting to think that anything beyond threshold =0.5 is just overtweaking for perfection and I should stick with it. Thanks again, this has been very helpful!

benjjneb commented 5 years ago

I also tried threshold =0.6 and that identified 869 ASVs, which confuses me a bit, as in total there are 594 ASVs that are present =>1 in negative controls. Why is the identified contaminant counter bigger than that?

Once you go over a threshold of 0.5, you are in effect asking that ASVs have enough statistial evidence to "prove" they are not contaminants, i.e. you've flipped the burden of proof. When you push the threshold ot 0.6, you are probably removing a number of ASVs that are present in just one sample, a non-control sample, but that is no longer enough evidence to demonstrate they are not contaminants at this >0.5 threshold.

E.g. 0.57 identifies 581 contaminants and the "arms" are slightly clearer.

I think it makes a lot of sense.

I am starting to think that anything beyond threshold =0.5 is just overtweaking for perfection and I should stick with it.

Maybe, but taking a closer "eye test" look at things is not a bad thing. There is almost always a range of "right" answers. The important thing is to do what you are doing, look critically at the results of what the methods you are using are doing, and to avoid the "wrong" answers when things break but you might not otherwise notice. Your process here is right on.

liisaveerus commented 5 years ago

Thanks so much, you have been very generous with your time!

My instincts are telling me to go with =0.57, but explaining such arbitrary thresholds to other biologists is always fun (does not mean it should not be done!) ;)

One last question: do you have any recommendations for further data analysis? I imagine phyloseq for visuals, but what about statistics? vegan? Thank you!

benjjneb commented 5 years ago

phyloseq and vegan are both good choices. phyloseq has a number of analysis examples online as well: https://joey711.github.io/phyloseq/tutorials-index.html

You could also consider our F1000Research paper "Bioconductor workflow for microbiome data analysis: from raw reads to community analyses" for potential inspiration: https://f1000research.com/articles/5-1492/v2