benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

"Biological ASVs" identified as contaminants / "Contaminant ASVs" not identified as contaminants #117

Open tellafiela opened 2 years ago

tellafiela commented 2 years ago

We have sequenced nasopharyngeal (NP) specimens from infants, which are generally low-biomass specimens. We included non-template controls (NTCs) in each run. In total, 128 NTCs and 708 biological samples were included. Decontam was run using the combined method with the threshold set to 0.4.
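For readers unfamiliar with what a prevalence-style score measures, here is a minimal sketch. It is not decontam's code: it substitutes a one-sided Fisher's exact test on the presence/absence table of controls versus true samples, and the function name `prevalence_score` and the presence counts are invented for illustration. Only the group sizes (128 NTCs, 708 samples) come from the post above. As with decontam's scores, lower values are more contaminant-like.

```python
from math import comb


def prevalence_score(ctrl_present, n_ctrl, samp_present, n_samp):
    """One-sided Fisher's exact p-value for enrichment in controls.

    Returns P(X >= ctrl_present), where X is the number of controls in which
    the ASV would be present if presence were unrelated to sample type
    (hypergeometric null). Small values mean the ASV is unusually prevalent
    in controls. This is an illustrative stand-in, not decontam's statistic.
    """
    total = n_ctrl + n_samp
    present = ctrl_present + samp_present
    denom = comb(total, n_ctrl)
    upper = min(present, n_ctrl)
    numer = sum(
        comb(present, k) * comb(total - present, n_ctrl - k)
        for k in range(ctrl_present, upper + 1)
    )
    return numer / denom


# Hypothetical presence counts for two ASVs (not taken from the dataset).
contaminant_like = prevalence_score(ctrl_present=60, n_ctrl=128,
                                    samp_present=30, n_samp=708)
biological_like = prevalence_score(ctrl_present=2, n_ctrl=128,
                                   samp_present=600, n_samp=708)
print(f"mostly in controls: {contaminant_like:.3g}")
print(f"mostly in samples:  {biological_like:.3g}")
```

An ASV that appears at similar rates in NTCs and in samples lands in between, which is where the choice of threshold starts to matter.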

A previously published paper from our group (https://bmcmicrobiol.biomedcentral.com/articles/10.1186/s12866-020-01795-7) showed that ASVs from the genera Noviherbaspirillum and Massilia are potential contaminants in this low biomass dataset of NP specimens.

Our current analysis, which uses a different set of NP specimens from the same study, does not flag numerous ASVs belonging to these genera (and others) as contaminants, which concerns us. In addition, if we set the threshold slightly higher, we start flagging as contaminants ASVs that we know are of biological importance in this dataset (e.g. ASV5: Staphylococcus genus).

We have tried the following:

  1. All three methods (combined, prevalence and frequency), with thresholds ranging from 0.4 to 0.9 (full dataset: 128 NTCs, 708 NP specimens).
  2. Removing samples and NTCs with very low copy numbers (<10, <20, <30, <50) from the dataset before repeating step 1.
  3. Removing biological samples flagged as "problematic" (low copy numbers, low read counts, or collected in the first 2 weeks of life, as per our publication referred to above) before repeating step 1 (128 NTCs, 681 NP specimens).
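The threshold sweep in step 1 can be sketched as follows. The scores are invented to mimic the pattern described in this post and are not output from the dataset; the only behavior assumed from decontam is that an ASV is called a contaminant when its score falls below the threshold. The helper names `flagged_at` and `sweep` are hypothetical.

```python
# Hypothetical decontam-style scores (lower = more contaminant-like).
# These values are made up for illustration only.
scores = {"ASV1": 0.97, "ASV5": 0.62, "ASV24": 0.55, "ASV26": 0.48, "ASV39": 0.35}


def flagged_at(scores, threshold):
    """Return the ASVs whose score is strictly below the threshold."""
    return sorted(asv for asv, p in scores.items() if p < threshold)


def sweep(scores, thresholds):
    """Map each threshold to the list of ASVs it would flag."""
    return {t: flagged_at(scores, t) for t in thresholds}


result = sweep(scores, [0.4, 0.5, 0.6, 0.7])
for t, asvs in result.items():
    print(f"threshold={t}: {asvs}")
```

Laying the flagged sets side by side like this shows the point at which raising the threshold stops catching only the suspected contaminants and starts catching ASVs believed to be biological.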

Yet none of these approaches has improved our output. We are still left with unflagged Noviherbaspirillum and Massilia ASVs that have low ASV numbers (e.g. ASV24, ASV26, ASV39, ASV41). As soon as the threshold is increased, we start flagging biological ASVs (e.g. ASV5: Staphylococcus).

I am attaching a few plots for ASV1, ASV5 (biological ASVs) and ASV24, ASV26 (potential contaminant ASVs). The x-axis shows relative abundances of each ASV for NTCs and biological samples with y-axes showing raw copy numbers and log copy numbers.

Any help on this issue would be much appreciated.

Shantelle Claassen-Weitz

Attachment: Decontam issues.pdf

benjjneb commented 2 years ago

Hi Shantelle, I don't know if I can give you a specific "solution". I think you are doing the right thing in critically looking at the output that decontam is giving and comparing it to your expert knowledge in this area. An optimal solution might be a combination of using an automated tool like decontam and that domain knowledge.

decontam is useful, but not perfect, and its prediction accuracy declines as you enter the "extremely" low biomass realm of sample types, in which the number of contaminant reads in a sample approaches parity with (or exceeds) the number of real reads. Is it possible that you are in this regime for these samples, or at least for some of the very early life samples?
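One rough way to check whether a sample is in that regime is to compute the share of its reads assigned to suspected contaminant ASVs. The sketch below assumes per-sample read counts keyed by ASV; the sample IDs, counts, suspect list, and the 0.5 "parity" cutoff are all illustrative assumptions, not values from this dataset.

```python
def contaminant_fraction(counts, suspects):
    """Fraction of a sample's reads assigned to suspected contaminant ASVs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for asv, n in counts.items() if asv in suspects) / total


# Hypothetical read counts per sample (ASV -> reads).
samples = {
    "NP_001": {"ASV1": 900, "ASV5": 80, "ASV24": 20},
    "NP_002": {"ASV1": 30, "ASV24": 40, "ASV26": 30},
}
# ASVs suspected a priori, e.g. from earlier work on the same sample type.
suspects = {"ASV24", "ASV26"}

fractions = {sid: contaminant_fraction(c, suspects) for sid, c in samples.items()}
# Samples where suspected contaminant reads reach or exceed half of all reads.
near_parity = sorted(sid for sid, f in fractions.items() if f >= 0.5)
print(fractions)
print(near_parity)
```

The result depends entirely on the suspect list being trustworthy, so it is a sanity check on the regime rather than a classifier.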

As for specifics, I would not typically use a threshold higher than 0.5 (higher thresholds are more aggressive in flagging contaminants). Keeping track of the abundance and prevalence of the ASVs being (mis)classified is also worth doing. In general, decontam seems more accurate on higher abundance/prevalence ASVs (see the manuscript for more).
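A minimal sketch of that bookkeeping, assuming you have per-sample read counts and a score for each ASV. The data, the `summarize` helper, and its field names are hypothetical; the only decontam behavior assumed is that a score below the threshold means "contaminant".

```python
def summarize(samples, scores, threshold=0.5):
    """Tabulate prevalence and mean relative abundance next to each ASV's call.

    Mean relative abundance is taken over the samples in which the ASV is present.
    """
    totals = {sid: sum(c.values()) for sid, c in samples.items()}
    rows = []
    for asv, score in sorted(scores.items()):
        present = [sid for sid, c in samples.items() if c.get(asv, 0) > 0]
        rel = [samples[sid][asv] / totals[sid] for sid in present]
        rows.append({
            "asv": asv,
            "score": score,
            "flagged": score < threshold,
            "prevalence": len(present) / len(samples),
            "mean_rel_abundance": sum(rel) / len(rel) if rel else 0.0,
        })
    return rows


# Hypothetical read counts and scores, for illustration only.
samples = {
    "S1": {"ASV1": 80, "ASV24": 20},
    "S2": {"ASV1": 50, "ASV5": 50},
    "S3": {"ASV1": 100},
    "S4": {"ASV5": 25, "ASV24": 75},
}
scores = {"ASV1": 0.9, "ASV5": 0.6, "ASV24": 0.3}

table = summarize(samples, scores)
for row in table:
    print(row)
```

Sorting or filtering this table by prevalence and abundance makes it easier to see whether the questionable calls are concentrated among rare, low-abundance ASVs.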

Hope that helps some. It sounds like you are doing the right thing. Low biomass microbiome sampling is not easy to do accurately!

tellafiela commented 2 years ago

Dear Benjamin,

Thank you for your feedback. We will continue to critically look at the decontam output and compare with our knowledge in this area.

I will let you know if I have any additional questions regarding this topic.

Kind regards Shantelle

Dr. Shantelle Claassen-Weitz (PhD Med Microbiology) Department of Pathology Division of Medical Microbiology Falmouth Building, 5th Floor, Room 5.27 Faculty of Health Sciences, University of Cape Town Anzio Road, Observatory, 7925 074 521 4909 / 021 406 6224
