benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/

Results vary with sample size #140

Open achald7867 opened 10 months ago

achald7867 commented 10 months ago

Dear Developer,

Firstly, thank you for this fantastic R package. I am working with low microbial biomass samples from the nasopharynx. We are using whole metagenomic sequencing data to characterize the microbiome and resistome (ARG) profile of these nasopharyngeal samples. We have spiked all our samples with the mock community, including Imtechella_halotolerans, Truepera_radiovictrix, and Allobacillus_halotolerans.

I have used MetaPhlan to characterize the microbiome profile and we have lowered the cut-off so that we can identify these three spike-in species. Then, we attempted to remove contaminants using the decontam package. I have some questions regarding this process:

  1. In total, we have 344 samples that were processed and sequenced together. However, we need to subset the data to analyze specific groups of interest. When I ran the analysis on the relative frequency tables, the results varied. However, the three spike-ins were identified in both sets of data. So, my question is whether I should run this analysis on all samples or only on the subset?

  2. Should I choose a p-value cut-off (default 0.1) based on the p-values of the three spike-ins? In other words, should I remove any contaminant with a p-value lower than the p-values of the three spike-ins? Is this approach too conservative?

  3. Can I apply decontam to the resistome frequency table, which contains relative abundance information of antibiotic resistance genes across samples?

Thank you for your assistance.

Best regards, Dr. Achal Dhariwal

benjjneb commented 10 months ago

So, my question is whether I should run this analysis on all samples or only on the subset?

I'm assuming the analysis you are referring to is decontam.

In general, decontam does better with more data. However, there is an assumption that the underlying sample types and contamination processes are consistent. I don't know how you are subsetting the data, but if your larger dataset is all data of the same sample type, collected with a consistent measurement methodology, then applying decontam to the whole dataset is probably better. You can also take a look at the batch option in isContaminant if there are batches within your data that were processed differently.
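As a minimal sketch of the call described above: `otu_mat` (a samples-by-features frequency matrix), `dna_conc` (per-sample DNA concentrations), and `seq_run` (a per-sample batch factor) are hypothetical objects standing in for your own data.

```r
library(decontam)

# Run frequency-based contaminant identification on the full dataset,
# scoring contaminants separately within each sequencing batch.
contam <- isContaminant(otu_mat,
                        conc   = dna_conc,     # quantitative DNA concentrations
                        method = "frequency",  # frequency-based classification
                        batch  = seq_run)      # optional: per-batch identification

head(contam)  # data.frame with $p (score) and $contaminant (logical) columns
```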

Should I choose a p-value cut-off (default 0.1) based on the p-values of the three spike-ins? In other words, should I remove any contaminant with a p-value lower than the p-values of the three spike-ins? Is this approach too conservative?

Setting the score threshold to classify contaminants depends on the characteristics of the dataset and the goals of your study. I would recommend inspecting the histogram of decontam scores as a first step. The basics are described in the decontam paper (see the "Choice of classification threshold" section), and a more in-depth exploration in R is available in the GitHub repo associated with that manuscript; see especially the oral data analysis: https://github.com/benjjneb/DecontamManuscript

Take a look at those results. If there is clear bimodality (low-scoring contaminants, mid-to-high-scoring real ASVs), then threshold choice is often straightforward. If it isn't that clean, think about what's more important for your study: reducing contamination in your data, or not inadvertently removing real taxa.
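The histogram inspection suggested above can be sketched as follows, assuming `contam` is the data.frame returned by isContaminant() (its `$p` column holds the decontam score for each feature):

```r
# Plot the distribution of decontam scores to look for a low-score mode.
hist(contam$p, breaks = 100,
     xlab = "decontam score",
     main = "Distribution of decontam scores")
abline(v = 0.1, col = "red", lty = 2)  # default classification threshold
```

A clear mode to the left of the red line suggests a set of contaminants that the default threshold will separate cleanly.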

Can I apply decontam to the resistome frequency table, which contains relative abundance information of antibiotic resistance genes across samples?

Yes.
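A hedged sketch of applying the same call to a resistome table: `arg_mat` (a samples-by-genes abundance matrix) and `dna_conc` are hypothetical objects standing in for your data.

```r
# The frequency method works on any samples-x-features table,
# including an ARG (resistome) abundance matrix.
arg_contam <- isContaminant(as.matrix(arg_mat),
                            conc   = dna_conc,
                            method = "frequency")

table(arg_contam$contaminant)  # how many ARGs were flagged as contaminants
```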

achald7867 commented 8 months ago

Dear Dr. Benjamin, Thank you so much for your reply.

  1. Yes, the dataset is all data of the same sample type and uses a consistent measurement methodology. So I have applied decontam on the whole dataset.

  2. As you have mentioned, the threshold depends on the data characteristics and the goals of the study. As suggested, I have made the histogram, and based on it I am sticking with the default threshold (frequency-based method) of 0.1. Could you please confirm whether this is OK? (The figure is attached.)

  3. Lastly, I have identified contaminants as species whose frequencies were inversely proportional to, or independent of, the DNA concentration. Interestingly, the three species we identified as contaminants were the spike-in species. However, some of the identified contaminant species had very low prevalence (present in only 2 samples); can those be considered contaminants? (Figure is attached.)

decontam_results_boxplot_microbiome.pdf Microbiome_histogram.pdf

Looking forward to hearing from you.

Dr. Achal Dhariwal

benjjneb commented 7 months ago

Sorry for the late response; this got lost on me over the holidays.

As you have mentioned, the threshold depends on the data characteristics and the goals of the study. As suggested, I have made the histogram, and based on it I am sticking with the default threshold (frequency-based method) of 0.1. Could you please confirm whether this is OK? (The figure is attached.)

Looks appropriate to me. There seems to be a low-score mode, and that threshold is appropriate for separating out that low-score mode.

Lastly, I have identified contaminants as species whose frequencies were inversely proportional to, or independent of, the DNA concentration. Interestingly, the three species we identified as contaminants were the spike-in species. However, some of the identified contaminant species had very low prevalence (present in only 2 samples); can those be considered contaminants? (Figure is attached.)

"The three species we identified as contaminants were the spike-in species." That is a good sign! I'm assuming you were spiking those species in at a constant concentration, and that is the same signal decontam looks for to identify contaminants (a constant background concentration). The other contaminants identified from just 2-3 samples are questionable -- those calls can also happen by accident. But overall, what I am hearing is that your data looks pretty clean, contaminant-wise.

achald7867 commented 7 months ago

Dear Benjamin, Thank you so much for your time and response. 😊



achald7867 commented 7 months ago

Dear Benjamin, Hi again. I have a basic question regarding the data required for the isContaminant function.

In the package documentation, it is mentioned that for frequency-based contaminant identification, one of the types of auxiliary data needed is quantitative DNA concentrations for each sample. “Typically, these concentrations are obtained during amplicon or shotgun sequencing library preparation, often in the form of standardized fluorescence intensity (e.g., PicoGreen)”.

After library preparation, we follow a protocol where equimolar concentrations are prepared for all samples (https://doi.org/10.1016/j.diagmicrobio.2019.04.014). These libraries are then submitted to the sequencing service. My question is, should we use the concentration of the prepared DNA library before or after preparing an equimolar pool, where each sample has the same concentration?

benjjneb commented 7 months ago

should we use the concentration of the prepared DNA library before or after preparing an equimolar pool, where each sample has the same concentration?

before
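That is, pass the per-sample library concentrations measured before equimolar pooling; the post-pooling values are identical across samples and carry no signal for the frequency method. A minimal sketch, where `otu_mat` and `prepool_conc` (a numeric vector in sample order) are assumed objects:

```r
# Use the pre-pooling library concentrations as the conc argument.
contam <- isContaminant(otu_mat,
                        conc   = prepool_conc,
                        method = "frequency")
```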