Using decontam to adjust taxonomic abundances based on taxa present in negative control

smalanmul commented 6 years ago

I am new to the field of gut microbiome data analysis. I did 16S rRNA seq (V4 region) for DNA extracted from human stool samples. I saw that there were close to 24 000 reads present in my negative control. I want to know if I can use decontam on my dataset to rectify this contamination issue. I suspect it comes from the library prep, seeing as the Bioanalyzer showed 0 ng/ul DNA present in my neg ctrl sample, and the composition of the neg ctrl is not typical of stool microbiota. The most abundant taxa were Sediminibacterium (12 487 reads) and Phyllobacterium (3 637 reads), but there are also taxa that are usually quite abundant in gut microbiome data sets, such as Prevotella and Bacteroides, that were present in this negative sample. Therefore, I do not merely want to remove these taxa from all samples, as this will not be an accurate representation of the true gut microbial composition. I was wondering whether decontam addresses this problem, from the paper on bioRxiv it seems like it does, but I am unsure of the script performing this correction/removal step. I went through the tutorial online, but I am unsure of what script to use to do the actual removal/counts correction based on taxa present in the negative control. I would appreciate any feedback, thanks.

benjjneb commented 6 years ago

Hi, In the tutorial the line used to remove the taxa identified as contaminants was this one:

ps.noncontam <- prune_taxa(!contamdf.freq$contaminant, ps)

You would use basically the same command after running prevalence-based identification on your phyloseq object. If you are working directly with the sequence table as a matrix, you can just remove the columns that were identified as contaminants, e.g. seqtab.noncontam <- seqtab[, !contamdf$contaminant]
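As a minimal sketch of the matrix route, the base R snippet below uses a made-up `seqtab` and a hand-built `contamdf` that stands in for the data frame returned by `isContaminant()`. The taxon names, counts and flags are illustrative assumptions, and the row names of `contamdf` are assumed to match the column names of `seqtab`.

```r
# Toy sequence table: samples in rows, taxa in columns (all counts are made up).
seqtab <- matrix(
  c(120,  0, 30,
     95,  2, 41,
      0, 57,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("stool1", "stool2", "negctrl"),
                  c("Bacteroides", "Sediminibacterium", "Prevotella"))
)

# Hand-built stand-in for the isContaminant() result (an assumption for this
# sketch): one row per taxon, with a logical `contaminant` column.
contamdf <- data.frame(
  contaminant = c(FALSE, TRUE, FALSE),
  row.names = colnames(seqtab)
)

# Guard against the two tables listing taxa in a different order.
stopifnot(identical(rownames(contamdf), colnames(seqtab)))

# Keep only the columns (taxa) that were not flagged; drop = FALSE keeps the
# result a matrix even if a single taxon remains.
seqtab.noncontam <- seqtab[, !contamdf$contaminant, drop = FALSE]
print(colnames(seqtab.noncontam))
```

The counts of the retained taxa are left unchanged; only the flagged columns are dropped.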

> The most abundant taxa were Sediminibacterium (12 487 reads) and Phyllobacterium (3 637 reads), but there are also taxa that are usually quite abundant in gut microbiome data sets, such as Prevotella and Bacteroides, that were present in this negative sample. Therefore, I do not merely want to remove these taxa from all samples, as this will not be an accurate representation of the true gut microbial composition. I was wondering whether decontam addresses this problem,

Absolutely, this was one of the motivations for creating decontam. The prevalence method will not flag taxa as contaminants just because they are in your negative controls. They have to be more prevalent in the negative controls than in the true samples to get flagged (i.e. present in a higher fraction of samples). Thus abundant taxa that reached the negative control by cross-contamination from true samples shouldn't be flagged and won't be removed.
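To make the prevalence idea concrete, here is a small base R illustration on made-up presence/absence data. It only compares the fraction of samples in which each taxon is detected. decontam itself scores this comparison statistically and applies a threshold, so treat this as intuition, not as a reimplementation.

```r
# Toy design (made up): 6 stool samples followed by 3 negative controls.
is.neg <- c(rep(FALSE, 6), rep(TRUE, 3))

# Presence (1) or absence (0) of two taxa in each of the 9 samples.
present <- cbind(
  Bacteroides       = c(1, 1, 1, 1, 1, 1,  1, 0, 0),
  Sediminibacterium = c(0, 1, 0, 0, 0, 0,  1, 1, 1)
) > 0

# Fraction of samples in each group in which the taxon was detected.
prev.neg  <- colMeans(present[is.neg, , drop = FALSE])
prev.true <- colMeans(present[!is.neg, , drop = FALSE])
print(rbind(prev.neg, prev.true))

# Only taxa that are MORE prevalent in the controls are candidate contaminants.
candidate <- prev.neg > prev.true
print(candidate)
```

In this toy data Bacteroides shows up in one negative control but in every stool sample, so it is not a candidate, whereas Sediminibacterium is detected in every control and only one stool sample.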

smalanmul commented 6 years ago

Thanks very much, this is very helpful! A colleague recently raised a concern about using decontam: sequencing data is compositional, so the proportion of reads from contaminants also affects the number of reads of other taxa. There probably isn't any way of getting around that particular issue, is there?

benjjneb commented 6 years ago

Compositionality is intrinsic to how the prevalence method works. That is, as the number of true sequences increases, the proportion of contaminants declines, often until it is below the detection limit. That is why negative controls have a higher prevalence of contaminants.

For the frequency method, in the limit for which the frequency method works best (C << S) compositional effects can be largely ignored. Furthermore, the model comparison framework we are using is fairly robust to deviations from the strict quantitative model -- that is even if there is some perturbation from a 1/T frequency dependence, it is still likely that the comparison between a 1/T model and a ~1 model will identify the contaminant correctly.
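The comparison between a 1/T model and a ~1 model can be sketched in base R as two fits on the log scale, one with the slope against log(T) fixed at -1 and one with it fixed at 0. The concentrations, frequencies and noise values below are made up, and comparing residual sums of squares is a simplification chosen for this sketch, not decontam's exact scoring.

```r
# Hypothetical total DNA concentrations (T) for six samples.
conc <- c(2, 5, 10, 20, 50, 100)
# Fixed, made-up noise on the log scale (no random numbers, so results repeat).
noise <- c(0.05, -0.08, 0.02, 0.10, -0.04, -0.03)

freq.contam <- (0.5 / conc) * exp(noise)  # frequency falls off as 1/T
freq.true   <- 0.2 * exp(noise)           # frequency independent of T

# Residual sum of squares of each model, fitted to log(frequency).
model.rss <- function(freq, conc) {
  lf <- log(freq)
  fit.contam <- lm(lf ~ 1 + offset(-log(conc)))  # slope fixed at -1 (1/T model)
  fit.true   <- lm(lf ~ 1)                       # slope fixed at 0 (~1 model)
  c(contaminant    = sum(residuals(fit.contam)^2),
    noncontaminant = sum(residuals(fit.true)^2))
}

rss.contam <- model.rss(freq.contam, conc)
rss.true   <- model.rss(freq.true, conc)
print(rbind(rss.contam, rss.true))
```

The model with the smaller residual sum of squares describes the taxon better. Because log(T) spans a wide range here, a moderate perturbation of the frequencies does not change which model wins, which is the robustness point made above.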

smalanmul commented 6 years ago

Hi

So do you think I should rather use the prevalence method for my data?

Kind regards Stefanie


smalanmul commented 6 years ago

Hi

So based on what you stated above (The prevalence method will not flag taxa as contaminants just because they are in your negative controls. They have to be more prevalent in the negative controls than in the true sample to get flagged (i.e. present in a higher fraction of samples). Thus those abundant and true cross-contaminants shouldn't be flagged and won't be removed.) - should I rather use the prevalence method? Then for the prevalence method - how do I remove the contaminating taxa from the phyloseq object? Is it similar to the frequency method, thus based on tutorial: ps.noncontam <- prune_taxa(!contamdf.prev05$contaminant, ps)

Thanks for your help!

benjjneb commented 6 years ago

> So based on what you stated above (The prevalence method will not flag taxa as contaminants just because they are in your negative controls. They have to be more prevalent in the negative controls than in the true sample to get flagged (i.e. present in a higher fraction of samples). Thus those abundant and true cross-contaminants shouldn't be flagged and won't be removed.) - should I rather use the prevalence method?

Sure. I think the frequency method would also work for your data, but the prevalence method is fine on its own.

> Then for the prevalence method - how do I remove the contaminating taxa from the phyloseq object? Is it similar to the frequency method, thus based on tutorial: ps.noncontam <- prune_taxa(!contamdf.prev05$contaminant, ps)

Yep, will work just the same.
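Putting the pieces together, the prevalence-based workflow on a phyloseq object might look like the sketch below. It assumes the phyloseq and decontam packages are installed. The count table is made up, and the `Sample_or_Control` column and its `"Control Sample"` label follow the tutorial's naming, so adapt them to your own metadata.

```r
library(phyloseq)
library(decontam)

# Toy count table (made up): 6 stool samples and 3 negative controls in rows.
counts <- matrix(
  c(500, 300,   0,
    450, 280,   3,
    520,   0,   0,
    480, 310,   0,
    510, 290,   0,
    470, 305,   0,
     40,  15, 900,
      0,   0, 800,
      0,   0, 850),
  ncol = 3, byrow = TRUE,
  dimnames = list(c(paste0("stool", 1:6), paste0("neg", 1:3)),
                  c("Bacteroides", "Prevotella", "Sediminibacterium"))
)

# Sample metadata; row names must match the sample names in `counts`.
meta <- data.frame(
  Sample_or_Control = c(rep("True Sample", 6), rep("Control Sample", 3)),
  row.names = rownames(counts)
)

ps <- phyloseq(otu_table(counts, taxa_are_rows = FALSE), sample_data(meta))

# Mark the negative controls, then run prevalence-based identification.
sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control Sample"
contamdf.prev <- isContaminant(ps, method = "prevalence", neg = "is.neg")
print(contamdf.prev)

# Remove the flagged taxa, exactly as for the frequency method.
ps.noncontam <- prune_taxa(!contamdf.prev$contaminant, ps)
print(ps.noncontam)
```

To reproduce the tutorial's `contamdf.prev05`, pass `threshold = 0.5` to `isContaminant()` and use that data frame in the `prune_taxa()` call.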

benjjneb / decontam

Using decontam to adjust taxonomic abundances based on taxa present in negative control #23