Bhamal commented 6 years ago

I have 16S rRNA miseq amplicon data for 96 samples. Among the samples I have 10 control samples (includes duplicates). My goal is to identify contaminants using the control samples. For that purpose, I would like to see most abundant taxa present in Control samples. I tried following 2 approaches;

Approach 1 1) Create phyloseq object with dada2 output ps <- phyloseq(otu_table(seqtab.nochim.all,taxa_are_rows=FALSE),sample_data(samdf),tax_table(taxa))

ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu)) # transform into relative abundance

2) Generate abundance matrix at genes level

ps.prop.genus <- tax_glom(ps.prop, "Genus") otu.genus.prop = t(as(otu_table(ps.prop.genus), "matrix")) # Extract abundance matrix genus.names.prop <- names(table(tax_table(ps.prop.genus)[, "Genus"], exclude = NULL)) rownames(otu.genus.prop) <- genus.names

3) Subset data for only 2 of the control samples (duplicates)

otu.genus.prop.control <- otu.genus.prop[,87:88] # subset of control samples rowsum.prop <- rowSums(otu.genus.prop.control) ind.zero.prop <- which(rowsum.prop==0) otu.genus.prop.control <- otu.genus.prop.control[-ind.zero.prop,]

4) “otu.genus.prop.control” shows that the most abundant genus are:

                    control1     control2

Achromobacter 0.18 0.17 Acidibacter 0.75 0.75

Approach 2 1) Create phyloseq object only with control samples ps1 <- phyloseq(otu_table(seqtab.nochim.all,taxa_are_rows=FALSE),sample_data(samdf1),tax_table(taxa.all))

samdf1 is a data frame that contains only 2 control samples

library(ggplot2) top100 <- names(sort(taxa_sums(ps1), decreasing=TRUE))[1:100] # using top 100 sequences ps1.top100 <- transform_sample_counts(ps1, function(OTU) OTU/sum(OTU)) ps1.top100 <- prune_taxa(top100, ps1.top100)

pdf("bar.pdf") plot_bar(ps1.top100, x="Subject", fill="Genus") + facet_wrap(~pm, scales="free_x") dev.off()

2) In “bar.pdf” the 2 most abundant genus are different (Fusobacterium and streptococcus) from that obtained in approach 1.

The bacteria I see in bar plot are the correct ones. I am wondering why I am getting different answer with approach 1 as I don’t see any thing wrong with the approach. Which approach is wrong? I would appreciate for any help here.

Best wishes, bhamal

benjjneb commented 6 years ago

You may be interested in the decontam software package we have developed, that is designed for exactly this purpose of identifying contaminants microbiome sequencing data.

Preprint: Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data

R package website: https://benjjneb.github.io/decontam/

As for why approach 1 has gone wrong, I'm not seeing it immediately, but I think it is suspicious that the two genera you've printed out seem likely to be the first two names in an alphabetical ordering of genus names.

Bhamal commented 6 years ago

Hi Benjamin,

Thank you for suggesting 'decontam'. I am going to try that out.

Regarding the first approach, now that you pointed out I think the same. I extracted the OTU table and replaced the sequences (column names) with genus level taxa as I show in step 2 of approach 1. So, it looks like the order of sequences in OTU column names is not same as genus level taxa names I extract as, genus.names.prop <- names(table(tax_table(ps.prop.genus)[, "Genus"], exclude = NULL))

If so, could you please suggest how else I can achieve that, ie. replace sequences in columns with genus names. This will be useful for reasons other than decontamination as well.

br,

benjjneb commented 6 years ago

Typo?

genus.names.prop <- names(table(tax_table(ps.prop.genus)[, "Genus"], exclude = NULL)) rownames(otu.genus.prop) <- genus.names

Bhamal commented 6 years ago

Sorry for the typo! but it was only in this page. There is no typo in my code. So, the problem seems to be that the 'genus.names.prop' is ordered alphabetically. Thus, it seems I cannot replace the sequences with the 'genus.names.prop' generated in that way.

I wonder if there is any other straight forward way to rename sequences in OTU table with taxa name as it is desirable for many purposes, at least for visualization or presentations.

best wishes,

benjjneb commented 6 years ago

I'm not seeing the issue, but you might want to repost this on the phyloseq Issues forum, since that is the package involved.

benjjneb / dada2

Investigating taxa in control samples #557

samdf1 is a data frame that contains only 2 control samples