Open Gian77 opened 3 years ago
The first diagnostic I might try is to remove the batch="Plate"
parameter in the isContaminant
call, and see if there is some weirdness associated with how batching is being done?
Outside of that, I do not see an obvious reason for what you are seeing, and you shouldn't be getting more contaminant calls than taxa present in the negative controls using the prevalence method alone.
If the batch
investigation doesn't turn up a solution, would it be possibel for you to share this data object with me so I can poke at it?
Hello @benjjneb,
thanks for your email. I did a quick test without the batch
option and the number of contaminants is lower than before but still way higher than 99...
I am happy to send you the data but since it is unpublished can I just send it to you, privately? I am not sure there is a way to do that in github...
Thanks much fr your help!
Gian
Sure, you can email me at benjamin DOT j DOT callahan AT gmail DOT com
Preferably summarized data (not raw data), for example the phyloseq object, along with the code you are running to see this result.
Awesome, thank you @benjjneb, an email is coming to your way.
Hello @benjjneb Just wondering if you recevied and got a chance to look into the data I sent you a while ago by email. Thanks a lot, Gian
Apologies, I did not yet. I transitioned computers about this time and fell out of my to-do list.
I found your previous email and will take a look.
Awesome, thanks a lot.
I think what is going on here is some issue with the phyloseq code and interpretation of those results, and not anything to do with decontam. My initial code and results (below) line up with what you posted:
library(phyloseq)
library(decontam)
setwd("~/Desktop/Gian/")
ps <- readRDS("physeq_uparse.rds")
table(sample_data(ps)$is.neg, useNA="always")
## FALSE TRUE <NA>
## 155 10 0
contam_uparse <-
decontam::isContaminant(
ps,
method = "prevalence",
batch = "Plate",
neg = "is.neg",
threshold = 0.1)
table(contam_uparse$contaminant)
## FALSE TRUE
## 3947 776
However, I couldn't get your next code line/block to run, so I tried to recapitulate it with my own code:
psn <- prune_samples(sample_data(ps)$is.neg, ps)
table(taxa_sums(psn)>0)
## FALSE TRUE
## 2124 2599
table(is.contam=contam_uparse$contaminant,
in.neg=taxa_sums(psn)>0)
## in.neg
## is.contam FALSE TRUE
## FALSE 2124 1823
## TRUE 0 776
These results show that there are 2599 taxa present in the negative control samples, and that all taxa identified as contaminants by decontam-prevalence were also present in the negative control samples, as expected.
Thank you @benjjneb
I am kind of stil confused. I tried to run your code and my code again. I think the issue comes from prune_samples
that pool out the wrong samples instead of the real controls, compared to subset_samples
.
Please see the code below
> psn <- prune_samples(sample_data(ps)$is.neg, ps)
> psn
phyloseq-class experiment-level object
otu_table() OTU Table: [ 4723 taxa and 10 samples ]
sample_data() Sample Data: [ 10 samples by 16 sample variables ]
tax_table() Taxonomy Table: [ 4723 taxa by 13 taxonomic ranks ]
refseq() DNAStringSet: [ 4723 reference sequences ]
> table(taxa_sums(psn)>0)
FALSE TRUE
2124 2599
> psn_new <- subset_samples(ps, is.neg%in%"TRUE")
> psn_new
phyloseq-class experiment-level object
otu_table() OTU Table: [ 4723 taxa and 10 samples ]
sample_data() Sample Data: [ 10 samples by 16 sample variables ]
tax_table() Taxonomy Table: [ 4723 taxa by 13 taxonomic ranks ]
refseq() DNAStringSet: [ 4723 reference sequences ]
> table(taxa_sums(psn_new)>0)
FALSE TRUE
4624 99
> head(otu_table(psn))
OTU Table: [6 taxa and 10 samples]
taxa are rows
bcse131 bcse145 bcse63 bcse45 bcse104 bcse101 bcse115 bcse95 bcse81 bcse123
OTU_1497 328 373 6600 133 1309 1508 83 10760 5809 66
OTU_50 334 3485 0 42 0 0 101 0 0 277
OTU_31 509 178 517 465 11 16 301 413 273 11
OTU_290 50 2 0 406 0 0 0 0 0 0
OTU_191 2 6 674 3 1 0 0 12 9 0
OTU_539 0 0 7 0 35 74 0 10 51 0
> head(otu_table(psn_new))
OTU Table: [6 taxa and 10 samples]
taxa are rows
bcse122 bcse74 bcse157 bcse143 bcse112 bcse94 bcse60 bcse47 bcse26 bcse16
OTU_1497 0 81 0 0 0 0 0 0 0 0
OTU_50 1 0 0 0 0 1 0 0 0 0
OTU_31 0 0 0 0 0 0 0 0 0 0
OTU_290 0 0 0 0 0 0 0 0 0 0
OTU_191 0 0 0 0 0 0 0 0 0 0
OTU_539 0 0 0 0 0 0 0 0 0 0
> sample_data(ps)[1:12,]
Sample Data: [12 samples by 16 sample variables]:
BarcodeSequence LinkerPrimerSequence bcsenumb marker Plate Niche Crop Code Plot Description
bcse26 ACTGCTATCG CTTGGTCATTTAGAGGAAGTAA bcse26 ITS P1 Control Control Control Control C-
bcse16 ATCTTGGAGT CTTGGTCATTTAGAGGAAGTAA bcse16 ITS P1 Control Control Control Control C-
bcse47 ACCTTGACAA CTTGGTCATTTAGAGGAAGTAA bcse47 ITS P1 Control Control Control Control C-
bcse60 CACACTGAAG CTTGGTCATTTAGAGGAAGTAA bcse60 ITS P1 Control Control Control Control C-
bcse94 GAGGTATTCT CTTGGTCATTTAGAGGAAGTAA bcse94 ITS P1 Control Control Control Control C-
bcse112 CCATATCCCG CTTGGTCATTTAGAGGAAGTAA bcse112 ITS P2 Control Control Control Control C-
bcse157 TCGGTAGCAA CTTGGTCATTTAGAGGAAGTAA bcse157 ITS P2 Control Control Control Control C-
bcse143 TACAAGTGGT CTTGGTCATTTAGAGGAAGTAA bcse143 ITS P2 Control Control Control Control C-
bcse122 TTGAGTGGTC CTTGGTCATTTAGAGGAAGTAA bcse122 ITS P2 Control Control Control Control C-
bcse74 GGTCATCACG CTTGGTCATTTAGAGGAAGTAA bcse74 ITS P1 Control Control Control Control C-
bcse64 TTCTTCTACC CTTGGTCATTTAGAGGAAGTAA bcse64 ITS P1 Leaf Corn G1 R4 LFG1r4
bcse152 CGTATGCCGT CTTGGTCATTTAGAGGAAGTAA bcse152 ITS P2 Soil Native Grasses G7 R1 SLG7r1
Richness Shannon InvSimpson LibrarySize Index is.neg
bcse26 1 0.0000000 1.000000 1 1 TRUE
bcse16 2 0.6931472 2.000000 2 2 TRUE
bcse47 3 1.0986123 3.000000 3 3 TRUE
bcse60 1 0.0000000 1.000000 48 4 TRUE
bcse94 5 0.1617183 1.056123 148 5 TRUE
bcse112 4 0.6680875 1.746898 176 6 TRUE
bcse157 5 0.7642251 1.966285 216 7 TRUE
bcse143 5 0.7735465 2.040226 311 8 TRUE
bcse122 79 2.5729780 4.977436 432 9 TRUE
bcse74 6 1.5614922 4.475736 548 10 TRUE
bcse64 94 3.4320121 15.989638 662 11 FALSE
bcse152 726 5.0419726 52.839258 12776 12 FALSE
The real control samples are those from subset_samples
thats is psn_new
as you can see above from the sample_data(ps)
above. All the controls match the samples I was able to pool out with subset_samples
.
> sample_names(psn)
[1] "bcse131" "bcse145" "bcse63" "bcse45" "bcse104" "bcse101" "bcse115" "bcse95" "bcse81" "bcse123"
> sample_names(psn_new)
[1] "bcse122" "bcse74" "bcse157" "bcse143" "bcse112" "bcse94" "bcse60" "bcse47" "bcse26" "bcse16"
Am I wrong? I think decontam filter the ps object the same way... is that possible? Thanks a lot again for spending the time with this. Gian
OK, so there is something wrong with the phyloseq object itself -- the samples in the otu_table
and sample_data
are not in matched order...
head(sample_names(sample_data(ps)))
## [1] "bcse26" "bcse16" "bcse47" "bcse60" "bcse94" "bcse112"
head(sample_names(otu_table(ps)))
## [1] "bcse131" "bcse145" "bcse63" "bcse45" "bcse104" "bcse101"
head(sample_names(ps))
## [1] "bcse131" "bcse145" "bcse63" "bcse45" "bcse104" "bcse101"
# Different sample order in sample_data!?!?
My understanding is that this shouldn't be possible in a phyloseq object, at least one constructed through normal operations. I don't really use phyloseq myself anymore, but a big part of its purpose was to take care of the harmonization between the different data objects, including consistent ordering of samples. As a quick check, I loaded up one of the data examples in the package, and it does have samples in matched order:
data(GlobalPatterns)
gp <- GlobalPatterns
head(sample_names(sample_data(gp)))
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw" "M31Plmr"
head(sample_names(otu_table(gp)))
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw" "M31Plmr"
head(sample_names(gp))
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw" "M31Plmr"
There is a data-forensic question here: How did this phyloseq object get its otu table and sample data in mismatched order? But as for applying decontam to this phyloseq object, this is definitely a problem and almost certainly explains the aberrant results you are observing by causing the wrong samples to be identified as the controls.
Kudos to you for looking at your data critically and catching this. For moving forward with decontam, regenerating the phyloseq object so that samples are in matched order should do the trick. Alternatively, you can just pull out the necessary components into base R objects and do it by hand as well.
Hey @benjjneb
Thanks a lot and yeah, that's very interesting. I am also not using phyloseq
very often, but I did for this dataset since involved other collaborators that were using it.
I indeed thought that phyloseq was able to reorganize the data itself, as you said, I think data organization and harmonization was the main purpose of that package that basically doesn't add anything new to vegan
in terms of analysis power.
This is how I created the object:
mapping <-read.delim("clustering_test/mapping_ITS.txt", row.names=1, header=TRUE, sep="\t")
table_uparse <- read.delim("clustering_test/otu_table_ITS_UPARSE_R1.txt",row.names=1, header=TRUE, sep="\t")
constax_uparse <-read.delim("clustering_test/constax_taxonomy_UPARSE.txt",header=TRUE, row.names=1, sep="\t")
otu_uparse <- readDNAStringSet("clustering_test/otus_R1.fasta", format="fasta", seek.first.rec=TRUE, use.names=TRUE)
physeq_uparse <-
phyloseq(
otu_table(table_uparse, taxa_are_rows = TRUE),
sample_data(mapping),
tax_table(as.matrix(constax_uparse)),
otu_uparse)
I've laways used decontam through phyloseq, but I'd love to avoid it since I am using it less and less... Thanks again, I am glad we solved the mistery ;) Gian
Would you mind sharing the underlying files used to create the phyloseq with me? I'm still curious about how that is happening given what looks like standard construction of a phyloseq object. I'd be interested to poke at this example, and see if an issue needs to be posted to the phyloseq forum.
Sure, let me double check the files and I will write you an email.
Hello,
I am trying to figure out what is wrong with my analysis. I am trying to decontaminate this dataset
I have 10 negative controls distributed in two series of PCRs
This is what my code
The weird thing is that I have only 99 taxa among my negative controls, how is it possible that the function finds 776?
I have tried to debug this in many ways but I do not found what I doing wrong. Comparing the OTUs found in the Negative controls with the 776 detected as contaminants looks like that most of them are not even in there.
Any help to understand?
Thanks a lot!
Gian