benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/
148 stars 25 forks source link

Weird decontam performance #105

Open Gian77 opened 3 years ago

Gian77 commented 3 years ago

Hello,

I am trying to figure out what is wrong with my analysis. I am trying to decontaminate this dataset

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4723 taxa and 165 samples ]
sample_data() Sample Data:       [ 165 samples by 16 sample variables ]
tax_table()   Taxonomy Table:    [ 4723 taxa by 13 taxonomic ranks ]
refseq()      DNAStringSet:      [ 4723 reference sequences ]

I have 10 negative controls distributed in two series of PCRs

  is.neg Plate  n
1  FALSE    P1 90
2  FALSE    P2 65
3   TRUE    P1  6
4   TRUE    P2  4

This is what my code

> contam_uparse <-
+   decontam::isContaminant(
+     physeq_uparse,
+     method = "prevalence",
+     batch = "Plate",
+     neg = "is.neg",
+     threshold = 0.1)
> table(contam_uparse$contaminant) 

FALSE  TRUE 
 3947   776 

The weird thing is that I have only 99 taxa among my negative controls, how is it possible that the function finds 776?

> prune_taxa(
+   taxa_sums(
     subset_samples(physeq_uparse, is.neg%in%"TRUE")) > 0,
+   subset_samples(physeq_uparse, is.neg%in%"TRUE")
)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 99 taxa and 10 samples ]
sample_data() Sample Data:       [ 10 samples by 16 sample variables ]
tax_table()   Taxonomy Table:    [ 99 taxa by 13 taxonomic ranks ]
refseq()      DNAStringSet:      [ 99 reference sequences ]

I have tried to debug this in many ways but I do not found what I doing wrong. Comparing the OTUs found in the Negative controls with the 776 detected as contaminants looks like that most of them are not even in there.

> setdiff(as.vector(otus_in_neg), as.vector(decontam_otus))
 [1] "OTU_1497" "OTU_50"   "OTU_5"    "OTU_4"    "OTU_3"    "OTU_401"  "OTU_14"   "OTU_1"    "OTU_52"   "OTU_10"   "OTU_25"   "OTU_208"  "OTU_6"    "OTU_17"   "OTU_361"  "OTU_104"  "OTU_78"  
[18] "OTU_55"   "OTU_51"   "OTU_180"  "OTU_164"  "OTU_272"  "OTU_11"   "OTU_46"   "OTU_20"   "OTU_38"   "OTU_241"  "OTU_122"  "OTU_18"   "OTU_45"   "OTU_64"   "OTU_1106" "OTU_42"   "OTU_1064"
[35] "OTU_88"   "OTU_318"  "OTU_41"   "OTU_116"  "OTU_358"  "OTU_30"   "OTU_385"  "OTU_49"   "OTU_200"  "OTU_19"   "OTU_48"   "OTU_93"   "OTU_3997" "OTU_43"   "OTU_91"   "OTU_595"  "OTU_225" 
[52] "OTU_212"  "OTU_731"  "OTU_307"  "OTU_685"  "OTU_74"   "OTU_505"  "OTU_597"  "OTU_253"  "OTU_250"  "OTU_178"  "OTU_423"  "OTU_531"  "OTU_128"  "OTU_249"  "OTU_626"  "OTU_270"  "OTU_2627"
[69] "OTU_679"  "OTU_325"  "OTU_624"  "OTU_454"  "OTU_425"  "OTU_4200" "OTU_435"  "OTU_966"  "OTU_540"  "OTU_452"  "OTU_843"  "OTU_5867" "OTU_1743" "OTU_911"  "OTU_5370" "OTU_1477" "OTU_5022"
[86] "OTU_2008" "OTU_2241" "OTU_5046"

Any help to understand?

Thanks a lot!

Gian

benjjneb commented 3 years ago

The first diagnostic I might try is to remove the batch="Plate" parameter in the isContaminant call, and see if there is some weirdness associated with how batching is being done?

Outside of that, I do not see an obvious reason for what you are seeing, and you shouldn't be getting more contaminant calls than taxa present in the negative controls using the prevalence method alone.

If the batch investigation doesn't turn up a solution, would it be possibel for you to share this data object with me so I can poke at it?

Gian77 commented 3 years ago

Hello @benjjneb,

thanks for your email. I did a quick test without the batch option and the number of contaminants is lower than before but still way higher than 99...

I am happy to send you the data but since it is unpublished can I just send it to you, privately? I am not sure there is a way to do that in github...

Thanks much fr your help!

Gian

benjjneb commented 3 years ago

Sure, you can email me at benjamin DOT j DOT callahan AT gmail DOT com

Preferably summarized data (not raw data), for example the phyloseq object, along with the code you are running to see this result.

Gian77 commented 2 years ago

Awesome, thank you @benjjneb, an email is coming to your way.

Gian77 commented 2 years ago

Hello @benjjneb Just wondering if you recevied and got a chance to look into the data I sent you a while ago by email. Thanks a lot, Gian

benjjneb commented 2 years ago

Apologies, I did not yet. I transitioned computers about this time and fell out of my to-do list.

I found your previous email and will take a look.

Gian77 commented 2 years ago

Awesome, thanks a lot.

benjjneb commented 2 years ago

I think what is going on here is some issue with the phyloseq code and interpretation of those results, and not anything to do with decontam. My initial code and results (below) line up with what you posted:

library(phyloseq)
library(decontam)
setwd("~/Desktop/Gian/")

ps <- readRDS("physeq_uparse.rds")
table(sample_data(ps)$is.neg, useNA="always")
## FALSE  TRUE  <NA> 
## 155    10     0 

contam_uparse <-
   decontam::isContaminant(
     ps,
     method = "prevalence",
     batch = "Plate",
     neg = "is.neg",
     threshold = 0.1)
table(contam_uparse$contaminant) 
## FALSE  TRUE 
## 3947   776

However, I couldn't get your next code line/block to run, so I tried to recapitulate it with my own code:

psn <- prune_samples(sample_data(ps)$is.neg, ps)
table(taxa_sums(psn)>0)
## FALSE  TRUE 
##  2124  2599 

table(is.contam=contam_uparse$contaminant, 
    in.neg=taxa_sums(psn)>0)
##          in.neg
## is.contam FALSE TRUE
##     FALSE  2124 1823
##     TRUE      0  776

These results show that there are 2599 taxa present in the negative control samples, and that all taxa identified as contaminants by decontam-prevalence were also present in the negative control samples, as expected.

Gian77 commented 2 years ago

Thank you @benjjneb

I am kind of stil confused. I tried to run your code and my code again. I think the issue comes from prune_samples that pool out the wrong samples instead of the real controls, compared to subset_samples.

Please see the code below

> psn <- prune_samples(sample_data(ps)$is.neg, ps)
> psn
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4723 taxa and 10 samples ]
sample_data() Sample Data:       [ 10 samples by 16 sample variables ]
tax_table()   Taxonomy Table:    [ 4723 taxa by 13 taxonomic ranks ]
refseq()      DNAStringSet:      [ 4723 reference sequences ]
> table(taxa_sums(psn)>0)
FALSE  TRUE 
 2124  2599 
> psn_new <- subset_samples(ps, is.neg%in%"TRUE")
> psn_new
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4723 taxa and 10 samples ]
sample_data() Sample Data:       [ 10 samples by 16 sample variables ]
tax_table()   Taxonomy Table:    [ 4723 taxa by 13 taxonomic ranks ]
refseq()      DNAStringSet:      [ 4723 reference sequences ]
> table(taxa_sums(psn_new)>0)
FALSE  TRUE 
 4624    99 
> head(otu_table(psn))
OTU Table:          [6 taxa and 10 samples]
                     taxa are rows
         bcse131 bcse145 bcse63 bcse45 bcse104 bcse101 bcse115 bcse95 bcse81 bcse123
OTU_1497     328     373   6600    133    1309    1508      83  10760   5809      66
OTU_50       334    3485      0     42       0       0     101      0      0     277
OTU_31       509     178    517    465      11      16     301    413    273      11
OTU_290       50       2      0    406       0       0       0      0      0       0
OTU_191        2       6    674      3       1       0       0     12      9       0
OTU_539        0       0      7      0      35      74       0     10     51       0
> head(otu_table(psn_new))
OTU Table:          [6 taxa and 10 samples]
                     taxa are rows
         bcse122 bcse74 bcse157 bcse143 bcse112 bcse94 bcse60 bcse47 bcse26 bcse16
OTU_1497       0     81       0       0       0      0      0      0      0      0
OTU_50         1      0       0       0       0      1      0      0      0      0
OTU_31         0      0       0       0       0      0      0      0      0      0
OTU_290        0      0       0       0       0      0      0      0      0      0
OTU_191        0      0       0       0       0      0      0      0      0      0
OTU_539        0      0       0       0       0      0      0      0      0      0
> sample_data(ps)[1:12,]
Sample Data:        [12 samples by 16 sample variables]:
        BarcodeSequence   LinkerPrimerSequence bcsenumb marker Plate   Niche           Crop    Code    Plot Description
bcse26       ACTGCTATCG CTTGGTCATTTAGAGGAAGTAA   bcse26    ITS    P1 Control        Control Control Control          C-
bcse16       ATCTTGGAGT CTTGGTCATTTAGAGGAAGTAA   bcse16    ITS    P1 Control        Control Control Control          C-
bcse47       ACCTTGACAA CTTGGTCATTTAGAGGAAGTAA   bcse47    ITS    P1 Control        Control Control Control          C-
bcse60       CACACTGAAG CTTGGTCATTTAGAGGAAGTAA   bcse60    ITS    P1 Control        Control Control Control          C-
bcse94       GAGGTATTCT CTTGGTCATTTAGAGGAAGTAA   bcse94    ITS    P1 Control        Control Control Control          C-
bcse112      CCATATCCCG CTTGGTCATTTAGAGGAAGTAA  bcse112    ITS    P2 Control        Control Control Control          C-
bcse157      TCGGTAGCAA CTTGGTCATTTAGAGGAAGTAA  bcse157    ITS    P2 Control        Control Control Control          C-
bcse143      TACAAGTGGT CTTGGTCATTTAGAGGAAGTAA  bcse143    ITS    P2 Control        Control Control Control          C-
bcse122      TTGAGTGGTC CTTGGTCATTTAGAGGAAGTAA  bcse122    ITS    P2 Control        Control Control Control          C-
bcse74       GGTCATCACG CTTGGTCATTTAGAGGAAGTAA   bcse74    ITS    P1 Control        Control Control Control          C-
bcse64       TTCTTCTACC CTTGGTCATTTAGAGGAAGTAA   bcse64    ITS    P1    Leaf           Corn      G1      R4      LFG1r4
bcse152      CGTATGCCGT CTTGGTCATTTAGAGGAAGTAA  bcse152    ITS    P2    Soil Native Grasses      G7      R1      SLG7r1
        Richness   Shannon InvSimpson LibrarySize Index is.neg
bcse26         1 0.0000000   1.000000           1     1   TRUE
bcse16         2 0.6931472   2.000000           2     2   TRUE
bcse47         3 1.0986123   3.000000           3     3   TRUE
bcse60         1 0.0000000   1.000000          48     4   TRUE
bcse94         5 0.1617183   1.056123         148     5   TRUE
bcse112        4 0.6680875   1.746898         176     6   TRUE
bcse157        5 0.7642251   1.966285         216     7   TRUE
bcse143        5 0.7735465   2.040226         311     8   TRUE
bcse122       79 2.5729780   4.977436         432     9   TRUE
bcse74         6 1.5614922   4.475736         548    10   TRUE
bcse64        94 3.4320121  15.989638         662    11  FALSE
bcse152      726 5.0419726  52.839258       12776    12  FALSE

The real control samples are those from subset_samples thats is psn_new as you can see above from the sample_data(ps) above. All the controls match the samples I was able to pool out with subset_samples.

> sample_names(psn)
 [1] "bcse131" "bcse145" "bcse63"  "bcse45"  "bcse104" "bcse101" "bcse115" "bcse95"  "bcse81"  "bcse123"
> sample_names(psn_new)
 [1] "bcse122" "bcse74"  "bcse157" "bcse143" "bcse112" "bcse94"  "bcse60"  "bcse47"  "bcse26"  "bcse16" 

Am I wrong? I think decontam filter the ps object the same way... is that possible? Thanks a lot again for spending the time with this. Gian

benjjneb commented 2 years ago

OK, so there is something wrong with the phyloseq object itself -- the samples in the otu_table and sample_data are not in matched order...

head(sample_names(sample_data(ps)))
## [1] "bcse26"  "bcse16"  "bcse47"  "bcse60"  "bcse94"  "bcse112"

head(sample_names(otu_table(ps)))
## [1] "bcse131" "bcse145" "bcse63"  "bcse45"  "bcse104" "bcse101"

head(sample_names(ps))
## [1] "bcse131" "bcse145" "bcse63"  "bcse45"  "bcse104" "bcse101"

# Different sample order in sample_data!?!?

My understanding is that this shouldn't be possible in a phyloseq object, at least one constructed through normal operations. I don't really use phyloseq myself anymore, but a big part of its purpose was to take care of the harmonization between the different data objects, including consistent ordering of samples. As a quick check, I loaded up one of the data examples in the package, and it does have samples in matched order:

data(GlobalPatterns)
gp <- GlobalPatterns
head(sample_names(sample_data(gp)))
## [1] "CL3"     "CC1"     "SV1"     "M31Fcsw" "M11Fcsw" "M31Plmr"

head(sample_names(otu_table(gp)))
## [1] "CL3"     "CC1"     "SV1"     "M31Fcsw" "M11Fcsw" "M31Plmr"

head(sample_names(gp))
## [1] "CL3"     "CC1"     "SV1"     "M31Fcsw" "M11Fcsw" "M31Plmr"

There is a data-forensic question here: How did this phyloseq object get its otu table and sample data in mismatched order? But as for applying decontam to this phyloseq object, this is definitely a problem and almost certainly explains the aberrant results you are observing by causing the wrong samples to be identified as the controls.

Kudos to you for looking at your data critically and catching this. For moving forward with decontam, regenerating the phyloseq object so that samples are in matched order should do the trick. Alternatively, you can just pull out the necessary components into base R objects and do it by hand as well.

Gian77 commented 2 years ago

Hey @benjjneb

Thanks a lot and yeah, that's very interesting. I am also not using phyloseq very often, but I did for this dataset since involved other collaborators that were using it.

I indeed thought that phyloseq was able to reorganize the data itself, as you said, I think data organization and harmonization was the main purpose of that package that basically doesn't add anything new to vegan in terms of analysis power.

This is how I created the object:

mapping <-read.delim("clustering_test/mapping_ITS.txt", row.names=1, header=TRUE, sep="\t")
table_uparse <- read.delim("clustering_test/otu_table_ITS_UPARSE_R1.txt",row.names=1, header=TRUE, sep="\t") 
constax_uparse <-read.delim("clustering_test/constax_taxonomy_UPARSE.txt",header=TRUE, row.names=1, sep="\t")
otu_uparse <- readDNAStringSet("clustering_test/otus_R1.fasta", format="fasta", seek.first.rec=TRUE, use.names=TRUE)

physeq_uparse <-
  phyloseq(
    otu_table(table_uparse, taxa_are_rows = TRUE),
    sample_data(mapping),
    tax_table(as.matrix(constax_uparse)),
    otu_uparse)

I've laways used decontam through phyloseq, but I'd love to avoid it since I am using it less and less... Thanks again, I am glad we solved the mistery ;) Gian

benjjneb commented 2 years ago

Would you mind sharing the underlying files used to create the phyloseq with me? I'm still curious about how that is happening given what looks like standard construction of a phyloseq object. I'd be interested to poke at this example, and see if an issue needs to be posted to the phyloseq forum.

Gian77 commented 2 years ago

Sure, let me double check the files and I will write you an email.