campbio / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data
http://bioconductor.org/packages/celda
MIT License

using decontX on combined object, or on each sample individually? #334

Closed gruensch closed 3 years ago

gruensch commented 3 years ago

Hi, when validating our dataset, we found that genes specific to erythroid cells (such as HBB), as well as immunoglobulin genes (e.g. IGHA1), appear quite frequently in cells that should not express them (Fig A+B).

Thus, I gave decontX a try. The manual states that "decontX is run on cells from each batch separately." As we ran each 10x library prep separately per patient (unfortunately, freezing / thawing was not an option), each patient / library is an independent batch. So I ran decontX on each sample independently using the default internal clustering, which helped a little (Fig C+D). However, for samples that did not have a more or less equal distribution of cell types (due to disease), decontX did not perform as well.

Since all cell types are well represented in the combined object, I also ran decontX on the merged count matrix, using our annotation as the clusters (z parameter). To my surprise, this worked very well (Fig E+F).

Hence my question to you: is it OK to use the latter approach, provided we can validate that it helps remove the ambient RNA and makes sense biologically?

I have validated the approach in two ways: by looking at the expression of well-known marker genes in UMAP space (looks very clean), and by checking which genes are corrected the most, i.e. subtracting the normalised rowSums of the decontaminated count matrix from the normalised rowSums of the contaminated (raw) count matrix. The latter also looked very promising (the top genes are full of hemoglobin and IG genes). Is there an even better way to validate the outcome?
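For readers who want to reproduce the second check, here is a minimal sketch in base R. It is not the exact calculation used above: it ranks genes by the fraction of their total counts that was removed, which is one simple variant of the rowSums comparison. The two small matrices and their values are made up for illustration and stand in for the raw and decontaminated count matrices (genes in rows, cells in columns).

```r
# Toy genes x cells matrices; the values are invented for illustration only.
raw_counts <- matrix(
  c(50, 40, 60,   # HBB: mostly ambient signal in this toy example
    10, 12,  8,   # IGHA1
    30, 28, 35),  # ACTB
  nrow = 3, byrow = TRUE,
  dimnames = list(c("HBB", "IGHA1", "ACTB"), paste0("cell", 1:3))
)
decont_counts <- matrix(
  c( 5,  4,  6,
     9, 11,  8,
    29, 27, 34),
  nrow = 3, byrow = TRUE,
  dimnames = dimnames(raw_counts)
)

# Per-gene totals before and after decontamination.
raw_total <- rowSums(raw_counts)
decont_total <- rowSums(decont_counts)

# Fraction of each gene's counts that was removed.
# pmax() avoids division by zero for genes with no raw counts.
removed_fraction <- (raw_total - decont_total) / pmax(raw_total, 1)

# Genes ranked from most to least corrected.
ranked <- sort(removed_fraction, decreasing = TRUE)
print(ranked)
```

With real data, the two matrices would presumably be taken from the object returned by decontX (for example the original counts assay and the decontaminated counts assay), and the top of `ranked` is where known ambient genes such as hemoglobin and IG genes would be expected to appear.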

[Figure: decontX results, panels A to F; the x axis shows the different cell clusters]

joshua-d-campbell commented 3 years ago

Hi @gruensch, thanks for your question. Yes, it is true that if some cell types are not well represented in the final filtered dataset but were present in the cell suspension (e.g. red blood cells), then the original way of running decontX may not work well at subtracting out the counts for their marker genes. Although not ideal, I don't see too much of a problem with the way you ran it. However, there is one alternative. We have released a new version on the master branch on GitHub where you can supply a raw matrix (i.e. the one that still contains empty droplets). By estimating the ambient RNA distribution from the empty droplets, decontX should work better in your scenario. Here is the code to install the latest version with an updated vignette:

install.packages("devtools")
library(devtools)
install_github("campbio/celda", build_vignettes = TRUE)
vignette("decontX")

And here is the code to run it with the raw matrix:

sce <- decontX(sce, background = raw)

where "raw" is the raw matrix for the same sample, with the same number of rows as "sce" and appropriate row and column names. Note that you will have to run it for each sample separately. You can import the raw matrix using the function "importCellRanger" from the "singleCellTK" package:

library(singleCellTK)
raw <- importCellRanger("path/to/10X_Directory", dataType="raw")

FYI, I'm going to move this issue to the "Discussions" tab rather than "Issues". Just let us know if you have other questions!