JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data
http://jef.works/STdeconvolve/
98 stars 12 forks source link

How to work on Slide-seq adata object? #32

Closed ShaowenJ closed 1 year ago

ShaowenJ commented 1 year ago

Dear STdeconvolve developers,

I am very interested in using STdeconvolve to explore my dataset, I know this maybe a silly question. But how to load Slide-seq data to the workflow of STdeconvolve? I am familiar with R and Python, maybe point me to some documents or tell me the strategy would be fine.

Thanks!

bmill3r commented 1 year ago

Hi @Lesdormis,

Thanks for your interest in STdeconvolve!

The tutorial Getting started with STdeconvolve might be a good place to start. You'll see that you will need the pixel x gene (bead x gene for Slideseq) count matrix. In the tutorial, this is the object cd. Note that the gene counts need to be the non-negative integer counts and not log transformed.

Let me know if this helps or if you have any other questions, Brendan

ShaowenJ commented 1 year ago

Hi @bmill3r Thanks for the quick response. I quickly try with the STdeconvolve followed the tutorial on our Slide-seq data and get some results, Does this figure suggest STdeconvolve not function well here? Or should I try different sets of parameters with Slide-seq platform? image

below is my code

# already filter
counts <- cleanCounts(counts = cd,
                      min.lib.size = 0,
                      min.reads = 0,
                      min.detected = 0,
                      verbose = TRUE)
## feature select for genes
corpus <- restrictCorpus(counts,
                         removeAbove = 1.0,
                         removeBelow = 0.05,
                         alpha = 0.05,
                         plot = TRUE,
                         verbose = TRUE)
# 447 genes with 18661 beads

ldas <- fitLDA(t(as.matrix(corpus)), Ks = seq(2, 9, by = 1),
               perc.rare.thresh = 0.05,
               plot=TRUE,
               verbose=TRUE)
## select model with minimum perplexity
optLDA <- optimalModel(models = ldas, opt = "min")

## extract pixel cell-type proportions (theta) and cell-type gene expression profiles (beta) for the given dataset
## we can also remove cell-types from pixels that contribute less than 5% of the pixel proportion
## and scale the deconvolved transcriptional profiles by 1000 
results <- getBetaTheta(optLDA,
                        perc.filt = 0.05,
                        betaScale = 1000)
deconProp <- results$theta
deconGexp <- results$beta
vizAllTopics(deconProp, pos, 
             r=0.4)

Thanks!!!

bmill3r commented 1 year ago

Hi @Lesdormis,

It looks like STdeconvolve is working, but I think the issue here is just a visualization one. Basically, the width of the outer rings of the scatterpies are probably too large and blocking you from seeing the cell type proportions. Adding the argument lwd=0.0 in vizAllTopics() should remove the outer rings allowing you to see the proportions.

vizAllTopics(deconProp, pos, 
             r=0.4.
             lwd=0.0)

However, I noticed that the number of deconvolved cell types is 2, which I don't think will be very informative so I would suggest using a model with a higher K, which you can choose using:

optLDA <- optimalModel(models = ldas, opt = K)

where K is a positive integer. In your case, you fitted LDA models with K's 2 through 9. I would recommend looking at the perplexity plot that was returned by fitLDA() to help guide your choice of K.

You can also regenerate the plot via:

perplexityPlot(ldas)

One last note: if you have an idea of how many cell types you might expect in the given tissue, then I would recommend training a set of LDA models with K's around this value in the function fitLDA()

Hope this helps and let me know if you have any other questions, Brendan

ShaowenJ commented 1 year ago

Hi @bmill3r, Thanks for your suggestions! These are lymph node data, and we are hoping to see whether we can see one-cell per bead resolution. But my preliminary analysis shows that we could see some beads mixed with two cells, like T-B cells, Macrophage-T cells. So we are trying to if we can recover this information instead of using immune markers expression.

So here is the fitLDA figure, image Maybe the dataset is not fittable for choosing a good k.

Thanks!

bmill3r commented 1 year ago

Hi @Lesdormis,

Thanks for providing additional information about your analysis! Just to clarify, K indicates the total number of cell types you are trying to deconvolve in the dataset as a whole, not what might be in each individual bead. But given it is Slide-seq data, you can filter the theta bead-cell type proportion matrix to only keep the top two or three cell types in each bead, for example.

Here is a function in the devel branch of the package that might be useful for this:

#' Function to reduce theta matrices down to the top X cell-types in each
#' pixel. 
#' 
#' @description The cell-types with the top X highest proportions are kept in each
#'     pixel and the rest are set to 0. Then renormalizes the pixel proportions to sum to 1.
#'     Cell-types that result in 0 in all pixels after this filtering step are removed.
#'
#' @param theta pixel (rows) by cell-types (columns) distribution matrix. Each row
#'     is the cell-type composition for a given pixel
#' @param top Select number of top cell-types in each pixel to keep (default: 3)
#' 
#' @return A filtered pixel (rows) by cell-types (columns) distribution matrix.
#' 
#' @noRd
reduceTheta <- function(theta, top=3){

  theta_filt <- do.call(rbind, lapply(seq(nrow(theta)), function(i){
    p <- theta[i,]
    thresh <- sort(p, decreasing=TRUE)[top]
    p[p < thresh] <- 0
    p
  }))

  colnames(theta_filt) <- colnames(theta)
  rownames(theta_filt) <- rownames(theta)

  theta_filt <- theta_filt/rowSums(theta_filt)
  ## if NAs because all cts are 0 in a spot, replace with 0
  theta_filt[is.na(theta_filt)] <- 0
  ## drop any cts that are 0 for all pixels
  theta_filt <- theta_filt[,which(colSums(theta_filt) > 0)]

  return(theta_filt)
}

With respect to the plot from perplexityPlot(), it looks like your set of selected informative genes (features) in the data is not sufficient for the model to pick up on distinct cell types. In my experience, this can happen with sparse data that is typical of Slide-seq. Ideally you want to identify models with an alpha < 1 and are not grayed out on the plot. You can try changing some of the parameters in the feature selection step to see if this helps. For example, there might be beads with poor capture efficiency, or genes that could be detected as overdispersed but may still have been captured at a level that is too low. You could also try being more conservative in terms of selecting informative genes.

There is a more comprehensive all-in-one function you can use for dataset filtering and feature selection:

corpus <- STdeconvolve::preprocess(cd,
                       removeAbove = 0.9, # remove genes in more that 90% of beads
                       removeBelow = NA, 
                       min.detected = 100, # minimum number of beads a gene must be present in
                       min.reads = 100, # minimum number of total counts across beads to keep a gene
                       min.lib.size = 100, # remove beads if they have less than 100 total gene counts
                       ODgenes = TRUE,
                       nTopOD = 400,
                       od.genes.alpha = 0.01, # be a little more conservative on the OD genes
                       verbose = TRUE)

Where cd is the original bead (row) x gene (columns) matrix with gene counts. (make sure it is in the right orientation)

With 18661 beads in your dataset, removing genes in less than 5% by default means that if a gene was detected in less than 933 beads it will be removed, but perhaps that is too stringent of a cutoff in this case. A specific cell type could very well be present in a subset of beads smaller than 900. So above, I set removeBelow=NA, but still remove genes if they are in less than 100 beads (min.detected = 100), for example. I also define poorly detected genes as those with less than 100 total counts in the entire dataset, but you can adjust this depending on the distribution of total counts for each gene in this dataset to determine what an appropriate cutoff might be.

Also, I set min.lib.size = 100 to remove beads that had less than 100 total gene counts and presumably have poor capture efficiency. You might want to adjust this number depending on the distribution of library sizes for the beads in your dataset.

As an example, I still restrict to the top 400 overdispersed genes by setting nTopOD=400 to prevent the number of informative features from becoming too large, and I am slightly more conservative in terms of selecting overdispersed genes by setting od.genes.alpha = 0.01, which adjusts the significance threshold.

Let me know if this helps! Brendan