JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data
http://jef.works/STdeconvolve/
98 stars 12 forks source link

Match annotation and deconvolved cell types #43

Open xieaq opened 8 months ago

xieaq commented 8 months ago

Thanks for the tool, I am truly impressed by the remarkable findings presented in your work.

I have been working on implementing your simulation for MPOA. However, when attempting to compare the deconvolved cell-types with the ground truth, I've encountered a challenge. The heatmap generated does not exhibit a clearly defined diagonal block, making it difficult to match the deconvolved cell-types with the ground truth cell-types. In this context, I am seeking guidance on the process of matching the deconvolved cell-types and calculating the Root Mean Square Error (RMSE) concerning the ground truth for STdeconvolve.

image

Furthermore, I have been working through the GSEA tutorial to assign deconvolved cell types to their corresponding ground truth cell types. However, I have encountered challenges, as there are instances where the process does not yield the desired results.

Even when following the tutorial's provided examples, it remains a complex task to effectively match the deconvolved cell types to a specific annotation. I am seeking guidance on how to achieve a more reliable and accurate matching process.

image image

Any insights or suggestions you can provide regarding this issue would be immensely valuable to me.

Thank you once again for your time and assistance.

bmill3r commented 8 months ago

Hi @xieaq,

So sorry for the delay. I have been gone on vacation and just getting back now. Additionally, I am in a new position and have reduced capacity to respond to issues.

With respect to the difference between performing a transcriptional correlation or using GSEA to annotate the deconvolved cell types, If cell types are very distinct then I would expect the correlation and the GSEA to agree, but for more similar cell types there can be discrepancies.

In the case of the mOB, the mitral and outer plexiform layers are pretty similar transcriptionally, so when performing the GSEA, it’s likely that cell types 8 and 7 probably have transcriptional profiles that contain marker genes from both the outer plexiform and mitral cell layers. celltype_annotations$predictions returns the top predicted cell types based on GSEA, but it is worth looking at the celltype_annotations$results because it’s likely that you will see additional cell types that are significantly enriched. “predictions” breaks highly significant ties by looking at the enrichment scores, too, but this is only a prediction.

In your case, if you have the references for all of the cell types of interest, then I would recommend using the correlation between the reference cell type transcriptional profiles and the deconvolved cell type transcriptional profiles (the beta, not the theta). If you have cases where the correlations between a deconvolved cell type and reference cell types are very similar, then I would suggest using the GSEA to see which of the reference cell types may have the most marker genes enriched in the deconvolved cell type.

Hope this helps, Brendan

bmill3r commented 8 months ago

A few details about figure d in the paper:

Neuronal cell types were a major cell type in this dataset and contained the largest transcriptional variation in the data. We initially correlated against just the major cell types. But when we expanded the number of deconvolved cell types, we then obtained deconvolved cell types that highly correlated with the microglia and pericytes, likely because when using the major cell types, topics were first assigned to the neuronal cells, but once that variation was captured, topics were then assigned to cell types that represented less variation in the data. Pericytes and microglia were present at relatively small proportions, which is consistent with this explanation. You can check out the expanded correlation plot in the supplementary figures of the paper.

xieaq commented 7 months ago

Thank you for your kind assistance! I apologize for my delayed response and the numerous ongoing questions...

I conducted an analysis on one Bregma of the MPOA dataset, comparing the ground truth beta values with the simulated data. The results look promising! However, I remain uncertain if it's appropriate to simply conclude that 'deconvolved_cell_type_8' matches 'Pericytes.' This aspect consistently confuses me during the deconvolution process.

image

I'm also interested in learning how to compare the gene ranking, as shown in the paper. Unfortunately, I haven't been able to locate the relevant code for this purpose...

image
bmill3r commented 6 months ago

Hi @xieaq,

In this case I believe you are using the lsat hungarian sort algorithm to pair up the ground truth with the deconvolved cell type topics. While this algorithm attempts to pair up each of the ground truth cell types with a deconvolved one, it does not mean that the pairs are correct. In fact, the algorithm will attempt to create pairs regardless of the data, even if there are no correlations at all. It is a useful algorithm for visually organizing the data, but I would still defer to the actual correlations to decide which deconvolved cell type corresponds to a given ground truth. You can see that several deconvolved cell types correlate strongly with inhibitory and excitatory neurons. Here, LDA is restricted to 9 topics, but in this dataset there are 135 marker genes that were specifically chosen to identify neurons. So most of the transcriptional variation in the data is across neurons, so LDA keeps splitting the neuronal variation into more topics. With 9 topics, pericytes do not appear to correspond to any of the deconvolved cell types, likely because their variation in the data is too small for LDA to assign one of the 9 topics to it. I bet if you expand the number of topics, some of them will start to correlate with pericytes (and astrocytes, too). We talk more about this in the supplementary notes of the paper, and you can see in the supplementary figures that we do identify all of the major cell types and most of the neuronal subtypes if we expand the number of topics. This is a good example in which knowledge of the data (ie the marker genes chosen for neurons) can help instruct choice of the number of topics.

With respect to the gene ranking, we essentially used ggplot2::geom_hex() using the ranks of the genes for each deconvolved and ground truth cell type pair. For example:

matches <- stdecon_100um_9ct_annot ## the matched cell-types and ground truth labels, a named vector
m1 <- mpoaLDAs_100um_Deconv$`9`$beta ## deconvolved txn profiles beta matrix
m2 <- as.matrix(majorCellTypes_100um$gtCtGenes) ## ground truth gexp beta matrix

## make sure same genes and use this vector to keep genes in same order
sharedGenes <- intersect(colnames(m1), colnames(m2))
## rename deconvolved cell-types to correspond to the `matches`
rownames(m1) <- paste0("Cell-type ", rownames(m1))

summary_df <- do.call(rbind, lapply(names(matches), function(i){

        # get paired ground truth ct
        ct <- as.vector(matches[i])
        ## the deconvolved cell-type
        topic <- i

        topic_betas <- m1[topic, sharedGenes]
        ct_betas <- m2[ct, sharedGenes]

        ## values for a given topic=ground truth pair.
        ## append these together into final summary df
        df <- data.frame(gexp = ct_betas,
                         beta = topic_betas,
                         gexp_rank = rank(ct_betas),
                         beta_rank = rank(topic_betas),
                         pair = paste0(topic, " vs ", ct))

        ## plotting for a specific topic-ground truth cell-type comparison
        # x <- df$gexp_rank
        # y <- df$beta_rank
        # 
        # # prsn <- cor(x, y, method = "pearson")
        # # sprn <- cor(x, y, method = "spearman")
        # p <- ggplot(data = ct_df) +
        #         geom_point(aes(x = gexp_rank, y = beta_rank))
        #         # labs(title = paste("Beta pairs: Topic", topic, "vs", ct, "; cor =", round(corMtx[topic, ct],2), "\n",
        #         #                    "pearson =", round(prsn, 2), "spearman =", round(sprn, 2)))
        # print(p)
        # df
})) 

p3 <- ggplot2::ggplot(data = summary_df) +
  ggplot2::geom_hex(ggplot2::aes(x = gexp_rank, y = beta_rank), bins=10) +
  ggplot2::scale_fill_gradient(low = "white", high = "red") +
  ggplot2::labs(title = "Cell-type transcriptional profiles",
                        x = "Ground truth gene expression level",
                        y = "Deconvolved gene expression level") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(size=15, color = "black"),
                 axis.text.y = ggplot2::element_text(size=15, color = "black"),
                 axis.title.y = ggplot2::element_text(size=15),
                 axis.title.x = ggplot2::element_text(size=15),
                 plot.title = ggplot2::element_text(size=0),
                 legend.text = ggplot2::element_text(size = 15, colour = "black"),
                 legend.title = ggplot2::element_text(size = 15, colour = "black", angle = 90),
                 panel.background = ggplot2::element_blank(),
                 plot.background = ggplot2::element_blank()
                 # legend.position="none"
                 ) +
  ggplot2::coord_equal() +

  ## fix up colorbar legend
  # ggplot2::scale_fill_gradientn(limits = c(-1,1),
  #                                breaks = c(-1,0,1),
  #                                colors=(grDevices::colorRampPalette(c("blue","white","red")))(n = 209)
  #                                  ) +
  ggplot2::guides(fill = ggplot2::guide_colorbar(title = "Count",
                                                 title.position = "left",
                                                 title.hjust = 0.5,
                                                 ticks.colour = "black",
                                                 ticks.linewidth = 2,
                                                 frame.colour= "black",
                                                 frame.linewidth = 2,
                                                 label.hjust = 0
                                                 ))
print(p3)

ggplot2::ggsave(filename = paste0("Figure2-mpoa_stdecon_9ct_txnCor_hex.pdf"),
                 device = "pdf",
                 # dpi = 300,
                 path = fig_path,
                 scale = 1,
                 width = 5,
                 height = 5,
                 units = c("in")
                )

Hope this helps, Brendan

xieaq commented 6 months ago

Hi Dr.@bmill3r

Thank you for your patience and assistance!

I have noticed that as the number of topics expanded, some of them began to correlate with pericytes.

However, I am curious about how to check the RMSE now. RMSE is computed for each pixel between the deconvolved and matched ground truth cell-type proportions in the ST dataset. Should I set the number of topics to be the same as the number of cell types, or is it acceptable to group some topics together when using more topics?

Your insights on this matter would be greatly appreciated!