JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data

http://jef.works/STdeconvolve/

98 stars 12 forks source link

Giotto Object to STdeconvolve, and general questions about workflow, choosing K, cell type annotation, etc #35

Open chrkuo opened 1 year ago

chrkuo commented 1 year ago

Thank you so much for creating this tool for deconvolution without scRNAseq data.

I am working with an extremely rare pediatric tumor where the samples are few and far between and there are basically no publicly available scRNAseq reference dataset.

I am currently running my samples through Visium platform for SRT and using Giottosuite to ultimate analyze my data. The output is a Giotto object. I have read through the documentation using 10x data but I wasn't sure how to utilize Giotto object in the code for setting up to perform STdeconvolve. Can I please get some assistance.

Thank you so much I cannot wait to use STdeconvolve

bmill3r commented 1 year ago

Hi @chrkuo,

Thanks for reaching out and your interest in using STdeconvolve!

STdeconvolve doesn't have functions that directly interface with Giotto objects, however, all you really need is the raw gene x barcodes counts matrix (where the counts are non-negative integers). Because you're already using Visium to generate your data, this should just be the sparse count matrix from the Visium analysis.

You can first create a SpatialExperiment object:

se <- SpatialExperiment::read10xVisium(samples = f,
     type = "sparse",
     data = "filtered")
se

where f is the path to the folder that contains the Visium feature matrices.

Then you can extract the counts matrix:

## this is the genes x barcode sparse count matrix
cd <- se@assays@data@listData$counts

I'm sure Giotto probably has a way in which you can access the Visium raw counts matrix you used to make the object initially, I'm just not as familiar with the structure of Giotto objects to know off the top of my head. But if you are able to share some code showing how you made the Giotto object this might also be helpful.

Thanks! Brendan

chrkuo commented 1 year ago

@bmill3r

Thank you so much. I spent the whole morning going through the documentation of for 10x and it's been quite helpful but I ran into an error with a few questions.

the giotto code that I use to generate the giotto object is below:

data_path_EWS4512_h5 = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/raw_feature_bc_matrix.h5"

data_path_EWS4512_tissue_positions = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/tissue_positions_list.csv"

data_path_EWS4512_image = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/tissue_hires_image.png"

data_path_EWS4512_scalefactors = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/scalefactors_json.json"

create Giotto objects

object has data in data frames - in columns and rows

EWS4512 <-createGiottoVisiumObject(h5_visium_path = data_path_EWS4512_h5, h5_tissue_positions_path = data_path_EWS4512_tissue_positions, h5_image_png_path = data_path_EWS4512_image, h5_json_scalefactors_path = data_path_EWS4512_scalefactors, instructions = instrs)

I would appreciate it if there's a way to somehow utilize the Giotto object to run STDeconvolve?

Regarding the initial loading of data to run STDeconvolve: are we using raw unnormalized/unfiltered data set and plug it into STDeconvolve since there's clustering at the end etc. or are we using already processed normalized/scaled/filtered data into STDeconvolve?
for some reason after I visualize the expression of some of the genes the output graphs are all varying sizes compared to your graph listed in the documentation (every graph was evenly squared in yours). example given here:

I'm wondering if there's a way to make it the same size or bigger or perhaps just see first 4 genes?

I am hitting an error code: Error in data.frame(emb1 = pos[, "x"], emb2 = pos[, "y"], Cluster = tempCom) : arguments imply differing number of rows: 1437, 1375

right after trying to create the data.frame using the louvian clustering.

Finally, let’s use louvian clustering to assign the barcodes into 15 communities.

k <- 35 com <- MERINGUE::getClusters(pcs, k, weight=TRUE, method = igraph::cluster_louvain)

Let’s visualize the communities in terms of the spatial positions of the barcodes:

tempCom <- com dim(emb)

dat <- data.frame("emb1" = pos[,"x"], "emb2" = pos[,"y"], "Cluster" = tempCom) do you have any suggestions on how to work through this error? it seems like the emb matrix has 1375 rows but the pos has 1437 rows so they are not matching up. wondering why this is the case

I see that there are 15 cell types currently identified in my tissue - how do i ultimately figure out what each cell types are? It seems like the documentation for 10x visium ends at generating graphs but doesn't go into annotation of these cell types.

Thank you so much for your time and assistance and I am a clinician just getting into bioinformatics so i sincerely apologize if a lot of these questions are basic code questions that I just don't know the answers to.

bmill3r commented 1 year ago

Hi @chrkuo,

It looks like you have made some progress with STdeconvolve and you have some interesting cell type patterns in your data! In terms of annotating them, for now I'll point you towards: Annotating deconvolved cell-types to help give you some strategies for making sense of them.

As for your other questions, I'll try to get to them tomorrow, but given that you were able to deconvolve cell types, it seems that loading the data into STdeconvolve is no longer an issue? I agree it would be nice to add Giotto to our list of objects we can interface with. Right now I've only gotten around to doing this for Seurat and SpatialExperiment objects.

In terms of the initial loading of data, you want the unnormalized/unfiltered data, which is non-negative integer counts of genes in the barcodes. The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation requires frequency counts of words or terms, specified as a matrix of nonnegative integers. In this case, our terms are genes, but the same idea holds. LDA looks for terms that form clusters in each barcode or capture location but the read depths don't need to be normalized as one might do for scRNA-seq data via CPM normalization, for example. STdeconvolve does have a pre-filtering step to remove poorly captured or non-informative genes.

Will look into the plotting and other errors tomorrow.

Hope this helps for now, Brendan

chrkuo commented 1 year ago

@bmill3r thank you so much.

I hope it's okay that I add a few more questions in the mean time.

Aside from filtering via pixel are there any additional methods to filter?

Is there a way to graph out the fitted model K's vs. deconvolved cell types and perplexity so I can select the best K? It wasn't in the documentation

My current data has quite high perplexity although the # of cell type on the left has been minimal (I tested various Ks up to 30 and seems like the perplexity ranges around 300-400). Is this just the inherent nature of tumor? Does this reflect poor quality?

It seems like the pixel positions object (pos) has 1437 rows and the emb object has 1375. Is this discrepancy due to the fact that some of the pixels have genes that were filtered out?

I played around a bit and used this code to circumvent the error:

pos_subset <- pos[tempCom,] dat <- data.frame("emb1" = pos_subset[, "x"], "emb2" = pos_subset[, "y"], "Cluster" = tempCom)

But the graph I got was extremely erroneous

I'm only graphing out 1 point per cluster which is extremely weird. And it's not on my tissue. seems like there are duplicates in the data frame?

Thank you again - I realize this became more of a trouble shooting instead of Giotto Object question now.

chrkuo commented 1 year ago

@bmill3r wanted to check in and follow up

Sorry I know i asked a few questions. Truly appreciate your time and assistance!!!

bmill3r commented 1 year ago

Hi @chrkuo

No problem about the questions - here I'll try to go through them based on my understanding of what I think you're asking:

I would appreciate it if there's a way to somehow utilize the Giotto object to run STDeconvolve?

STdeconvolve doesn't have a way to directly interface with Giotto objects, but it seems that you were able to obtain the pixels (i.e., barcodes) x genes count matrix and supply this as input into STdeconvolve to obtain predicted cell type proportions in the pixels

Regarding the initial loading of data to run STDeconvolve: are we using raw unnormalized/unfiltered data set and plug it into STDeconvolve since there's clustering at the end etc. or are we using already processed normalized/scaled/filtered data into STDeconvolve?

In terms of the initial loading of data, you want the unnormalized/unfiltered data, which is non-negative integer counts of genes in the barcodes. The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation requires frequency counts of words or terms, specified as a matrix of nonnegative integers. In this case, our terms are genes, but the same idea holds. LDA looks for terms that form clusters in each barcode or capture location but the read depths don't need to be normalized as one might do for scRNA-seq data via CPM normalization, for example. STdeconvolve does have a pre-filtering step to remove poorly captured or non-informative genes.

for some reason after I visualize the expression of some of the genes the output graphs are all varying sizes compared to your graph listed in the documentation

I agree that is odd behavior but likely has something to do with the ggplot2 parameters when drawing the different panels. There are a lot of parameters for the plotting so it's hard for me to diagnose this without seeing your code. Are you able to provide the code you ran to produce this plot?

Depending on how you are generating this plot, there are several ways you can make it larger. If you are trying to save it as a pdf, you can do something like:

ggplot2::ggsave(ggplot2::ggsave(filename = "genes.pdf",
                device = "pdf",
                scale = 1,
                width = 8,
                height = 8,
                units = c("in"))

width anf height can control the dimensions of the file plot.

Also, for example in the

gridExtra::grid.arrange(
  grobs = ps,
  layout_matrix = rbind(c(1, 2, 3, 4),
                        c(5, 6, 7, 8),
                        c(9, 10, 11, 12),
                        c(13, 14, 15, 16))
)

section of the code in the Analysis of 10X Visium data tutorial, you could try specifying plot heights and widths by adding parameters heights and widths (https://www.rdocumentation.org/packages/gridExtra/versions/2.3/topics/arrangeGrob)

If you just want to see the first 4 genes for example, then in this same plotting example in the 10X tutorial, markerGenesis just a vector genes that the function is looping through to generate panels for. So you can just restrict this vector to the first 4 genes: markerGenes[1:4].

Alternatively, you can also just run the function vizGeneCounts() by itself, and explicitly state what gene you would like to generate a plot for:

vizGeneCounts(df, gene, size = 2, stroke = 0.1, plotTitle = gene, winsorize = 0.05, showLegend = TRUE)

I am hitting an error code:

Error in data.frame(emb1 = pos[, "x"], emb2 = pos[, "y"], Cluster = tempCom) : arguments imply differing number of rows: 1437, 1375

do you have any suggestions on how to work through this error? it seems like the emb matrix has 1375 rows but the pos has 1437 rows so they are not matching up. wondering why this is the case

To get the communities from transcriptionally clustering the pixels, the input is the PCs. From the 10X Visium example:

pcs.info <- stats::prcomp(t(log10(as.matrix(counts)+1)), center=TRUE)
nPcs <- 7 ## let's take the top 5 PCs
pcs <- pcs.info$x[,1:nPcs]

Recall that counts is the matrix after filtering:

counts <- cleanCounts(cd, min.lib.size = 100, min.reads = 10)

So my guess is that pos are the positions of all barcodes in your datasets but counts, and therefore pcs and emb are after filtering and why they have less rows.

If counts is a barcodes x genes matrix, you can get the positions of the barcodes after filtering via:

pos[rownames(counts),]

I see that there are 15 cell types currently identified in my tissue - how do i ultimately figure out what each cell types are? It seems like the documentation for 10x visium ends at generating graphs but doesn't go into annotation of these cell types.

I would recommend checking out the Annotating deconvolved cell-types tutorial.

Aside from filtering via pixel are there any additional methods to filter?

The function cleanCounts removed both pixels with low numbers of genes, and genes that have been poorly captured in the datasets.

restrictCorpus() filters the genes to those that are overdispersed, which we assume are the most informative genes and recommend using as a proxy for cell type genes when trying to detect cell types in the data using STdeconvolve.

Once you've picked an optimal LDA model and obtain the beta and theta matrices, topGenes() returns the top expressed genes for each deconvolved cell type. You can specify how many top genes per cell type but the default is 10.

Is there a way to graph out the fitted model K's vs. deconvolved cell types and perplexity so I can select the best K? It wasn't in the documentation

Yes - in the tutorial Getting started with STdeconvolve you'll see that typically one will fit multiple LDA models to the data using the function fitLDAs(). Setting plot=TRUE will return a plot indicating the perplexity of each fitted model and the number of rare cell types. This plot is useful to help choose the optimal LDA model for the dataset being analyzed.

My current data has quite high perplexity although the # of cell type on the left has been minimal (I tested various Ks up to 30 and seems like the perplexity ranges around 300-400). Is this just the inherent nature of tumor? Does this reflect poor quality?

In general, perplexity is how well the model fits the data where the lower the perplexity, the better. However, when looking at a specific dataset, the absolute perplexity range doesn't matter that much - it's more about choosing a model with the lowest perplexity while balancing a relatively low number of rare cell types. As the K increases, perplexity tends to decrease, but the number of rare cell types also increases, which suggests over splitting of the data. So it's a balance between these two metrics but one that each user will ultimately need to decide on.

That said, your observed perplexity range of 300-400 is higher than what I have seen more more organized tissues like brain samples, so perhaps this is also reflective of the inherent heterogeneous and undifferentiated state of cancer cells (with less distinct cell types, it becomes harder for the model to deconvolve them). That said, it looks like the model is still able to deconvolve distinct cell types so the data is of sufficient quality and there are distinct cell type signatures that are detectable.

It seems like the pixel positions object (pos) has 1437 rows and the emb object has 1375. Is this discrepancy due to the fact that some of the pixels have genes that were filtered out?

See response to question 4.

Hope this helps, Brendan

chrkuo commented 1 year ago

@bmill3r

that was extremely helpful and thorough explanation thank you.

it seems like the more heterogeneous my tissue is the harder it is to deconvolve them - this assumption is correct right?

it seems like there are two methods of annotating cell types: the first is through transcriptional correlation with "ground truths" - my question is though. is the ground truths what I deem as ground truths or it's a reference data set that I upload and use? If it is what I deem to be ground truths then how do i "generate" the matrix? the annotation vignette link doesn't show how you generated the annotated cell matrix - it would be helpful if there's a vignette on how these matrices are generated.

The second strategy is through GSEA - again using reference gene sets that I define as a specific cell type ie. t cell or macrophages etc - similar question how do i generate such data is there vignette or documentation in order to incorporate into the annotation step?

My other question is - even though i identified 15 cell types as shown here:

the end correlation matrix i generated seems to have only included 10 cell types - is the program inherently eliminating other cell types? and if so what's the reason ?

Thank you so much and I am extremely thrilled that this tool has been so streamlined and efficient to use. (quite detailed documentation as well)!!

chrkuo commented 1 year ago

@bmill3r wanted to follow up regarding how to generate such matrix or gene sets for deconvolution annotation. I'm sort of stuck at this step. would really appreciate it if there's some step by step guidelines on how to generate such matrix between transcriptional approach and GSEA approach.

I've explored other vignettes/open issues/blog linked here as well (https://jef.works/blog/2022/08/30/annotating-STdeconvolve-cell-types-with-asctb-tables/) but all of them see to upload some sort of reference matrix.

If I have a set list of genes ie. Macrophages = c("PTPRC", "CD163", "MSR1", "CD14", "CD68", "AIF1", "CSF1R", "CD69", "APOC1") M1_Macrophage_General = c("IFNG", "IL1B", "IL6", "IL12", "IL15", "IL18", "IL23", "TNF", "NOS2", "IDO1", "FAM26F", "CXCL9", "CXCL10", "CXCL11") M2_Macrophage_metastasis = c("EGF", "CSF1", "CCL18", "TGFB", "MMP2", "MMP3", "MMP7", "MMP9", "CD163", "MSR1", "MRC1", "TGM2", "FABP4", "CCL24", "CCL26")
M2_TumorGrowth = c("IL6", "CXCL8", "IL10", "IL17", "IL23") M2_Angiogenesis = c("VEGF", "CXCL8", "PDGF", "MMP2", "MMP3", "MMP7", "MMP9", "ANG2", "CCL5", "CCL2") M2_Immunosuppression = c("ARG1", "MRC1", "ARG2", "IL10", "TGFB", "CCL17", "CCL18", "CCL22") Monocyte_M2 = c("CSF1", "CCL2", "IL4", "IL13", "IL10", "TGFB") M2_polarization = c("IL4", "JAK1", "STAT6", "PI3K", "PIK3CA", "AKT1", "MYC", "PPARG")

can i use this to use as ground truths? or as GSEA. if so how do i do so?

bmill3r commented 1 year ago

Hi @chrkuo

To answer your questions:

it seems like the more heterogeneous my tissue is the harder it is to deconvolve them - this assumption is correct right?

It is not the heterogeneity but more the transcriptional "distinctness" of the cell types. STdeconvolve is looking for groups of genes that occur together at high probability and uses these groups to identify topics, or cell types. Ideally, there will be genes that specifically cluster together to form the cell type topics we're looking for among the capture locations. In reality, however, while a group of genes may occur together at high probability, they also probably occur at some frequency with genes of other clusters. Cases where genes are spread across, or co-occur with lots of other genes, make it harder for the model to assign genes to specific topics with high confidence.

If we think about this in terms of cell types, cell types that have very distinct transcription profiles will likely have specific genes that strongly co-occur with each other (i.e., the genes being expressed specifically by the given cell type) but not with other genes expressed by other cell types. These will be easier for the model to detect, assuming that these cell type marker genes were efficiently captured in the spatial experiment. Likewise, cell types that are transcriptionally similar will likely share many genes that co-occur together across cell types, making it more difficult for LDA to assign these genes into specific groups to identify these cell types.

Some cancers can be undifferentiated, and these cells, having been transformed into a less distinct transcriptional state, may be more difficult for STdeconvolve to detect. Indeed, we demonstrate this idea here: Examples of when STdeconvolve may fail.

is the ground truths what I deem as ground truths or it's a reference data set that I upload and use? If it is what I deem to be ground truths then how do i "generate" the matrix?

Because STdeconvolve identifies cell types in a reference-free manner, users can use any method(s) they would like to identify the deconvolve cell types. If you have a cell type x gene count matrix for other known cell types, then you can correlate the deconvolved cell type transcriptional profiles (i.e., the beta matrix output from STdeconvolve) with the transcriptional profiles of the known cell types. You can get the deconvolved transcriptional profiles from getBetaTheta(). Then, you can get the correlations between both matrices with the function:

getCorrMtx(m1 = beta,
m2 = as.matrix(countMatrix), ## this is the cell type x gene count matrix for other known cell types
type = "b"

In the Annotating deconvolve cell-types tutorial, there is no reference for the mOB dataset, but for demonstration purposes I create one for the tissue layers just to show how one can perform a Pearson correlation analysis between the deconvolved cell types and some other matrix of cell type transcriptional profiles. A scRNA-seq dataset should have this gene count matrix for the individual cells. With cell type annotations, you can sum together the gene counts for cells belonging to the same cell type.

again using reference gene sets that I define as a specific cell type ie. t cell or macrophages etc - similar question how do i generate such data is there vignette or documentation in order to incorporate into the annotation step?

As you've seen, instead of performing correlations between transcriptional profiles, one can look for enrichment of certain cell type marker genes in the deconvolved transcriptional profiles to help identify the deconvolved cell types.

In the Annotating deconvolve cell-types tutorial, for demonstration purposes, I use the top log2 fold-change genes for each tissue layer of the mOB dataset to generate my list of marker genes, which is used as the known gene set input for the function annotateCellTypesGSEA(). This is essentially just a list, where each entry is a vector of genes, and the entry name is some cell type name that the genes are markers for.

It actually looks like you are almost there with your provided gene sets for the different cell types you have provided. You can just turn that into a list via:

gset <- list(
Macrophages = c("PTPRC", "CD163", "MSR1", "CD14", "CD68", "AIF1", "CSF1R", "CD69", "APOC1"),
M1_Macrophage_General = c("IFNG", "IL1B", "IL6", "IL12", "IL15", "IL18", "IL23", "TNF", "NOS2", "IDO1", "FAM26F", "CXCL9", "CXCL10", "CXCL11"),
M2_Macrophage_metastasis = c("EGF", "CSF1", "CCL18", "TGFB", "MMP2", "MMP3", "MMP7", "MMP9", "CD163", "MSR1", "MRC1", "TGM2", "FABP4", "CCL24", "CCL26"),
M2_TumorGrowth = c("IL6", "CXCL8", "IL10", "IL17", "IL23"),
M2_Angiogenesis = c("VEGF", "CXCL8", "PDGF", "MMP2", "MMP3", "MMP7", "MMP9", "ANG2", "CCL5", "CCL2"),
M2_Immunosuppression = c("ARG1", "MRC1", "ARG2", "IL10", "TGFB", "CCL17", "CCL18", "CCL22"),
Monocyte_M2 = c("CSF1", "CCL2", "IL4", "IL13", "IL10", "TGFB"),
M2_polarization = c("IL4", "JAK1", "STAT6", "PI3K", "PIK3CA", "AKT1", "MYC", "PPARG")
)

the end correlation matrix i generated seems to have only included 10 cell types - is the program inherently eliminating other cell types? and if so what's the reason ?

It looks like you have been following the Analysis of 10X Visium data tutorial. Note that the section that contains the plot you have provided is part of the analysis belonging to the "Compare to transcriptional clustering" section. This part is not about annotating the deconvolved cell types in the Visium Cortex data but rather comparing and contrasting the deconvolved cell types to the traditional way of just transcriptionally clustering the barcodes of the dataset. You can learn more about these differences in another demo we provide here.

Hope this helps clarify things, Brendan

chrkuo commented 1 year ago

Thank you so much Brendan @bmill3r, as always really appreciate your prompt and thoughtful responses. This has been extremely helpful. and yes your answers absolutely makes sense.

As I work through it I have a few more questions (apologize in advance).

I generated a graph looking at perplexity and cell types with mean proportion <5% and I don't see a gray zone area indicating alpha >1 which means that any of these k values that I pick should be good to explore correct?

I picked a K of 12 and subsequently had 12 celltypes identified as shown here:

As you have suggested I have been able to curate a list of genes that I deem to be cell type specific and input them into a matrix as such:

# Create a nested list
cell_genes_nested <- list(
  "Ewing_sarcoma" = list(
    "character" = c("CD99", "NKX2-2", "FLI1", "FLI-1")
  ),
  "CD99" = list("character"=c("CD99")
                ),
  "NKX" = list("character"=c("NKX2-2")
               ),
  "EwS_Dormant_short" = list(
    "character" = c("CD99", "NKX2-2", "FLI1", "FLI-1", "GBA", "CD63", "GABARAPL2", "SEC61B", "RTN4", "CLIMP-63", "TGFBR2", "TGFBR3", "CDKN2A", "FN1", "SERPINE2", "TIMP1", "ITGA4",
                    "CALD1", "COL3A1", "COL5A2", "GSN", "MYL6", "MYL12A", "SPARC", "TWIST1", "VCL", "MAPILC3B", "MAPILC3B2", "CDKN1A", "FAM134B")
  ),
  "EwS_Stem_short" = list(
    "character" = c("CD99", "NKX2-2", "FLI1", "FLI-1", "CELSR2", "CD40", "ALDH1A1", "ALDH2", "BRIX1", "CCND1", "CDC4", "CDK1", "COL21A1", "CTNNA1", "CXCR4", "DLL3", "FGFR1",
                    "FN1", "FZD1", "KAT8", "KIT", "MYC", "NEUROD6", "PARD6B", "PTEN", "SOX2", "S1")
  ),
  "NT5E_List" = list(
    "character" = c("NT5E", "BGN", "CAMK2D", "CD63", "COL1A1", "COL1A2", "COL5A2", "COL6A2", "CTSB", "CTSZ", "EMP1", "FBN1", "FSTL1", "GPNMB", "LGMN", "MYOF", "NID1", "PLAC9", "PPIC", "PRSS23", "S100A10", "S100A11", "S100A13", "S100A4", "SPARC", "TGFB1I1", "THY1", "TNC")
  ),
  "NT5E_EwS" = list(
    "character" = c("CD99", "NKX2-2", "FLI1", "FLI-1", "NT5E")
  ),
  "B_cells" = list(
    "character" = c("PTPRC", "MS4A1", "CD19", "CD79A", "CD79B", "CD52", "BANK1", "IGHD", "IGHM", "CD69", "CD83")
  ),
  "Plasma_cells" = list(
    "character" = c("PTPRC", "CD38", "SDC1")
  ),
  "T_cells" = list(
    "character" = c("PTPRC", "CD3D", "CD3E", "CD3G", "CD4", "CD8A", "CD8B", "FOXP3", "IL2RA", "CD28")
  ),
  "NK_cells" = list(
    "character" = c("PTPRC", "KLRK1", "NCR1")
  ),
  "Macrophage_M1_KB" = list("character"=c("SPP1", "FTL", "APOC1", "MT1X", "FABP5", "LGALS1", "FTH1", "MT1G", "APOE", "CSTB")
),
  "Macrophage_M2_KB" = list("character"=c("PDK4", "RNASE1", "SELENOP", "C1QA", "CCL3", "C1QB", "APOE", "FOS", "SLC40A1", "PSAP")
                            ),
  "CD8_Tcell"=list("character"=c("NKG7", "CCL5", "GZMA", "GZMK", "CCL4", "GNLY", "CST7", "IL32", "CTSW", "CD3D")
                   ),
  "Activated_CD4Tcell" =list("character"=c("CD69", "BTG1", "RPS27", "RPL3", "RPS3", "RPL13", "TRAC", "RPS18", "RPS29", "CD52")
                             ),
  "cDC2" = list("chracter"=c("FCER1A", "HLA-DQA1", "HLA-DQB1", "HLA-DRA", "HLA-DPB1", "HLA-DQB2", "HLA-DPA1", "HLA-DRB1", "LYZ", "S100B")
  ),
  "plasmacytoid Dendritic" =list("character"=c("IGKC", "JCHAIN", "GZMB", "PTGDS", "IRF7", "PLAC8", "GPR183", "CLIC3", "ALOX5AP", "IRF8")
                                 ),
  "Monocyte_KB" = list("character"=c("LST1", "FCN1", "SMIM25", "IFITM2", "COTL1", "CTSS", "FCGR3A", "SERPINA1", "S100A6", "S100A4")
                       ),
  "cDC1" = list("character"=c("CPVL", "HLA-DPB1", "HLA-DQA1", "HLA-DPA1", "SNX3", "IRF8", "LYZ", "HLA-DQB1", "HLA-DRB1", "DNASE1L3")
                ),
  "Monocytic_MDSC" = list(
    "character" = c("PTPRC", "ITGAL", "ITGAM", "ITGAX", "ITGAD", "FUT4")
  ),
  "Promyelocytic_MDSC" = list(
    "character" = c("PTPRC", "FUT4")
  ),
  "Monocytes" = list(
    "character" = c("PTPRC", "CD14", "FCGR3A", "FCGR3B", "TIMP1", "CD44", "G0S2")
  ),
  "Neutrophils" = list(
    "character" = c("PTPRC", "CEACAM1", "CEACAM6", "CEACAM3", "CEACAM8", "CEACAM5", "PSG1")
  ),
  "Dendritic_cells" = list(
    "character" = c("PTPRC", "CD1A")
  ),
  "Endothelial_cells" = list(
    "character" = c("PECAM1", "FLT1", "PTPRB", "EGFL7")
  ),
  "Pericytes" = list(
    "character" = c("ACTA2", "CSPG4", "PDGFRB", "RGS5")
  ),
  "CAF" = list(
    "character" = c("ACTA2", "ENG")
  ),
  "Osteoblasts" = list(
    "character" = c("ALPL", "COL1A1", "SPP1", "IBSP", "BGLAP")
  ),
  "Schwann_cells" = list(
    "character" = c("MBP", "PMP22", "SOX10")
  ),
  "CD8Macrophage" = list(
    "character" = c("CD163", "CD8", "MRC1")
  ),
  "Macrophages" = list(
    "character" = c("PTPRC", "CD163", "MSR1", "CD14", "CD68", "AIF1", "CSF1R", "CD69", "APOC1")
  ),
  "M1_Macrophage_General" = list(
    "character" = c("IFNG", "IL1B", "IL6", "IL12", "IL15", "IL18", "IL23", "TNF", "NOS2", "IDO1", "FAM26F", "CXCL9", "CXCL10", "CXCL11")
  ),
  "M2_Macrophage_metastasis" = list(
    "character" = c("EGF", "CSF1", "CCL18", "TGFB", "MMP2", "MMP3", "MMP7", "MMP9", "CD163", "MSR1", "MRC1", "TGM2", "FABP4", "CCL24", "CCL26")
  ),
  "M2_TumorGrowth" = list(
    "character" = c("IL6", "CXCL8", "IL10", "IL17", "IL23")
  ),
  "M2_Angiogenesis" = list(
    "character" = c("VEGF", "CXCL8", "PDGF", "MMP2", "MMP3", "MMP7", "MMP9", "ANG2", "CCL5", "CCL2")
  ),
  "M2_Immunosuppression" = list(
    "character" = c("ARG1", "MRC1", "ARG2", "IL10", "TGFB", "CCL17", "CCL18", "CCL22")
  ),
  "Monocyte_M2" = list(
    "character" = c("CSF1", "CCL2", "IL4", "IL13", "IL10", "TGFB")
  ),
  "M2_polarization" = list(
    "character" = c("IL4", "JAK1", "STAT6", "PI3K", "PIK3CA", "AKT1", "MYC", "PPARG")
  )
)

What is interesting is that the output seem to show a few correlated "GSEA" types as how I annotated. but certain "topics" were NA.

even though topic 3 is "NA" when I type in this code to look at celltype_annotations$results$3`` there is an output that states

                                                                       p.val     q.val    sscore       edge

M2_Macrophage_metastasis.character 0.00149985 0.0059994 -1.166917 0.03693869

Do you know why there is a discrepancy if I can call out the results for topic 3 why isn't it listed? the P-value seems to be significant as well?

My other question is that there are several topics that are NA including: 3, 8, 9, 10, 12: are these not identifiable because I don't have a matching set list of genes that can best correlate with the ouput? is there a way to look at perhaps the top 5-10 differentially expressed genes in these topics and determine what these cells are sort of treating it like scRNAseq data?

The GSEA that I am inputting if certain genes are not expressed within that gene set it will still pick up within the topic right? for example "CD8Macrophage" = c("CD163", "CD8", "MRC1") if CD163 and CD8 are not expressed within the pixel but MRC1 is highly expressed then it will still list it as CD8 Macrophage. Or is it looking for conditions where all 3 are highly expressed?

Lastly, topic 1, 4, 6, 7 all seem to indicate the same "cell type" why is that?

bmill3r commented 1 year ago

Hi @chrkuo

I generated a graph looking at perplexity and cell types with mean proportion <5% and I don't see a gray zone area indicating alpha >1 which means that any of these k values that I pick should be good to explore correct?

Alpha is a parameter governing the shape or concentration of the Dirichlet distribution in the fitted model. An alpha < 1 means that the cell type probability distributions of the capture locations are concentrated towards the "edges" of the Dirichlet, and so the capture locations are expected to be enriched in some of the cell types but not all of them. This is what we want and you can get a better visual sense of what I mean by this here: Examples of when STdeconvolve may fail.

Because all of the different models have an alpha <1, this is a good sign that there are distinct cell types that are heterogeneously distributed throughout the capture locations that the model is picking up on and you could use any of the models with different Ks. However, the lower the perplexity, the better the model fits the data and so picking a model with a lower perplexity is recommended. However, this is also balanced between the increasing number of rare cell types, and an increase in the number of rare cell types is an indicator that the model might be overfitting the data. So it is recommended that you pick a model with a K that balances a decreasing perplexity with an increasing number of rare cell types. Your choice of K=12 is a reasonable one.

Do you know why there is a discrepancy if I can call out the results for topic 3 why isn't it listed? the P-value seems to be significant as well?

I believe this is because the sscore (enrichment score) is negative. This means that the cell type is significantly depleted in M2 Macrophage genes. You can read more about gene set enrichment analysis here:

Subramanian, Tamayo, et al.

and in the package this function is built around here: liger

My other question is that there are several topics that are NA including: 3, 8, 9, 10, 12: are these not identifiable because I don't have a matching set list of genes that can best correlate with the ouput? is there a way to look at perhaps the top 5-10 differentially expressed genes in these topics and determine what these cells are sort of treating it like scRNAseq data?

There might be several reasons why they are not positively enriched in any of the gene sets you tested. But you can definitely look at the top differentially expressed genes in these topics to get a better sense of what cell types they might be representing. You might want to check out the tutorial Getting Started with STdeconvolve. There's a section where we look at the genes with the top log2FC for different devonvolved cell types. Additionally, we provide a function:

topGenes(beta)

which will return the top 10 genes for each topic, where beta is the beta matrix of deconvolved transcriptional profiles.

The GSEA that I am inputting if certain genes are not expressed within that gene set it will still pick up within the topic right? for example "CD8Macrophage" = c("CD163", "CD8", "MRC1") if CD163 and CD8 are not expressed within the pixel but MRC1 is highly expressed then it will still list it as CD8 Macrophage. Or is it looking for conditions where all 3 are highly expressed?

I would recommend looking at the original GSEA paper to get a better idea as to how the algorithm is looking for gene set enrichment in the data.

Lastly, topic 1, 4, 6, 7 all seem to indicate the same "cell type" why is that?

It is likely that the top significantly enriched gene set for each of these topics was the same. But it might be worth looking at the entire gene set enrichment results for each of these topics to see if there were some cell type gene sets that were very close in terms of the GSEA results. Remember that the analysis is trying to find the most similar cell type based on the provided genes in the gene set but there could be several cell types that were significantly enriched. Also, maybe these topics are capturing different cell states or subtypes. For example, maybe several topics are enriched in "neuronal" genes, but they each could be different neuronal subtypes. It depends on how specific the genes in the gene sets are and how well represented they are in the deconvolved transcriptional profiles.

Hope this helps a bit, Brendan

odunola26 commented 2 weeks ago

Hello,

Please, I am trying to annotate my cells using the GSEA function with I list I generated myself as seen below, but it is throwing errors: Error in gset[[celltype]]$character : $ operator is invalid for atomic vectors

Could you please advise on how I can fix this problem?

Create a nested list

cell_genes <- list(

Macrophages = c("PTPRC", "CD163", "MSR1", "CD14", "CD68", "AIF1", "CSF1R", "CD69", "APOC1"),

Monocytes = c("PTPRC", "CD14", "FCGR3A", "FCGR3B", "TIMP1", "CD44", "G0S2"),

NK_cells = c("PTPRC", "KLRK1", "NCR1"),

Dendritic_cells = c("PTPRC", "CD1A", "HLA-DRA", "HLA-DRB1", "CD80", "CD86"),

B_cells = c("PTPRC", "MS4A1", "CD19", "CD79A", "CD79B", "CD52", "BANK1", "IGHD", "IGHM", "CD69", "CD83"),

CD4_T_cells = c("PTPRC", "CD3D", "CD3E", "CD3G", "CD4", "IL2RA", "CD28", "FOXP3"),

CD8_T_cells = c("PTPRC", "CD3D", "CD3E", "CD3G", "CD8A", "CD8B", "GZMA", "GZMB", "PRF1"),

Treg = c("FOXP3", "CD25", "CTLA4", "IL2RA", "IKZF2", "IKZF4"),

Endothelial_cells = c("PECAM1", "FLT1", "PTPRB", "EGFL7", "VWF", "CDH5"),

Skeletal_cells = c("MYOD1", "MYOG", "DES", "ACTA1", "MYH3"),

Fibroblasts = c("ACTA2", "COL1A1", "COL1A2", "FAP", "PDGFRA", "PDGFRB"),

Adipocytes = c("FABP4", "LEP", "ADIPOQ", "PPARG", "PLIN1"),

HSC = c("HPCA1", "PROM1", "KIT", "THY1", "ENG"),

Epithelial_cells = c("EPCAM", "KRT8", "KRT18", "CDH1", "MUC1")

)

Perform annotation

celltype_annotations <- annotateCellTypesGSEA(beta = gexp, gset = cell_genes, qval = 0.05)

Error in gset[[celltype]]$character : $ operator is invalid for atomic vectors

Thank you.

Regards, Odunola