Closed inhyeoklee closed 1 year ago
Hi @inhyeoklee,
Thanks for using STdeconvolve
! I haven't used Azimuth enough to give that much advice on how to implement it, but I believe it's primary purpose is reference-based annotation of single cell RNA-seq datasets. If you're trying to use it to annotate the deconvolved cell types then theoretically you could use the beta
matrix of the deconvolved cell type transcriptional profiles, but these are the profiles of the cell types and not individual cells. My concern is that using this as input and treating the cell types as individual "cells" as input may not be a sufficient dataset for annotation. Alternatively, you could apply Azimuth to the individual spots but this wouldn't be appropriate given that the spots are assumed to be multi-cellular. We do provide tutorials to give some examples of how users might go about annotating the deconvolved cell types: Annotating deconvolved cell-types.
Hope this helps a little, Brendan
Thank you Brendan! That actually really helped my understanding of the STdeconvolve algorithm. Would that be why we use results$beta instead of results$theta (which shows expression levels) because the reference-free deconvolution inherently comes with the hypothesis that a certain cell type would have distinct, and yet uniform gene expressions (which are not necessarily true in single-cell rna-seq data)?
Also, if I'm not exactly sure which cell types I have (unlike the 5 predefined tissue layer labels in the tutorial), can I instead use the entire human cell type - top canonical marker genes reference table available on PanglaoDB?
Hi @inhyeoklee,
STdeconvolve outputs two matrices. The first is the beta
matrix, which is a matrix of the deconvolved cell type transcriptional profiles. Basically it is a cell type x gene matrix that contains the probability of a deconvolved cell type expressing a given gene. The second is the theta
matrix, which is the proportion of each cell type in each of the spots or capture locations.
STdeconvolve is built on latent Dirichlet allocation, which is looking for clusters of genes that are co-expressed together across the capture locations, so it works best when cell types have distinct transcriptional profiles, as in certain genes are specifically expressed together in different cell types. If cell types have very similar transcriptional profiles, then it will be difficult to identify clusters of genes that are co-expressed specifically together and accurately deconvolve the contributions of the different cell types.
In terms of identifying the deconvolved cell types, you can certainly explore their deconvolved transcriptional profiles and see if the top expressed genes are marker genes for known cell types annotated in databases such as PanglaoDB. A more statistical approach, which we tried to demonstrate, is looking for enrichment of cell type marker genes using gene set enrichment analysis. But this is just one suggestion - ultimately it will be up to users to decide how they would like to annotate the deconvolved cell types.
Hope this helps! Brendan
Sorry about the confusion on beta and theta matrices. And thank you Brendan - it makes sense to use gene set enrichment analysis to label cell types rather than identifying them by comparing what top marker genes both the ground truth table and my deconvolved cell types have. For gene set enrichment analysis, would it be any beneficial to use the ground truth table of all the human cell types instead of just a couple of cell types expected?
Best, Daniel
Hi @inhyeoklee
Because the GSEA function in STdeconvolve
assesses enrichment of each cell type gene set iteratively I don't see any reason why you couldn't use all human cell types. However, from an interpretation standpoint, some of the cell type hits may be cell types you would not expect to be in the tissue you are analyzing. So some pre-filtering of the cell type gene sets down to those you would expect to be in your tissue sample is probably a good idea.
Hope this helps, Brendan
Thank you for clarifying the GSEA inputs! I've managed to run the GSEA, but I'm a little confused about its results (shared below). Could you please give any pointers on how I might obtain better classification results?
Here's the prediction results I've got:
And, here's the steps I took:
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") ensembl_gene_mapping <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"), mart = ensembl)
convert_ensembl_to_symbol <- function(ensembl_id, mapping) { gene_symbol <- mapping$external_gene_name[match(ensembl_id, mapping$ensembl_gene_id)] return(gene_symbol) }
colnames(results$beta) <- sapply(colnames(results$beta), convert_ensembl_to_symbol, mapping = ensembl_gene_mapping)
Here's the reference gene set list I prepared: gset <- list( Adipocytes = c("METRNL", "MYO1C", "METRN", "MMP2", "NQO1", "COL6A2", "SYAP1", "CRNDE", "CAVIN1", "MT2A"), B_cells = c("CD79A", "CD74", "HLA-DRA", "CD79B", "CD52", "IGHM", "IGKC", "CXCR4", "CD69", "BCL11A"), B_cells_memory = c("CD80", "CCR6", "CLCA3P", "CXCR5", "FCRL2", "GUSBP11", "IFNA10", "IGLL3P", "NMBR", "PTPRCAP"), B_cells_naive = c("MS4A1", "CD79B", "IGHM", "IGKC", "LTB", "CD37", "CD52", "CD74", "LINC00926", "TNFRSF13C"), Basal_cells = c("HEBP2", "KRT17", "PLP2", "ACADVL", "S100A14", "ITGA2", "ANXA11", "KRT5", "KRT15", "LAMB3"), Basophils = c("IL3RA", "CCR3", "ENPP3", "ITGB7", "IL4", "HTR1B", "GRM6", "CEBPA", "E2F8", "MBOAT1"), Cardiac_stem_and_precursor_cells = c("MYH6", "FUT4", "TBX4", "TBX18", "NKX2-5", "KIT", "ADGRL2", "GATA4", "ABCG2", "KITLG"), Cardiomyocytes = c("NPPA", "MYL2", "RBM20", "RBM24", "MYH7B", "TCAP", "RYR2", "MYL3", "MYH6", "JPH2"), Dendritic_cells = c("HLA-DPA1", "HLA-DPB1", "HLA-DRA", "HLA-B", "HLA-A", "FTL", "HLA-DRB1", "HLA-C", "CTSS", "AIF1"), Endothelial_cells = c("GNG11", "CLDN5", "S100A10", "TM4SF1", "ID3", "EGFL7", "IGFBP7", "SPARC", "ECSCR", "ENG"), Eosinophils = c("FUT4", "CCR3", "CD244", "PTGDR2", "SIGLEC8", "PGLYRP1", "EPX", "PRG3", "IL5RA", "RNASE3"), Epithelial_cells = c("TSPAN1", "TSPAN8", "ELF3", "PHGR1", "PIGR", "EPCAM", "CEACAM5", "LGALS4", "AGR2", "GUCA2A"), Fibroblasts = c("DCN", "VIM", "LUM", "SPARC", "C1S", "COL6A2", "S100A4", "COL1A2", "IGFBP6", "MMP2"), Gamma_delta_T_cells = c("GNLY", "NKG7", "CCL5", "CST7", "CTSW", "GZMB", "GZMA", "CD7", "KLRD1", "HOPX"), Macrophages = c("TYROBP", "CYBB", "AIF1", "S100A4", "MS4A7", "MAF", "TREM2", "SLC7A7", "GPNMB", "CLEC7A"), Mast_cells = c("TPSAB1", "TPSB2", "CPA3", "HPGDS", "RGS13", "RGS1", "MS4A2", "VWA5A", "KIT", "HDC"), Megakaryocytes = c("MPL", "GP5", "CXCR1", "CXCR2", "TLR9", "ALOX12", "IL21R", "GP1BA", "ITGB3", "TREML1"), Mesothelial_cells = c("RSPO1", "MSLN", "PKHD1L1", "MUC16", "CALB2", "MEDAG", "SULF1", "COBLL1", "WT1", "MIR22HG"), Monocytes = c("IFIT1", "OAS1", "IFIT3", "MX1", "PLSCR1", "TNFSF10", "PTPRC", "PSAP", "S100A9", "HLA-DRA"), Myeloid_derived_suppressor_cells = c("FUT4", "CEACAM8", "CCR2", "CD80", "CXCR1", "ADGRE1", "CD1D", "ITGAM", "ICAM1", "IL4R"), Myofibroblasts = c("TAGLN", "GFAP", "DES", "TNS1", "ACTA2", "CDH11", "PALLD", "MYL9", "CALD1"), NK_cells = c("NKG7", "GZMA", "CCL5", "CST7", "GNLY", "S100A4", "IL32", "CTSW", "KLRD1", "IL2RG"), Natural_killer_T_cells = c("IL12RB2", "IL12RB1", "TNFRSF8", "TBX21", "STYK1", "NR4A1", "IL17RA", "SLAMF7", "S1PR1", "ZBTB16"), Neutrophils = c("ELANE", "PRTN3", "AZU1", "CTSG", "MPO", "SERPINB1", "S100A4", "LYZ", "S100A8", "S100A9"), Nuocytes = c("CRLF2", "IL17RB", "ARG1", "IL5", "ICOS", "BCL11B", "RORA", "IL1RL1", "IL13", "GATA3"), Plasma_cells = c("IGKC", "SSR4", "MZB1", "CD74", "IGHA1", "IGLC2", "JCHAIN", "TNFRSF17", "CYCS", "CD79A"), Plasmacytoid_dendritic_cells = c("LILRA4", "IRF8", "IRF7", "JCHAIN", "GZMB", "SERPINF1", "ITM2C", "MZB1", "BCL11A", "TCF4"), Platelets = c("PF4", "PPBP", "CAVIN2", "CCL5", "NCOA4", "GNG11", "SRGN", "ARHGDIB", "GPX1", "HLA-A"), Red_pulp_macrophages = c("TREML4", "CCR3", "SPIC", "SIGLEC1", "ADGRE1", "MERTK", "SLC11A1", "VCAM1", "NR1H3", "CD86"), Smooth_muscle_cells = c("MYH11", "MYLK", "MYL9", "ACTA2", "SOD3", "LMOD1", "BGN", "PLN", "NOTCH3", "CNN1"), Stromal_cells = c("ICAM3", "B4GALNT1", "TLR1", "SNED1", "TLR4", "GDF10", "TLR2", "MADCAM1", "KIT", "FAP"), T_cells = c("IL32", "TRAC", "CD3D", "TRBC2", "CD52", "S100A4", "LTB", "CD3E", "CD2", "CXCR4"), T_cytotoxic_cells = c("GZMB", "CD2", "TRAC", "CD5", "CD69", "CD28", "CD8A", "CD27"), T_follicular_helper_cells = c("CXCR5", "TNFSF4", "SLAMF1", "BCL6", "PDCD1", "IL6R", "ICOS", "P2RX7", "CD84", "CD40LG"), T_helper_cells = c("IL2", "TBX21", "CCR5", "IL4", "IL10", "CCR8", "CCR3", "PTGDR2", "HAVCR1", "IL17A"), T_memory_cells = c("CD3D", "TRBC2", "LTB", "PTPRC", "LCK", "TRAC", "CD3G", "CD3E", "CD7", "CD2"), T_regulatory_cells = c("FOXP3", "IZUMO1R", "CNGB1", "IL10", "IKZF2", "CCR4", "ENTPD1", "NT5E", "CTLA4", "FOLR1"), Vascular_smooth_muscle_cells = c("TBX18", "SEMA3D", "MYH11", "ACTA2", "WT1", "PDGFRB") )
Then, I ran the GSEA using these codes: celltype_annotations <- annotateCellTypesGSEA(beta = results$beta, gset = gset, qval = 0.05) celltype_annotations$predictions
By the way, thank you very much for providing guidance on this. I just started my bioinformatics career last month, and it really means a lot to me that you make time and effort to help out researchers using your tool.
Hi @inhyeoklee
No problem - happy to help!
To answer your questions about the GSEA results, I'll direct you to a similar question/response in another issue.
Let me know if you still have any questions, Brendan
@bmill3r Thanks to your help, I have beautifully deconvolved (and annotated!!) clusters now. I'm nearing to the closure of this thread, but if you don't mind, could you please give any advice/suggestion on the next steps I'm taking. For the downstream analysis (e.g. DEG) for these outputs, what packages or any workflow would you recommend? I normally use Scanpy, Scanorama on Python, so I'm relatively less familiar with Seurat and other downstream analysis on R.
Hi @inhyeoklee
Congrats on the annotated cell types and glad everything worked out for you! In terms of other analyses it depends on what types of biological questions you are interested in answering. If you're interested in the differentially expressed genes in each of the deconvolved cell types, you could follow a similar analyses we did in this Analysis of 10X Visium data where we get the differentially expressed genes for each deconvolved cell-type transcriptional profile.
Hope this helps, Brendan
Thank you! For the DEG analysis, I want to generate violin plots of marker genes for clusters (cell types) across different samples, but I was wondering 1) how I should go about in terms of their integration. For example, if you have three classical monocyte clusters in one sample, and four classical monocyte clusters in another sample, should I manually compare their transcriptional profiles or do you have any suggestions?
2) Also, are gene expression values inside results$beta raw or normalized?
3) Are the enrichment scores provided by the annotateCellTypesGSEA function normalized for the gene set size?
Thank you, Daniel
Hi @inhyeoklee
If I understand your first question correctly, I think you're wondering if you can combine the classical monocyte clusters in a given sample together into one single monocyte cluster. You certainly can and one strategy to do this could be summing the transcriptional profiles and normalizing to 1 then scale to 1000 for better interpretability, for example.
Recall that the transcriptional profiles are probabilities of a cell type expressing those genes so in that sense they are already normalized to sum to 1 (then scaled by some factor if desired).
For the GSEA performed in annotateCellTypesGSEA()
, I would encourage you to check out the paper Subramanian, Tamayo, et al. to understand how the enrichment score is calculated.
Hope this helps, Brendan
Thank you very much Brendan! :)
@inhyeoklee @bmill3r I have been trying to annotate using annotateCellTypesGSEA(). The issue I am having is that the top genes from topGenes(beta)...they do not actually give different cluster names as I expect. For example I have 8 topics and I see that there are genes that could define them as 8 different cell types..So i make a gset and run the command annotateCellTypesGSEA(). But I get a result like this.
celltype_annotations$predictions 1 2 3 4 5 6 "Macrophage" "Monocyte" "Progenitor_cell" "Fibroblast" "Fibroblast" "Progenitor_cell" 7 8 "Fibroblast" "B_cell"
As you can see some cell types are repeated. Could someone help me solve this. Thanks!
Hi @Umaarasu,
During GSEA, it's possible that different topics could be assigned to the same cell type. Additionally, each topic could be significantly enriched with gene sets of different cell types but the returned prediction is the cell type gene set that ranked the highest based on a given set of criteria (you can see how this is done by looking at the annotateCellTypesGSEA()
function itself). You can explore the GSEA statistics that are returned and choose to assign a cell type gene set annotation differently, which might yield a different final cell type annotation.
Hope this helps, Brendan
Hi,
Thank you very much for this amazing tool! After running STdeconvolve on human femoral & carotid plaque Visium data, I'm thinking of doing cell-type annotation with Azimuth. Do you have any suggestions on how to implement Azimuth on STdeconvolve outputs?
Thanks, Daniel Lee