Closed forrestzhaosen closed 1 year ago
Hey!
Let me explain a little about how EWCE works since what you are asking doesn't really fit into EWCE's usual use cases. So EWCE uses reference single-cell/single-nucleus RNA-seq datasets to work the specificity of genes' expressions to specific cell types. In this sense, you can get these specificity values for each cell type:
#this reference dataset is from mouse cortex and hypothalamus single-cell RNA-seq data (from the Karolinska Institute).
ctd <- ewceData::ctd()
#Different cell levels can be inspected, let's go with the highest cell level groupings and look at the specificity matrix
ctd[[1]]$specificity[1:10,]
astrocytes_ependymal endothelial-mural interneurons microglia oligodendrocytes pyramidal CA1 pyramidal SS
Tspan12 0.20156111 0.43344472 0.1276534 0.09963109 0.04496569 0.060313947 0.03243007
Tshz1 0.17881643 0.22553038 0.1572263 0.20926498 0.09224942 0.009966436 0.12694604
Fnbp1l 0.10217628 0.22373111 0.1993075 0.09898422 0.04575399 0.081178989 0.24886792
Adamts15 0.03860933 0.51199973 0.3113053 0.00000000 0.06116274 0.017480158 0.05944269
Cldn12 0.19225487 0.11971558 0.1550142 0.05023070 0.13721201 0.130088678 0.21548392
Rxfp1 0.00000000 0.10061558 0.5423008 0.00000000 0.06304908 0.114920073 0.17911438
2310042E22Rik 0.03910053 0.12023499 0.6568037 0.03405283 0.05650745 0.077247458 0.01605304
Sema3c 0.04175674 0.25607623 0.5989986 0.04959018 0.02405407 0.011796175 0.01772802
Jam2 0.30642131 0.33054824 0.2148130 0.06456381 0.04821188 0.013628951 0.02181276
Apbb1ip 0.00000000 0.03384452 0.2095321 0.69014938 0.03752203 0.012885393 0.01606654
However, going to your cell type of interest and picking the most specific genes may not be a sensible thing to do as the actual expression levels of these genes, for example, could be very low but it might just happen that the small amount of expression that was noted came from this cell type. It is worth understanding what we mean by specificity here before trying to use this data for your own problem.
You want:
a list of genes that drive the significance of the enriched cell type
From what I understand, you are trying to get a list of genes that are specific to a cell type. This could be done by taking say the top 10% quantile of genes from this specificity matrix using a reference dataset from the region of interest (see here for documentation for creating your own). However note the caveats and possible issues with this I give above. Have a read of the original publication to get a better understanding of these concepts and how we use them in EWCE - https://www.frontiersin.org/articles/10.3389/fnins.2016.00016/full
Cheers, Alan.
@Al-Murphy I think what @forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.
EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosen . One simple solution would be extend what @Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.
You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this.
But I don’t think there is a subset of genes driving the enrichments, in most cases
Sent from Outlook for iOShttps://aka.ms/o0ukef
From: Brian M. Schilder @.> Sent: Thursday, March 9, 2023 12:05:26 PM To: NathanSkene/EWCE @.> Cc: Subscribed @.***> Subject: Re: [NathanSkene/EWCE] Driver gene identification (Issue #77)
This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.
@Al-Murphyhttps://github.com/Al-Murphy I think what @forrestzhaosenhttps://github.com/forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.
EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosenhttps://github.com/forrestzhaosen . One simply solution would be extend what @Al-Murphyhttps://github.com/Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.
— Reply to this email directly, view it on GitHubhttps://github.com/NathanSkene/EWCE/issues/77#issuecomment-1461914838, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPE2CXDNV5MQHAB62HG3W3HBQNANCNFSM6AAAAAAVULGEXQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>
You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this. But I don’t think there is a subset of genes driving the enrichments, in most cases
Oh right, forgot about that function! This what more what I mean. If I'm remembering correctly though, the plots were implemented in such a way that the enrichments tests had to be rerun, which isn't ideal since it doubles compute time and isn't tied directly to your first round of enrichment results (due to the stochasticity).
If am indeed remembering correctly, maybe we can add a feature to the main bootstrapping function that computes the per-gene probability and stores them in another slot within the output.
I believe this is the function in question @NathanSkene . Does this seem like what you're looking for @forrestzhaosen ? https://nathanskene.github.io/EWCE/reference/generate_bootstrap_plots.html
And here is what i mean about the bootstrapping tests being redone: https://github.com/NathanSkene/EWCE/blob/9b5b21fb52e93c78b5c57923db19f2879f0910f8/R/generate_bootstrap_plots.r#L157
Looking into the code now to see if we can just record these gene-wise probabilities during the bootstrap tests conducted by bootstrap_enrichment_test
Thank you all! That's really helpful.
I just tried generate_bootstrap_plots but met this ERROR:
_Error in rownames<-
(*tmp*
, value = sprintf("Rep%s", seqlen(nReps))) :
attempt to set 'rownames' on an object with no dimensions
I couldn't figure this out by reading your source code. I am using EWCE version 1.6.
Do you get any quick idea about it?
Thank you all! That's really helpful. I just tried generate_bootstrap_plots but met this ERROR: _Error in
rownames<-
(*tmp*
, value = sprintf("Rep%s", seqlen(nReps))) : attempt to set 'rownames' on an object with no dimensions I couldn't figure this out by reading your source code. I am using EWCE version 1.6. Do you get any quick idea about it?
@forrestzhaosen Please do include a reproducible example using the Bugs template in a new Issue. Otherwise there's not much we can do for you.
In other news, making progress on implementing the built-in gene scoring for
Looks something like this atm:
Thanks! That's amazing.
Actually, needs to be cell-type specific scores, and be based on the specificity of each gene within a cellular context.
So I think this works better.
Ok, so i rewrote much of generate_bootstrap_plots
plots to make a number of improvments:
gene_data
.
It will also tell you which of these options it's using.ggsave
instead if grDevices
.adj_pval_thresh
Here's what the plots look like now:
## Load the single cell data
sct_data <- ewceData::ctd()
## Set the parameters for the analysis
## Use 5 bootstrap lists for speed, for publishable analysis use >10000
reps <- 5
## Load the gene list and get human orthologs
hits <- ewceData::example_genelist()[1:100]
## Bootstrap significance test,
## no control for transcript length or GC content
## Use pre-computed results to speed up example
full_results <- EWCE::example_bootstrap_results()
output <- EWCE::generate_bootstrap_plots(
sct_data = sct_data,
hits = hits,
reps = reps,
full_results = full_results,
listFileName = "Example",
sctSpecies = "mouse",
genelistSpecies = "human",
annotLevel = 1,
save_dir = tempdir()
)
Let me know if anything seems awry with these @NathanSkene
actually, shouldn't the y-axes be labeled "Specificity in cell type"? Since it's the specificity matrix that it ultimately being used to generate these plots (currently, and before I touched anything)
Note to self, I'll also revamp the generate_bootstrap_plots_for_transcriptome
func to match the upgrades in generate_bootstrap_plots
. This is not automatic since a lot of the original code was just a separate copy of generate_bootstrap_plots
, as opposed to using subfunctions that could be applied to both. This kind of structure inflates the size of EWCE
's total code and leaves room for unintended divergence between function behaviour. When I have time, I'll try to refactor some of the code to alleviate this issue.
@Al-Murphy I've pushed the changes I've made so far to generate_bootstrap_plots
so that @forrestzhaosen can start using it right away. But I'd hold off on pushing upstream to Bioc just yet as I want to finish the generate_bootstrap_plots_for_transcriptome
upgrades first.
@forrestzhaosen to install the dev version of EWCE use:
remotes::install_github("NathanSkene/EWCE", dependencies = TRUE, upgrade = "always")
Thank you so much! It's greatly appreciated.
Changes to the generate_bootstrap_plots_for_transcriptome
function have now been pushed as well.
Is it possible to generate a list of genes that drive the significance of the enriched cell type? Thanks!