forrestzhaosen commented 1 year ago

Is it possible to generate a list of genes that drive the significance of the enriched cell type? Thanks!

Al-Murphy commented 1 year ago

Hey!

Let me explain a little about how EWCE works since what you are asking doesn't really fit into EWCE's usual use cases. So EWCE uses reference single-cell/single-nucleus RNA-seq datasets to work the specificity of genes' expressions to specific cell types. In this sense, you can get these specificity values for each cell type:

#this reference dataset is from mouse cortex and hypothalamus single-cell RNA-seq data (from the Karolinska Institute).
ctd <- ewceData::ctd()
#Different cell levels can be inspected, let's go with the highest cell level groupings and look at the specificity matrix
ctd[[1]]$specificity[1:10,]
              astrocytes_ependymal endothelial-mural interneurons  microglia oligodendrocytes pyramidal CA1 pyramidal SS
Tspan12                 0.20156111        0.43344472    0.1276534 0.09963109       0.04496569   0.060313947   0.03243007
Tshz1                   0.17881643        0.22553038    0.1572263 0.20926498       0.09224942   0.009966436   0.12694604
Fnbp1l                  0.10217628        0.22373111    0.1993075 0.09898422       0.04575399   0.081178989   0.24886792
Adamts15                0.03860933        0.51199973    0.3113053 0.00000000       0.06116274   0.017480158   0.05944269
Cldn12                  0.19225487        0.11971558    0.1550142 0.05023070       0.13721201   0.130088678   0.21548392
Rxfp1                   0.00000000        0.10061558    0.5423008 0.00000000       0.06304908   0.114920073   0.17911438
2310042E22Rik           0.03910053        0.12023499    0.6568037 0.03405283       0.05650745   0.077247458   0.01605304
Sema3c                  0.04175674        0.25607623    0.5989986 0.04959018       0.02405407   0.011796175   0.01772802
Jam2                    0.30642131        0.33054824    0.2148130 0.06456381       0.04821188   0.013628951   0.02181276
Apbb1ip                 0.00000000        0.03384452    0.2095321 0.69014938       0.03752203   0.012885393   0.01606654

However, going to your cell type of interest and picking the most specific genes may not be a sensible thing to do as the actual expression levels of these genes, for example, could be very low but it might just happen that the small amount of expression that was noted came from this cell type. It is worth understanding what we mean by specificity here before trying to use this data for your own problem.

You want:

a list of genes that drive the significance of the enriched cell type

From what I understand, you are trying to get a list of genes that are specific to a cell type. This could be done by taking say the top 10% quantile of genes from this specificity matrix using a reference dataset from the region of interest (see here for documentation for creating your own). However note the caveats and possible issues with this I give above. Have a read of the original publication to get a better understanding of these concepts and how we use them in EWCE - https://www.frontiersin.org/articles/10.3389/fnins.2016.00016/full

Cheers, Alan.

bschilder commented 1 year ago

@Al-Murphy I think what @forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.

EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosen . One simple solution would be extend what @Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.

NathanSkene commented 1 year ago

You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this.

But I don’t think there is a subset of genes driving the enrichments, in most cases

Sent from Outlook for iOShttps://aka.ms/o0ukef

From: Brian M. Schilder @.> Sent: Thursday, March 9, 2023 12:05:26 PM To: NathanSkene/EWCE @.> Cc: Subscribed @.***> Subject: Re: [NathanSkene/EWCE] Driver gene identification (Issue #77)

This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.

@Al-Murphyhttps://github.com/Al-Murphy I think what @forrestzhaosenhttps://github.com/forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.

EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosenhttps://github.com/forrestzhaosen . One simply solution would be extend what @Al-Murphyhttps://github.com/Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.

— Reply to this email directly, view it on GitHubhttps://github.com/NathanSkene/EWCE/issues/77#issuecomment-1461914838, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPE2CXDNV5MQHAB62HG3W3HBQNANCNFSM6AAAAAAVULGEXQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>

bschilder commented 1 year ago

You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this. But I don’t think there is a subset of genes driving the enrichments, in most cases

Oh right, forgot about that function! This what more what I mean. If I'm remembering correctly though, the plots were implemented in such a way that the enrichments tests had to be rerun, which isn't ideal since it doubles compute time and isn't tied directly to your first round of enrichment results (due to the stochasticity).

If am indeed remembering correctly, maybe we can add a feature to the main bootstrapping function that computes the per-gene probability and stores them in another slot within the output.

bschilder commented 1 year ago

I believe this is the function in question @NathanSkene . Does this seem like what you're looking for @forrestzhaosen ? https://nathanskene.github.io/EWCE/reference/generate_bootstrap_plots.html

And here is what i mean about the bootstrapping tests being redone: https://github.com/NathanSkene/EWCE/blob/9b5b21fb52e93c78b5c57923db19f2879f0910f8/R/generate_bootstrap_plots.r#L157

Looking into the code now to see if we can just record these gene-wise probabilities during the bootstrap tests conducted by bootstrap_enrichment_test

forrestzhaosen commented 1 year ago

Thank you all! That's really helpful. I just tried generate_bootstrap_plots but met this ERROR: _Error in rownames<-(*tmp*, value = sprintf("Rep%s", seqlen(nReps))) : attempt to set 'rownames' on an object with no dimensions I couldn't figure this out by reading your source code. I am using EWCE version 1.6. Do you get any quick idea about it?

bschilder commented 1 year ago

Thank you all! That's really helpful. I just tried generate_bootstrap_plots but met this ERROR: _Error in rownames<-(*tmp*, value = sprintf("Rep%s", seqlen(nReps))) : attempt to set 'rownames' on an object with no dimensions I couldn't figure this out by reading your source code. I am using EWCE version 1.6. Do you get any quick idea about it?

@forrestzhaosen Please do include a reproducible example using the Bugs template in a new Issue. Otherwise there's not much we can do for you.

In other news, making progress on implementing the built-in gene scoring for Looks something like this atm: Screenshot 2023-03-09 at 15 54 18

forrestzhaosen commented 1 year ago

Thanks! That's amazing.

bschilder commented 1 year ago

Actually, needs to be cell-type specific scores, and be based on the specificity of each gene within a cellular context. So I think this works better. Screenshot 2023-03-09 at 17 02 27

bschilder commented 1 year ago

Ok, so i rewrote much of generate_bootstrap_plots plots to make a number of improvments:

Revamp wrap code into reusable subfunctions.
Avoid resampling random genes when the gene data is stored in the bootstrap results as gene_data. It will also tell you which of these options it's using.
Save with ggsave instead if grDevices.
Facet by celltype instead of generating tons of separate plots.
Let users decide cutoff threshold with new arg adj_pval_thresh
Now returns a named list with the plots themselves ("plot") and the paths to where they're saved ("paths") rather than just a higher-level directory path in which users had to search for the right files (and didn't ever have access to the ggplot2 objects).

Here's what the plots look like now:

Reprex

## Load the single cell data
sct_data <- ewceData::ctd()

## Set the parameters for the analysis
## Use 5 bootstrap lists for speed, for publishable analysis use >10000
reps <- 5

## Load the gene list and get human orthologs
hits <- ewceData::example_genelist()[1:100]

## Bootstrap significance test,
##  no control for transcript length or GC content
## Use pre-computed results to speed up example
full_results <- EWCE::example_bootstrap_results()

output <- EWCE::generate_bootstrap_plots(
    sct_data = sct_data,
    hits = hits,
    reps = reps,
    full_results = full_results,
    listFileName = "Example",
    sctSpecies = "mouse",
    genelistSpecies = "human",
    annotLevel = 1,
    save_dir = tempdir()
)

Plots

Plot 1

Plot 2

Plot 3

Plot 4

Let me know if anything seems awry with these @NathanSkene

bschilder commented 1 year ago

actually, shouldn't the y-axes be labeled "Specificity in cell type"? Since it's the specificity matrix that it ultimately being used to generate these plots (currently, and before I touched anything)

bschilder commented 1 year ago

Note to self, I'll also revamp the generate_bootstrap_plots_for_transcriptome func to match the upgrades in generate_bootstrap_plots. This is not automatic since a lot of the original code was just a separate copy of generate_bootstrap_plots, as opposed to using subfunctions that could be applied to both. This kind of structure inflates the size of EWCE's total code and leaves room for unintended divergence between function behaviour. When I have time, I'll try to refactor some of the code to alleviate this issue.

bschilder commented 1 year ago

@Al-Murphy I've pushed the changes I've made so far to generate_bootstrap_plots so that @forrestzhaosen can start using it right away. But I'd hold off on pushing upstream to Bioc just yet as I want to finish the generate_bootstrap_plots_for_transcriptome upgrades first.

@forrestzhaosen to install the dev version of EWCE use:

remotes::install_github("NathanSkene/EWCE", dependencies = TRUE, upgrade = "always")

forrestzhaosen commented 1 year ago

Thank you so much! It's greatly appreciated.

bschilder commented 1 year ago

Changes to the generate_bootstrap_plots_for_transcriptome function have now been pushed as well.

NathanSkene / EWCE

Driver gene identification #77

Reprex

Plots

Plot 1

Plot 2

Plot 3

Plot 4