NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html

How can I extract a list of genes that drive the (significant) enrichment for a particular cell type? #90

Closed KevinMarinus closed 7 months ago

KevinMarinus commented 8 months ago

Hi!

I'm using the EWCE tool and I nicely get all the plots for cellular enrichments. However, I'm also interested in which genes drive my significant results (and which genes drive the fold-change for non-significant results). Is there an easy way to extract the genes that drive the enrichments?

Thanks! Kevin

Al-Murphy commented 8 months ago

Hey!

You can compare your input gene list to the specificity matrix for the reference dataset to see the specificity value of each gene in each cell type. This should give you an idea of which genes drive the enrichment. Note that EWCE can account for gene length and GC content (via the geneSizeControl argument of EWCE::bootstrap_enrichment_test()), which you won't get insight into when looking at the specificity matrix alone.

Here is how to get the specificity matrix:

ctd <- ewceData::ctd()
ctd[[1]]$specificity

You could then sort and see where your genes fall.
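
A minimal sketch of that sort-and-look-up step, using a small made-up specificity matrix in place of ctd[[1]]$specificity. The gene names, cell types, values, and the helper rank_hits_by_specificity are illustrative assumptions, not part of EWCE. A real CTD may store the matrix in a sparse format, in which case it might need converting with as.matrix() first.

```r
# Toy stand-in for ctd[[1]]$specificity: genes in rows, cell types in columns.
# Real values would come from the CTD; these numbers are made up.
specificity <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.90, 0.05,
    0.30, 0.30, 0.40,
    0.60, 0.10, 0.30),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("astrocytes", "microglia", "neurons")
  )
)

# Hypothetical helper: rank the input genes by specificity in one cell type.
rank_hits_by_specificity <- function(spec, hits, cell_type) {
  present <- intersect(hits, rownames(spec))  # drop genes absent from the CTD
  vals <- spec[present, cell_type]
  names(vals) <- present                      # keep names even for a single gene
  sort(vals, decreasing = TRUE)
}

hits <- c("GeneA", "GeneC", "GeneD", "GeneX")  # GeneX is not in the matrix
ranked <- rank_hits_by_specificity(specificity, hits, "astrocytes")
print(ranked)
```

Genes near the top of the ranked vector are the ones most specific to that cell type, so they are the likeliest contributors to its enrichment score.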

Thanks, Alan.

bschilder commented 8 months ago

@KevinMarinus I think your question should be answered here:

KevinMarinus commented 8 months ago

Thanks for your quick and helpful replies. Indeed, the qqplot with gene annotation from "generate_bootstrap_plots" contains the information that I'm searching for. The problem is that the annotations are too big and overlap each other making them unfortunately unreadable (see attachment). I tried to extract the annotations using "sjlabelled" and their "get_labels" function but that didn't work either. Is there a way to adjust the labels in the figure to make them readable and/or to extract the labels (with the criteria for which gene is labelled)?

qqplot_wtgene_____level1.pdf

Al-Murphy commented 7 months ago

So we don't have that functionality coded yet but you can get at the ggplot object for any of the plots you like, see below:

## Load the single cell data
sct_data <- ewceData::ctd()

## Set the parameters for the analysis
## Use 5 bootstrap lists for speed, for publishable analysis use >10000
reps <- 5

## Load the gene list and get human orthologs
hits <- ewceData::example_genelist()[1:100]

## Bootstrap significance test,
##  no control for transcript length or GC content
## Use pre-computed results to speed up example
full_results <- EWCE::example_bootstrap_results()

output <- EWCE::generate_bootstrap_plots(
  sct_data = sct_data,
  hits = hits,
  reps = reps,
  full_results = full_results,
  listFileName = "Example",
  sctSpecies = "mouse",
  genelistSpecies = "human",
  annotLevel = 1,
  save_dir = tempdir()
)

## Get the ggplot object for plot 2
ggplot_obj <- output$plots$plot2

Then you can get the data for the plot (for example) to see the values for all annotated genes:

output$plots$plot2$data

You can then remove labels for all but the top genes and then replot if you wanted.
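
A rough sketch of that label-trimming step on a toy data frame standing in for output$plots$plot2$data. The column names gene, value and label and the helper keep_top_labels are assumptions for illustration; the real column names can be checked with names(output$plots$plot2$data). The replot at the end uses a generic ggplot2 scatter with text labels, not the layers EWCE itself builds.

```r
# Toy stand-in for output$plots$plot2$data (column names are assumptions).
plot_data <- data.frame(
  gene  = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  value = c(0.9, 0.2, 0.6, 0.4, 0.8),
  label = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blank every label except the top_n rows ranked by `by`.
keep_top_labels <- function(df, by, label_col, top_n) {
  n_keep <- min(top_n, nrow(df))
  keep <- order(df[[by]], decreasing = TRUE)[seq_len(n_keep)]
  is_top <- seq_len(nrow(df)) %in% keep
  df[[label_col]] <- as.character(df[[label_col]])  # guard against factors
  df[[label_col]][!is_top] <- ""
  df
}

trimmed <- keep_top_labels(plot_data, by = "value", label_col = "label", top_n = 2)
print(trimmed)

# Genes that remain labelled, as a plain character vector
labelled_genes <- trimmed$gene[trimmed$label != ""]

# Replot only if ggplot2 is installed; geom_text is a generic stand-in layer.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    trimmed,
    ggplot2::aes(x = gene, y = value, label = label)
  ) +
    ggplot2::geom_point() +
    ggplot2::geom_text(vjust = -1)
}
```

The same trimmed data frame also answers the extraction question directly: labelled_genes is the list of genes that would still carry a label.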

See the function code here which references the functions for each plot: https://github.com/NathanSkene/EWCE/blob/master/R/generate_bootstrap_plots_for_transcriptome.r

KevinMarinus commented 7 months ago

Thanks again! This helped me a lot already.

I have two other issues with the bootstrapplots:

  1. I'm using human data and a human CTD but, although I specify in bootstrapplots that sctSpecies_="human", it gives a warning that it's automatically set to mouse: "Warning: sctSpecies not provided. Setting to 'mouse' by default. Warning: sctSpecies_origin not provided. Setting to 'mouse' by default.". Changing sctSpecies to sctSpecies_origin doesn't solve the issue. Everything else seems to work, so perhaps it's a false warning(?).
  2. Regardless of this warning, everything works fine for level 1 but not for level 2; for level 2 it gives, among other errors, the following error: "ERROR: No cell types in full_results are found in sct_data. Perhaps the wrong annotLevel was used?". This error only occurs when one CTD contains both level 1 and level 2: separating the CTD (i.e., making one CTD for level 1 and another for level 2, with level 2 stored as level 1 in the latter) solves the issue. However, this is quite a cumbersome approach for bigger scripts with multiple input data files.
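
The workaround in point 2 (a separate CTD in which level 2 sits in the level-1 slot) can be sketched as plain list subsetting, assuming the CTD is an ordinary R list with one element per annotation level. The toy CTD below holds only made-up specificity matrices, and make_level is a hypothetical helper; a real CTD carries more per level.

```r
# Toy two-level CTD: a list with one element per annotation level.
# Cell-type names and values are made up for illustration.
make_level <- function(cell_types) {
  n <- length(cell_types)
  list(
    specificity = matrix(
      rep(1 / n, n),
      nrow = 1,
      dimnames = list("GeneA", cell_types)
    )
  )
}

ctd <- list(
  make_level(c("neurons", "glia")),
  make_level(c("excitatory", "inhibitory", "astrocytes", "microglia"))
)

# Keep only level 2, wrapped in a list so it becomes level 1 of a new CTD.
ctd_level2_only <- list(ctd[[2]])

length(ctd_level2_only)
colnames(ctd_level2_only[[1]]$specificity)
```

The single-level object would then be passed with annotLevel = 1, which matches the separate-CTD approach described above without rebuilding the CTD from scratch.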

Is there a way to run the bootstrapplots on level 2 when one human CTD contains both levels?

My colleague had exactly the same issue with human data and I therefore assume it pops up with any human dataset combining level1&2 in one CTD. I've attached the console in- and output to illustrate the issue with the same annotation as here (issue 1 and 2).

GitHub errors.docx

bschilder commented 7 months ago

@KevinMarinus I'd recommend submitting this as a separate Issue with a reprex (and filling out the full bug report template) as it's outside the scope of this Issue.