Gene Set Enrichment Analysis

ambrosejcarr commented 3 years ago

Goal

Story: A user with a set of genes can submit them for gene set enrichment analysis. Cellxgene returns a list of gene sets (and labels) that pass a significance threshold, ordered by the likelihood that they explain the genes being submitted.

Example ways that a user might generate a gene set, and want to understand probable mechanisms:

User runs Differential Expression
User selects genes that hierarchical clustering grouped together in the dotplot

Approach

Generally, label recommendation proceeds as follows:

User creates a query gene set in a data-driven fashion, but doesn't understand the biological mechanism that caused those genes to be co-expressed in their sample
From literature, user identifies a set of labeled gene sets (candidate gene sets) which might explain the mechanism
User tests how well each candidate gene set matches their gene set. The user evaluates the ordered list of candidate labels to infer the likely mechanism. Note: sometimes the top recommendation is not accepted! For example, if the top recommendation was "cell cycle process" but the next 8 were related to "apoptosis", the user might infer that the true mechanism was apoptosis-related. This happens because a gene may belong to multiple sets. The true result the user is seeking is a "label cloud" for their process that they can investigate.

Implementation

There are two decisions to be made:

What labeled gene sets should be candidates for labeling?
What test should be run to compare the candidate sets with the query set?

Research on algorithms

Review of potential algorithms

This review suggests that over-representation analysis (ORA), executed with a Fisher's exact test, is the fastest approach to obtain answers. Because this method assumes that the genes in a set are independent, it may not generate accurate p-values. However, judged by performance at ranking relevant sets above non-relevant sets, it performs as well as any more computationally complex (and much longer running) approach. Due to these characteristics, ORA is the most commonly used approach.
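To make the ORA test concrete, here is a minimal sketch of a one-sided Fisher's exact test for a single candidate set, using `scipy.stats.fisher_exact`. The function name `ora_fisher`, the toy gene identifiers, and the choice of background (all genes measured in the dataset) are assumptions for illustration, not part of any existing cellxgene code.

```python
from scipy.stats import fisher_exact


def ora_fisher(query, candidate, background):
    """One-sided Fisher's exact test: is `candidate` over-represented in `query`?

    Genes outside `background` are ignored (assumption: the background is
    the set of genes that could have ended up in the query).
    """
    background = set(background)
    query = set(query) & background
    candidate = set(candidate) & background

    in_both = len(query & candidate)
    query_only = len(query - candidate)
    candidate_only = len(candidate - query)
    neither = len(background) - in_both - query_only - candidate_only

    # 2x2 contingency table: rows = in query / not in query,
    # columns = in candidate set / not in candidate set.
    table = [[in_both, query_only], [candidate_only, neither]]
    _, p_value = fisher_exact(table, alternative="greater")
    return table, float(p_value)


# Toy data (hypothetical gene names).
background = {f"g{i}" for i in range(100)}
candidate = {f"g{i}" for i in range(10)}
query = {"g0", "g1", "g2", "g3", "g50", "g51"}

table, p_value = ora_fisher(query, candidate, background)
print(table, p_value)
```

The independence assumption mentioned above is visible here: each gene is counted as a separate draw in the contingency table, regardless of how correlated its expression is with other genes in the set.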

Based on this, ORA is the proposed approach.

ORA execution:

See this article for the intuition behind ORA and how it is implemented in R.
See this answer for how to use the hypergeometric test from scipy.stats to calculate overrepresentation of a query set in a single candidate gene set (would need to be parallelized across candidate sets).
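As a rough sketch of the second point, the hypergeometric test can be looped over all candidate sets and the results sorted by p-value. The names `rank_candidate_sets`, `candidates`, and `alpha` are hypothetical, the toy gene sets are made up, and no multiple-testing correction is applied here, which a real implementation would probably want. The loop is serial; each iteration is independent, so it could be parallelized across candidate sets.

```python
from scipy.stats import hypergeom


def rank_candidate_sets(query, candidates, background, alpha=0.05):
    """Return (label, overlap, p_value) for candidate sets with p <= alpha,
    ordered from most to least significant."""
    background = set(background)
    query = set(query) & background
    results = []
    for label, genes in candidates.items():
        genes = set(genes) & background
        overlap = len(query & genes)
        # P(X >= overlap) when drawing len(query) genes without replacement
        # from a background containing len(genes) "successes".
        p_value = hypergeom.sf(overlap - 1, len(background), len(genes), len(query))
        results.append((label, overlap, float(p_value)))
    results.sort(key=lambda r: r[2])
    return [r for r in results if r[2] <= alpha]


# Toy data (hypothetical gene names and labels).
background = {f"g{i}" for i in range(100)}
candidates = {
    "apoptosis": {f"g{i}" for i in range(10)},
    "cell cycle process": {f"g{i}" for i in range(40, 60)},
}
query = {"g0", "g1", "g2", "g3", "g50", "g51"}

ranked = rank_candidate_sets(query, candidates, background)
for label, overlap, p_value in ranked:
    print(label, overlap, p_value)
```

The upper-tail hypergeometric probability used here is the same quantity a one-sided Fisher's exact test reports, so either formulation gives the same ranking.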

Research on candidate gene sets

The Enrichr publication has some great research on candidate gene sets.
Additional research required.

metakuni commented 11 months ago

@ambrosejcarr / @signechambers1 : Is this still relevant or can we close this?

signechambers1 commented 11 months ago

I think we can close for now and reopen or make a new issue if / when we decide to prioritize. Thank you @metakuni !

chanzuckerberg / single-cell