chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Gene Set Enrichment Analysis #252

Closed ambrosejcarr closed 11 months ago

ambrosejcarr commented 3 years ago

Goal

Story: A user with a set of genes can submit them for gene set enrichment analysis. Cellxgene returns a list of labeled gene sets that pass a significance threshold, ordered by how likely each one is to explain the submitted genes.

Example ways that a user might generate a gene set, and want to understand probable mechanisms:

Approach

Generally, label recommendation proceeds as follows:

  1. User creates a query gene set in a data-driven fashion, but doesn't understand the biological mechanism that caused those genes to be co-expressed in their sample
  2. From the literature, user identifies a set of labeled gene sets (candidate gene sets) which might explain the mechanism
  3. User tests how well each candidate gene set matches their query gene set. The user evaluates the ordered list of candidate labels to infer the likely mechanism. _Note: sometimes the top recommendation is not accepted! For example, if the top recommendation was "cell cycle process" but the next 8 were related to "apoptosis", the user might infer that the true mechanism was apoptosis-related. This happens because a gene may belong to multiple sets. The true result the user is seeking is a "label cloud" for their process that they can investigate._
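
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_by_overlap`, the placeholder gene names (`G1`, `G2`, ...), and the candidate labels are hypothetical, and raw overlap count stands in for whatever statistical test is eventually chosen. It shows why a "label cloud" is more informative than the single top hit: one gene can appear in several candidate sets.

```python
def rank_by_overlap(query, candidates):
    """Rank candidate gene sets by how many genes they share with the query.

    `candidates` maps a label to a collection of gene names. Overlap count
    is a placeholder score here, not a significance test.
    """
    query = set(query)
    scored = []
    for label, genes in candidates.items():
        shared = query & set(genes)
        scored.append((label, len(shared), sorted(shared)))
    # Highest overlap first; ties keep their original order.
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Toy data with placeholder gene names (not real annotations).
query = {"G1", "G2", "G3", "G4"}
candidates = {
    "cell cycle process": {"G1", "G2", "G3", "G9"},
    "apoptosis (set A)": {"G2", "G3", "G8"},
    "apoptosis (set B)": {"G3", "G4", "G7"},
}

ranking = rank_by_overlap(query, candidates)
for label, n_shared, shared in ranking:
    print(label, n_shared, shared)
```

In this toy input, `G3` belongs to all three candidate sets, so the labels below the top hit are not independent evidence; the user reads the ranked list as a whole.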

Implementation

There are two decisions to be made:

  1. What labeled gene sets should be candidates for labeling?
  2. What test should be run to compare the candidate sets with the query set?

Research on algorithms

Review of potential algorithms

This review suggests that over-representation analysis (ORA), executed with a Fisher's exact test, is the fastest approach to obtain answers. Because this method assumes that the genes in a set are independent, it may not generate accurate p-values. However, when judged by how well it ranks relevant sets above non-relevant sets, it performs as well as more computationally complex (and much slower) approaches. Due to these characteristics, ORA is the most commonly used approach.

Based on this, use of ORA is proposed.
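
As a rough sketch of what ORA with a one-sided Fisher's exact test could look like, the snippet below computes the upper tail of the hypergeometric distribution using only the Python standard library. Everything named here is an assumption for illustration: the function names `ora_p_value` and `rank_candidates`, the `alpha` threshold of 0.05, the choice of background gene universe, and the toy gene names. Multiple-testing correction is deliberately left out to keep the example short.

```python
from math import comb


def ora_p_value(query, candidate, background):
    """One-sided Fisher's exact test p-value for over-representation.

    Returns P(overlap >= observed) when len(query) genes are drawn at
    random, without replacement, from the background.
    """
    background = set(background)
    query = set(query) & background          # ignore genes outside the universe
    candidate = set(candidate) & background
    n_background = len(background)
    n_candidate = len(candidate)
    n_query = len(query)
    observed = len(query & candidate)
    # Hypergeometric upper tail: sum over overlaps at least as large as observed.
    tail = sum(
        comb(n_candidate, i) * comb(n_background - n_candidate, n_query - i)
        for i in range(observed, min(n_candidate, n_query) + 1)
    )
    return tail / comb(n_background, n_query)


def rank_candidates(query, candidates, background, alpha=0.05):
    """Keep candidate sets with p <= alpha, most significant first."""
    results = [
        (label, ora_p_value(query, genes, background))
        for label, genes in candidates.items()
    ]
    return sorted((r for r in results if r[1] <= alpha), key=lambda r: r[1])


# Toy universe of 20 placeholder genes (hypothetical data).
background = [f"G{i}" for i in range(20)]
query = background[:5]
candidates = {
    "set A": background[:5],      # identical to the query
    "set B": background[10:15],   # no overlap with the query
}

p_a = ora_p_value(query, candidates["set A"], background)
p_b = ora_p_value(query, candidates["set B"], background)
ranked = rank_candidates(query, candidates, background)
print(p_a, p_b, ranked)
```

The p-values depend heavily on the background set, so that choice would need to be settled alongside the choice of candidate gene sets.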

ORA execution:

Research on candidate gene sets

metakuni commented 11 months ago

@ambrosejcarr / @signechambers1 : Is this still relevant or can we close this?

signechambers1 commented 11 months ago

I think we can close for now and reopen or make a new issue if / when we decide to prioritize. Thank you @metakuni !