Story: A user with a set of genes can submit them for gene set enrichment analysis. Cellxgene returns a list of gene sets (and labels) that pass a significance threshold, ordered by the likelihood that they explain the genes being submitted.
Examples of how a user might generate a gene set whose probable mechanism they want to understand:
User runs Differential Expression
User selects genes that hierarchical clustering groups together in the dotplot
Approach
Generally, label recommendation proceeds as follows:
User creates a query gene set in a data-driven fashion, but doesn't understand the biological mechanism that caused those genes to be co-expressed in their sample
From the literature, user identifies a set of labeled gene sets (candidate gene sets) that might explain the mechanism
User tests how well each candidate gene set matches their query gene set, then evaluates the ordered list of candidate labels to infer the likely mechanism. Note: the top recommendation is not always accepted. For example, if the top recommendation is "cell cycle process" but the next 8 are related to "apoptosis", the user might infer that the true mechanism is apoptosis-related. This happens because a gene can belong to multiple gene sets. What the user is really seeking is a "label cloud" for their process that they can investigate.
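The "label cloud" idea can be illustrated with a minimal sketch that counts how many of the top-ranked labels mention each word. The function name `label_cloud`, the stop-word list, and the example labels are all hypothetical and chosen only for illustration; this is one possible way to surface recurring themes, not a proposed implementation.

```python
from collections import Counter

# Hypothetical stop words; a real label vocabulary would likely need a fuller list.
STOP_WORDS = {"of", "the", "in", "to", "and", "process"}


def label_cloud(ranked_labels, top_n=10):
    """Count how many of the top_n labels mention each non-stop word."""
    counts = Counter()
    for label in ranked_labels[:top_n]:
        # Use a set so a word is counted at most once per label.
        words = set(label.lower().split()) - STOP_WORDS
        counts.update(words)
    return counts


# Hypothetical ranked labels, best match first.
ranked = [
    "cell cycle process",
    "regulation of apoptosis",
    "apoptosis signaling",
    "positive regulation of apoptosis",
]
cloud = label_cloud(ranked)
```

In this made-up ranking the top label is about the cell cycle, yet "apoptosis" is the most frequent term across the list, which is the kind of signal a user would weigh when the top recommendation looks like an outlier.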
Implementation
There are two decisions to be made:
What labeled gene sets should be candidates for labeling?
What test should be run to compare the candidate sets with the query set?
This review suggests that over-representation analysis (ORA), executed with a Fisher's exact test, is the fastest approach to obtain answers. Because this method assumes that the genes in a set are independent, it may not generate accurate p-values. However, judged by performance at ranking relevant sets above non-relevant sets, it performs as well as any more computationally complex (and much longer running) approach. Due to these characteristics, ORA is the most commonly used approach.
Based on this, we propose using ORA.
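For reference, the one-sided Fisher's exact test used in ORA is equivalent to an upper-tail hypergeometric probability. The symbols below are assumptions introduced here for illustration (they follow the scipy.stats parameter naming): M is the number of genes in the background, n the size of the candidate gene set, N the size of the query gene set, and k the number of genes the two sets share.

```latex
p \;=\; P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,N)} \frac{\binom{n}{i}\,\binom{M-n}{N-i}}{\binom{M}{N}}
```

A small p means an overlap of at least k genes would be unlikely if the query genes were drawn at random from the background. This is also where the independence assumption enters.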
ORA execution:
See this article for the intuition behind ORA and how it is implemented in R.
See this answer for how to use the hypergeometric test from scipy.stats to calculate over-representation of a query set in a single candidate gene set (this would need to be parallelized across candidate sets).
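As a rough illustration of the execution step, the sketch below scores a query set against several candidate sets, keeps those under a significance threshold, and orders them by p-value. To stay dependency-free it computes the hypergeometric tail with `math.comb`. The same value is given by `scipy.stats.hypergeom.sf(k - 1, M, n, N)`. The function names, the `alpha` default of 0.05, the gene identifiers, and the 20-gene background are all assumptions for illustration. The sketch loops over candidates serially and applies no multiple-testing correction.

```python
from math import comb


def ora_p_value(query, candidate, background_size):
    """One-sided hypergeometric p-value: P(overlap >= observed overlap).

    Assumes query and candidate are sets drawn from the same background
    of `background_size` genes.
    """
    M = background_size          # genes in the background
    n = len(candidate)           # genes in the candidate set
    N = len(query)               # genes in the query set
    k = len(query & candidate)   # observed overlap
    # Sum the hypergeometric probabilities for overlaps of k or more.
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return tail / comb(M, N)


def rank_candidates(query, candidates, background_size, alpha=0.05):
    """Return (label, p_value) pairs with p_value <= alpha, smallest first."""
    scored = [
        (label, ora_p_value(query, genes, background_size))
        for label, genes in candidates.items()
    ]
    return sorted((s for s in scored if s[1] <= alpha), key=lambda s: s[1])


# Hypothetical data: a 20-gene background, a 4-gene query, two candidate sets.
query = {"G1", "G2", "G3", "G4"}
candidates = {
    "apoptosis": {"G1", "G2", "G3", "G9"},
    "cell cycle process": {"G4", "G10", "G11", "G12", "G13"},
}
ranked = rank_candidates(query, candidates, background_size=20)
```

With these made-up inputs only the "apoptosis" set passes the threshold (p = 65/4845, about 0.013), while the single-gene overlap with "cell cycle process" is unremarkable (p = 3480/4845, about 0.72). Because each candidate is scored independently, the list comprehension in `rank_candidates` is the natural place to parallelize.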
Research on candidate gene sets