Open colinmegill opened 5 years ago
This solution would solve part of the issue I am having. As it currently stands, it is not obvious to users who have not read the documentation (read: most) how to calculate differential expression and/or find marker genes. In addition to this solution, it would also be important to give the user a way to track which differentials they are calculating, for instance by showing 'Group 1: Fat, Heart, Kidney' vs. 'Group 2: Large_Intestine'.
It is also essential that differentially expressed genes are returned as a list to give context, rather than just as individual histograms. This list would need to be downloadable to be useful!
If the diffexp button (see mock) of one categorical metadata option is selected (e.g., Heart), compute differential expression for cells in this category vs. all other cells (invert selection).
This "one vs all else" test is typically executed to detect "marker genes", or genes that uniquely characterize a particular population. To accomplish this, a more sophisticated test function must be used. See scanpy documentation. In brief, consider the following situation:
There are populations a, b, c, d, e, and f, each of which has the same number of cells. gene 1 is expressed in populations a and b at value "1" and value "0" in the remaining populations.
A t-test comparing a with (b, c, d, e, f) detects an average expression of 1 in population a and 0.2 in the "other" population. The t-test would report a significant result; however, gene 1 is not a marker gene of population a, because it is not uniquely expressed there.
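To make the arithmetic concrete, here is a minimal sketch of the situation above using only the Python standard library. The number of cells per population is a hypothetical value chosen for illustration; any equal group size gives the same means.

```python
from statistics import mean

populations = ["a", "b", "c", "d", "e", "f"]
cells_per_population = 4  # hypothetical; the source only says the groups are equal in size

# Gene 1: value 1 in populations a and b, value 0 in the remaining populations.
gene1 = {
    p: [1.0 if p in ("a", "b") else 0.0] * cells_per_population
    for p in populations
}

# "One vs all else": population a against the pooled cells of b..f.
mean_a = mean(gene1["a"])
others = [value for p in populations if p != "a" for value in gene1[p]]
mean_others = mean(others)

print(mean_a, mean_others)
```

Population b contributes one fifth of the pooled "other" cells, which is where the 0.2 comes from, even though b expresses the gene just as strongly as a.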
There are a number of ways to address this problem. Scanpy has some more sophisticated examples, but I give a simple one here to explain how the problem can be solved. A series of tests is run. First, an ANOVA asks the question "is the expression of gene 1 different in any of these groups?". For genes that pass this test (gene 1 would pass), follow-up pairwise t-tests are run sequentially for population a vs. each other population. If and only if the gene is expressed at a significantly higher level in population a than in every other population, it is returned as a marker gene. In the above example, gene 1 is not expressed at a higher level in population a than in population b, so it would be discarded.
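The two-step procedure could be sketched roughly as follows with `scipy.stats.f_oneway` and `scipy.stats.ttest_ind`. This is an illustration under stated assumptions, not how cellxgene or scanpy implement marker detection: the function name `is_marker`, the 0.05 threshold, and the toy data are all hypothetical, the values are jittered around 1 and 0 so that group variances are nonzero, and no multiple-testing correction is applied.

```python
from scipy import stats


def is_marker(expr_by_pop, target, alpha=0.05):
    """Return True if the gene is significantly higher in `target`
    than in every other population (hypothetical helper)."""
    # Step 1: omnibus one-way ANOVA across all populations.
    _, p_anova = stats.f_oneway(*expr_by_pop.values())
    if not p_anova < alpha:
        return False
    # Step 2: one-sided Welch t-tests, target vs each other population.
    for name, values in expr_by_pop.items():
        if name == target:
            continue
        _, p = stats.ttest_ind(
            expr_by_pop[target], values, equal_var=False, alternative="greater"
        )
        if not p < alpha:
            return False  # not higher than this population, so not a marker
    return True


# Toy data (assumption): expression jittered around 1 ("high") or near 0 ("low").
HIGH = [0.9, 1.0, 1.1] * 10
LOW = [0.0, 0.1, 0.2] * 10

# Gene 1 is high in a and b; a hypothetical gene 2 is high only in a.
gene1 = {p: (HIGH if p in "ab" else LOW) for p in "abcdef"}
gene2 = {p: (HIGH if p == "a" else LOW) for p in "abcdef"}

print(is_marker(gene1, "a"), is_marker(gene2, "a"))
```

Gene 1 passes the ANOVA but fails the a vs. b comparison, so it is discarded; gene 2 passes every pairwise test.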
An idea from @aopisco and Lubert Stryer: when differential expression is run, one could automatically create a corresponding category with labels named 'pop1' and 'pop2', where the category name itself is the timestamp of the run. cc @seve
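As a rough sketch of that side effect, the category could be built from the two selections at the moment diffexp runs. Everything here is hypothetical: the function name `shadow_category`, the timestamp format, and the `unassigned` label for cells in neither selection are assumptions made for illustration, not part of the cellxgene codebase.

```python
from datetime import datetime, timezone


def shadow_category(n_cells, pop1, pop2, unassigned="unassigned", now=None):
    """Return (category_name, labels) recording one diffexp run.

    pop1 and pop2 are iterables of cell indices; the category name is the
    run timestamp (the format is an assumption).
    """
    now = now or datetime.now(timezone.utc)
    name = now.strftime("diffexp %Y-%m-%d %H:%M:%S")
    pop1, pop2 = set(pop1), set(pop2)
    if pop1 & pop2:
        raise ValueError("a cell cannot belong to both populations")
    labels = [
        "pop1" if i in pop1 else "pop2" if i in pop2 else unassigned
        for i in range(n_cells)
    ]
    return name, labels


# Example: 6 cells, cells 0-1 in the first selection, cell 4 in the second.
name, labels = shadow_category(6, [0, 1], [4], now=datetime(2021, 6, 28, 18, 50))
print(name, labels)
```

Persisting the labels this way lets a user return to the exact selections behind a given run.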
I think that's good enough to close this issue, and it's composable from actions we already have. We already have the creation of two genesets as a 'side effect', so more side effects are not necessarily a liability in the application.
The disadvantage of this path is that, while fast, it doesn't work well for lung vs heart.
The solution is likely to have a dropdown on the diffexp button so that the options are:
- do diffexp
- do diffexp and create shadow category based on run
I have consistently had the issue where I circle some cells, assign them to group 1, circle some other cells, assign them to group 2, compute DE, but then forget what cells I selected. Am I correct that this suggestion would address this problem?
If that's the case, I'm wondering how this fits into the longer term vision of anchoring differential expression to categories. My sense was that the proposal there was:
If lung vs heart, then you skip to
On Mon, Jun 28, 2021 at 6:50 PM Colin Megill @.***> wrote:
The solution is likely to have a dropdown on the diffexp button so that the options are:
- do diffexp
- do diffexp and create shadow category based on run
Yes. This proposal addresses the problem (also the cause of this issue in the first place) that users forget what their selection was (quite a large issue).
Tissue and click diffexp button next to Lung, and diffexp is (optionally, but likely as it would remove another button from the top bar) automatically and immediately run on Lung vs all except lung
Quick to implement, since it's composed of existing actions and a side effect.
cell set 1 and cell set 2 concepts continue to locate diffexp on the top bar. I
Problems:
Solution:
While this adds a bit of overhead to computing differential expression for cases involving continuous and spatial information, it allows users to return to those selections as they've been persisted as new categorical metadata fields. It reduces the number of clicks necessary to compute diffexp if only categorical metadata is involved.
Mock: