Differential expression via categorical metadata

colinmegill commented 5 years ago

Problems:

Strange to have diffexp / computation in left sidebar with metadata
Users cannot return to a selection once they've made it and changed it
Differential expression selections do not persist between sessions

Solution:

Compute differential expression based on selections of categorical metadata
If the diffexp button (see mock) of one categorical metadata option is selected (e.g., Heart), compute diffexp for cells in this category vs all other cells (invert selection)
If the diffexp buttons of two categorical metadata options are selected, compute diffexp comparing the two (e.g., Heart vs Lung)
This issue depends partly on #524: to perform differential expression which includes continuous or spatial metadata, the user will need to create a new categorical metadata field and use that to compute differential expression.

While this adds a bit of overhead to computing differential expression for cases involving continuous and spatial information, it allows users to return to those selections as they've been persisted as new categorical metadata fields. It reduces the number of clicks necessary to compute diffexp if only categorical metadata is involved.
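A minimal sketch of the selection rule described above, in Python for illustration only. The function name `resolve_diffexp_groups` and the list-of-labels representation are assumptions, not part of cellxgene; the sketch only shows how one or two selected labels could map to the two cell sets being compared.

```python
from typing import List, Sequence, Tuple


def resolve_diffexp_groups(
    labels: Sequence[str], selected: Sequence[str]
) -> Tuple[List[int], List[int]]:
    """Return the two sets of cell indices to compare.

    labels   -- one categorical label per cell (e.g. its tissue)
    selected -- the labels whose diffexp button is toggled on
    """
    if len(selected) == 1:
        # One label selected: that label vs. all other cells (inverted selection).
        target = selected[0]
        set1 = [i for i, label in enumerate(labels) if label == target]
        set2 = [i for i, label in enumerate(labels) if label != target]
    elif len(selected) == 2:
        # Two labels selected: compare them directly, ignoring every other cell.
        first, second = selected
        set1 = [i for i, label in enumerate(labels) if label == first]
        set2 = [i for i, label in enumerate(labels) if label == second]
    else:
        raise ValueError("select exactly one or two labels")
    return set1, set2


tissue = ["Heart", "Lung", "Heart", "Kidney"]
heart_vs_rest = resolve_diffexp_groups(tissue, ["Heart"])
heart_vs_lung = resolve_diffexp_groups(tissue, ["Heart", "Lung"])
```

Here `heart_vs_rest` pairs the Heart cells with every other cell, while `heart_vs_lung` leaves the Kidney cell out of both sets.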

Mock:

neurojacob commented 4 years ago

This solution would solve part of the issue I am having. As it currently stands, it is not obvious to users who have not read the documentation (read: most) how to calculate differential expression and/or find marker genes. I would like to add that, in addition to this solution, it would also be important to give the user a way to track which differentials they are calculating, for instance by displaying 'Group 1: Fat, Heart, Kidney' vs. 'Group 2: Large_Intestine'.

It is also essential that differentially expressed genes are returned as a list to give context, rather than just as individual histograms. This list would need to be downloadable to be useful!

ambrosejcarr commented 4 years ago

If the diffexp button (see mock) of one categorical metadata option is selected (e.g., Heart), compute diffexp for cells in this category vs all other cells (invert selection)

This "one vs all else" test is typically executed to detect "marker genes", or genes that uniquely characterize a particular population. To accomplish this, a more sophisticated test function must be used. See scanpy documentation. In brief, consider the following situation:

There are populations a, b, c, d, e, and f, each of which has the same number of cells. Gene 1 is expressed at value "1" in populations a and b and at value "0" in the remaining populations.

A t-test comparing a with (b, c, d, e, f) detects an average expression of 1 in population a and 0.2 in the "other" population. The t-test would report a significant result; however, gene 1 is not a marker gene of population a, because it is not uniquely expressed there.
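The arithmetic in this example can be checked with a few lines of Python. This is only an illustration: the population size of 4 is an assumption (any equal size gives the same means), and the values are the idealized 1 and 0 from the example rather than real expression data.

```python
from statistics import mean

CELLS_PER_POPULATION = 4  # assumed; any equal size gives the same means

# Gene 1 is expressed at 1 in populations a and b, and at 0 elsewhere.
gene1 = {
    name: [1.0 if name in ("a", "b") else 0.0] * CELLS_PER_POPULATION
    for name in "abcdef"
}

# Population a vs. the pooled "other" population (b, c, d, e, f).
mean_a = mean(gene1["a"])
rest = [value for name, cells in gene1.items() if name != "a" for value in cells]
mean_rest = mean(rest)

# Population a vs. population b alone.
mean_b = mean(gene1["b"])
difference_a_vs_b = mean_a - mean_b

print(mean_a, mean_rest, difference_a_vs_b)
```

The pooled comparison shows a large gap between a and "everything else", yet the difference between a and b is zero, which is why gene 1 should not be reported as a marker of a.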

There are a number of ways to address this problem. Scanpy has some more sophisticated examples, but I give a simple one here to explain how the problem can be solved. A series of tests is run. First, an ANOVA asks the question "is the expression of gene 1 different in any of these groups?". For genes that pass this test (gene 1 would pass), follow-up pairwise t-tests are run sequentially for population a vs each other population. If and only if the gene is expressed at a significantly higher level in a than in every other population, it is returned as a marker gene. In the above example, gene 1 is not expressed at a higher level in population a than in population b, so it would be discarded.
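The two-stage procedure above could be sketched roughly as follows. This is not Scanpy's method or cellxgene's implementation: the function `is_marker_gene`, its parameters, and the `alpha` threshold are hypothetical, and the statistical tests are passed in as callables so the sketch stays independent of any particular library. No multiple-testing correction is applied here, which a real implementation would likely need.

```python
from typing import Callable, Dict, List, Sequence

Groups = Dict[str, Sequence[float]]


def is_marker_gene(
    expression: Groups,
    target: str,
    omnibus_p: Callable[[List[Sequence[float]]], float],
    pairwise_p: Callable[[Sequence[float], Sequence[float]], float],
    alpha: float = 0.05,
) -> bool:
    """Decide whether one gene is a marker of the population `target`.

    expression -- per-population expression values for a single gene
    omnibus_p  -- p-value for "does any group differ?" (e.g. an ANOVA)
    pairwise_p -- one-sided p-value for "first group higher than second"
    alpha      -- assumed significance threshold
    """
    # Step 1: omnibus test across all populations.
    if omnibus_p(list(expression.values())) >= alpha:
        return False
    # Step 2: the target must beat every other population, one at a time.
    for name, values in expression.items():
        if name == target:
            continue
        if pairwise_p(expression[target], values) >= alpha:
            return False  # not significantly higher than this population
    return True
```

With SciPy installed, one plausible choice would be `lambda groups: scipy.stats.f_oneway(*groups).pvalue` for `omnibus_p` and `lambda x, y: scipy.stats.ttest_ind(x, y, alternative="greater").pvalue` for `pairwise_p`. In the example above, the comparison of a against b would fail step 2, so gene 1 would be discarded.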

colinmegill commented 3 years ago

An idea from @aopisco and Lubert Stryer — one could, on the running of differential expression, automatically create a corresponding category with labels named 'pop1 and pop2', where the category name itself is the timestamp that corresponds to the run. cc @seve

colinmegill commented 3 years ago

I think that's good enough to close this issue, and it's composable from actions we already have. We already have the creation of two genesets as a 'side effect', so more side effects are not necessarily a liability in the application.

colinmegill commented 3 years ago

The disadvantage of this path is that, while fast, it doesn't work well for lung vs heart.

colinmegill commented 3 years ago

The solution is likely to have a dropdown on the diffexp button so that the options are:

do diffexp
do diffexp and create shadow category based on run

ambrosejcarr commented 3 years ago

I have consistently had the issue where I circle some cells, assign them to group 1, circle some other cells, assign them to group 2, compute DE, but then forget what cells I selected. Am I correct that this suggestion would address this problem?

If that's the case, I'm wondering how this fits into the longer term vision of anchoring differential expression to categories. My sense was that the proposal there was:

select cells and create category a
select cells and create category b
select a and b and select "compute de".

If lung vs heart, then you skip to

select a=lung, b=heart, and select "compute de".

colinmegill commented 3 years ago

Yes. This proposal addresses the problem (also the cause of this issue in the first place) that users forget what their selection was (quite a large issue).

The initial proposal and mock tie diffexp to categorical metadata:
- it forces users to make a category every time they want to compute custom diffexp on a selection that isn't already a category
- it is very fast in the case that you want 'existing category vs all else': expand Tissue, click the diffexp button next to Lung, and diffexp is automatically and immediately run on Lung vs all except Lung (running automatically is optional, but likely, as it would remove another button from the top bar)

The amended proposal from Pisco and Stryer suggests that, like the automatically created diffexp gene sets, the creation of a category for the selection is procedural and, unlike gene sets, only happens if the user chooses that side effect:
- A simple dropdown on the button would suffice
  - either compute
  - or compute and create category from selection
- lung vs heart already exists as a category, so just compute
- lasso region vs lasso region doesn't exist, so compute and create a category with the same params as the gene set name
  - category name: run timestamp
  - label 1: Pop1
  - label 2: Pop2
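The "compute and create category" side effect might look something like the sketch below. Everything here is hypothetical: the function names, the plain-dict store of categories, the timestamp format, and the `unassigned` label for cells in neither population are assumptions made for illustration, and the diffexp computation itself is left out.

```python
from datetime import datetime
from typing import Dict, List, Optional, Sequence, Tuple

UNASSIGNED = "unassigned"  # assumed label for cells in neither population


def shadow_category(
    n_cells: int, pop1: Sequence[int], pop2: Sequence[int], run_time: datetime
) -> Tuple[str, List[str]]:
    """Build the category that records the selections of one diffexp run."""
    name = run_time.strftime("%Y-%m-%d %H:%M:%S")  # category name: run timestamp
    labels = [UNASSIGNED] * n_cells
    for i in pop1:
        labels[i] = "Pop1"
    for i in pop2:
        labels[i] = "Pop2"
    return name, labels


def run_diffexp(
    categories: Dict[str, List[str]],
    n_cells: int,
    pop1: Sequence[int],
    pop2: Sequence[int],
    create_category: bool,
    run_time: Optional[datetime] = None,
) -> Dict[str, List[str]]:
    """Dropdown handler: 'compute' or 'compute and create category'."""
    # The differential expression computation itself is omitted from this sketch.
    if create_category:
        name, labels = shadow_category(n_cells, pop1, pop2, run_time or datetime.now())
        categories[name] = labels
    return categories
```

For lung vs heart the caller would pass `create_category=False`, since the category already exists; for lasso vs lasso it would pass `True` so the selections can be found again later.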

Quick to implement, since it's composed of existing actions and a side effect.

I presently believe these are mutually exclusive solutions, because keeping the cell set 1 and cell set 2 concepts continues to locate diffexp on the top bar. I have wanted to remove those for a long time, and this has been my best idea for doing so. But the buttons are, perhaps, fast, and have been a durable UI component thus far, despite being a throwaway implementation.

chanzuckerberg / cellxgene