chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

Differential expression via categorical metadata #729

Open colinmegill opened 5 years ago

colinmegill commented 5 years ago

Problems:

Solution:

While this adds a bit of overhead to computing differential expression for cases involving continuous and spatial information, it allows users to return to those selections, since they have been persisted as new categorical metadata fields. It also reduces the number of clicks necessary to compute diffexp when only categorical metadata is involved.

Mock:

image

neurojacob commented 4 years ago

This solution would solve part of the issue I am having. As it currently stands, it is not obvious to users who have not read the documentation (read: most) how to calculate differential expression and/or find marker genes. In addition to this solution, it would also be important to give the user a way to track which comparison they are calculating, for instance by showing 'Group 1: Fat, Heart, Kidney' vs. 'Group 2: Large_Intestine'.

It is also essential that differentially expressed genes are returned as a list to give context, rather than just as individual histograms. This list would need to be downloadable to be useful!

ambrosejcarr commented 4 years ago

If the diffexp button (see mock) of one categorical metadata option is selected (e.g., Heart), compute cells in this category vs all other cells (invert selection)

This "one vs all else" test is typically executed to detect "marker genes", or genes that uniquely characterize a particular population. To accomplish this, a more sophisticated test function must be used. See scanpy documentation. In brief, consider the following situation:

There are populations a, b, c, d, e, and f, each of which has the same number of cells. Gene 1 is expressed at value "1" in populations a and b and at value "0" in the remaining populations.

A t-test comparing a with (b, c, d, e, f) would detect an average expression of 1 in population a and 0.2 in the "other" population. The t-test would report a significant result. However, gene 1 is not a marker gene of population a, because it is not uniquely expressed there.

There are a number of ways to address this problem. Scanpy has some more sophisticated examples, but I give a simple one here to explain how the problem can be solved. A series of tests is run. First, an ANOVA asks the question "is the expression of gene 1 different in any of these groups?". For genes that pass this test (gene 1 would pass), follow-up pairwise t-tests are run sequentially for population a vs each other population. If and only if the gene is expressed at a significantly higher level in a than in every other population, it is returned as a marker gene. In the above example, gene 1 is not expressed at a higher level in population a than in population b, so it would be discarded.
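A minimal Python sketch of that two-stage idea, using `scipy.stats`. This is an illustration only, not how cellxgene or scanpy compute marker genes: the names `is_marker` and `toy_gene`, the threshold of 0.01, the one-sided Welch t-test, and the jittered toy data are all assumptions made for the example, and no multiple-testing correction is applied.

```python
from scipy import stats

ALPHA = 0.01  # significance threshold; an arbitrary choice for this sketch


def is_marker(expr_by_pop, target, alpha=ALPHA):
    """True if the gene is significantly higher in `target` than in every other population."""
    # Step 1: one-way ANOVA asks whether expression differs in any population.
    _, p_anova = stats.f_oneway(*expr_by_pop.values())
    if p_anova >= alpha:
        return False
    # Step 2: one-sided Welch t-tests of the target against each other population.
    for name, values in expr_by_pop.items():
        if name == target:
            continue
        _, p = stats.ttest_ind(expr_by_pop[target], values,
                               equal_var=False, alternative="greater")
        if p >= alpha:
            return False  # not higher than this population, so not a marker
    return True


def toy_gene(high_pops, n_cells=20):
    """Toy expression: about 1 in `high_pops`, about 0 elsewhere, with small fixed jitter."""
    jitter = [-0.1 + 0.2 * i / (n_cells - 1) for i in range(n_cells)]
    return {pop: [(1.0 if pop in high_pops else 0.0) + j for j in jitter]
            for pop in "abcdef"}


gene1 = toy_gene({"a", "b"})  # the gene from the example above
gene2 = toy_gene({"a"})       # a gene expressed only in population a

# Naive "a vs everything else" t-test on gene 1: reports a significant difference.
rest = [v for pop, vals in gene1.items() if pop != "a" for v in vals]
_, naive_p = stats.ttest_ind(gene1["a"], rest, equal_var=False)

print(naive_p < ALPHA)        # naive test calls gene 1 significant
print(is_marker(gene1, "a"))  # the two-stage test rejects it (a is not higher than b)
print(is_marker(gene2, "a"))  # gene 2 passes
```

The jitter is only there so that the within-population variance is nonzero; with perfectly constant values the t-statistics would be undefined.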

colinmegill commented 3 years ago

An idea from @aopisco and Lubert Stryer: on running differential expression, one could automatically create a corresponding category with labels named 'pop1' and 'pop2', where the category name itself is the timestamp of the run. cc @seve
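A rough Python sketch of what such a side effect might produce. Everything here is hypothetical and for illustration only: the function name `shadow_category`, the "unassigned" label for cells in neither population, and the timestamp format are assumptions, not existing cellxgene behavior.

```python
from datetime import datetime


def shadow_category(n_cells, pop1, pop2, run_time=None):
    """Return (category_name, labels) recording one diffexp run.

    pop1 and pop2 are collections of cell indices. Cells in neither
    population are labeled "unassigned" (an assumption of this sketch).
    """
    run_time = run_time or datetime.now()
    # The category is named after the time of the run.
    name = run_time.strftime("diffexp %Y-%m-%d %H:%M:%S")
    pop1, pop2 = set(pop1), set(pop2)
    if pop1 & pop2:
        raise ValueError("a cell cannot be in both populations")
    labels = []
    for i in range(n_cells):
        if i in pop1:
            labels.append("pop1")
        elif i in pop2:
            labels.append("pop2")
        else:
            labels.append("unassigned")
    return name, labels


name, labels = shadow_category(6, pop1=[0, 1], pop2=[4],
                               run_time=datetime(2021, 1, 1, 12, 0))
print(name)    # diffexp 2021-01-01 12:00:00
print(labels)  # ['pop1', 'pop1', 'unassigned', 'unassigned', 'pop2', 'unassigned']
```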

colinmegill commented 3 years ago

I think that's good enough to close this issue, and it's composable from actions we already have. We already have the creation of two genesets as a 'side effect', so more side effects are not necessarily a liability in the application.

colinmegill commented 3 years ago

The disadvantage of this path is that, while fast, it doesn't work well for lung vs heart.

colinmegill commented 3 years ago

The solution is likely to have a dropdown on the diffexp button so that the options are:

  1. do diffexp
  2. do diffexp and create shadow category based on run

ambrosejcarr commented 3 years ago

I have consistently had the issue where I circle some cells, assign them to group 1, circle some other cells, assign them to group 2, compute DE, but then forget which cells I selected. Am I correct that this suggestion would address this problem?

If that's the case, I'm wondering how this fits into the longer-term vision of anchoring differential expression to categories. My sense was that the proposal there was:

  1. select cells and create category a
  2. select cells and create category b
  3. select a and b and select "compute de".

If lung vs heart, then you skip to

  1. select a=lung, b=heart, and select "compute de".
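To make the category-anchored flow concrete, here is a small Python sketch of how the two groups could be derived from a single categorical field. The function name `groups_from_category` and the toy `tissue` labels are hypothetical; treating a missing second group as "every other cell" is an assumption that mirrors the one-vs-rest case discussed earlier in the thread.

```python
def groups_from_category(labels, a, b=None):
    """Split cell indices into two diffexp groups using one categorical field.

    labels: one label per cell. a, b: sets of label values.
    If b is None, group 2 is every cell not in group 1.
    """
    group1 = [i for i, lab in enumerate(labels) if lab in a]
    if b is None:
        group2 = [i for i, lab in enumerate(labels) if lab not in a]
    else:
        group2 = [i for i, lab in enumerate(labels) if lab in b]
    return group1, group2


tissue = ["Lung", "Heart", "Kidney", "Lung", "Heart", "Fat"]
print(groups_from_category(tissue, {"Lung"}, {"Heart"}))  # ([0, 3], [1, 4])
print(groups_from_category(tissue, {"Heart"}))            # ([1, 4], [0, 2, 3, 5])
```

Because the groups are defined by label values rather than by a lasso selection, the comparison can be reconstructed later from the category alone.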

colinmegill commented 3 years ago

Yes. This proposal addresses the problem (also the cause of this issue in the first place) that users forget what their selection was (quite a large issue).

Quick to implement, since it's composed of existing actions and a side effect.