chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Computational biologists can identify genes underlying the biological differences between groups of Census cells, via fast, batch-corrected atlas-level differential gene expression. #847

Open atolopko-czi opened 11 months ago

atolopko-czi commented 11 months ago

[Re-written by @pablo-gar]

Context

Census launched in 2023 with its baseline functionality for efficient data access to the largest aggregation of standardized single-cell data, providing a stable data structure and API.

Census has demonstrated its utility in accelerating workflows for “expert data analysts”, reducing more than 3 months of work to less than a day, particularly around data gathering and wrangling. Please see the user segments document.

In 2024 we aim to continue our support of “expert data analysts” by improving upon existing features around data access patterns, but our bigger focus will become “data and tool consumers” and high-value “tool builders” (defined as those with high reach to “data and tool consumers”).

One of the main workflows that “data and tool consumers” follow when analyzing single-cell data is differential gene expression, which identifies the underlying biological sources that distinguish groups of cells.

This epic encapsulates user-stories and requirements for Census to support differential gene expression.

Stories

  1. I am a computational biologist who wants to identify significantly differentially expressed genes between any two groups of cells in Census as defined by the Census metadata. I want to identify genes that underlie biological differences rather than technical differences due to the multi-dataset nature of Census.
  2. I am a computational biologist who wants to identify significantly differentially expressed genes between any two groups of cells in Census as defined by the Census metadata. While I'd like to account for batch effects present in Census, I'd also like the flexibility to select which biological covariates to account for based on the hypothesis I'm testing. For example, in some cases I would not want to account for age or ancestry in my analysis.

Product requirements

An experimental API that can perform differential gene expression between any two groups of cells from Census as defined by the Census metadata while accounting for known batch effects.

Refinements needed

Differential gene expression challenges with Census-scale data

Potential methods to perform comparisons

Memento

WIP

T-test with batch effect correction (linear regression with covariates)

WIP
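
While this section is still WIP, the idea of a t-test with batch correction via linear regression can be sketched for a single gene. The following is a minimal illustration, not the Census implementation: it assumes a binary group indicator and one categorical batch covariate, and fits `expr ~ group + batch` by demeaning within each batch (the Frisch-Waugh-Lovell shortcut for fixed effects). The function name `batch_adjusted_t` and its inputs are hypothetical.

```python
from collections import defaultdict
from math import inf, sqrt


def batch_adjusted_t(expr, group, batch):
    """OLS of expr ~ group + batch fixed effects for one gene.

    expr:  expression values, one per cell
    group: 0/1 indicator (group A vs group B), one per cell
    batch: batch label (e.g. a dataset id), one per cell
    Returns (effect, t_statistic) for the group coefficient.
    """
    # Accumulate per-batch sums of expression, group indicator, and cell count.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for y, g, b in zip(expr, group, batch):
        s = sums[b]
        s[0] += y
        s[1] += g
        s[2] += 1

    # Remove batch means from both the response and the group indicator.
    y_res = [y - sums[b][0] / sums[b][2] for y, b in zip(expr, batch)]
    g_res = [g - sums[b][1] / sums[b][2] for g, b in zip(group, batch)]

    sxx = sum(g * g for g in g_res)
    if sxx == 0:
        raise ValueError("group is constant within every batch; effect not identifiable")

    effect = sum(g * y for g, y in zip(g_res, y_res)) / sxx
    rss = sum((y - effect * g) ** 2 for g, y in zip(g_res, y_res))

    # Degrees of freedom: cells minus one mean per batch minus the group coefficient.
    dof = len(expr) - len(sums) - 1
    if dof <= 0:
        raise ValueError("not enough cells for the number of batches")

    se = sqrt(rss / dof / sxx)
    return effect, (effect / se if se > 0 else inf)


# Two batches with a large batch shift (+4) and a true group effect of 1.
expr = [1.0, 1.2, 2.0, 2.2, 5.0, 5.2, 6.0, 6.2]
group = [0, 0, 1, 1, 0, 0, 1, 1]
batch = ["d1"] * 4 + ["d2"] * 4
effect, t_stat = batch_adjusted_t(expr, group, batch)
print(f"effect={effect:.3f}, t={t_stat:.2f}")
```

In this toy data the batch shift is absorbed by the batch means, so the estimated effect reflects only the difference between groups. Additional biological covariates (age, ancestry) would need a general multiple-regression solver rather than this single-covariate shortcut.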

prathapsridharan commented 8 months ago

@pablo-gar - Thanks for the product requirements write-up. Some thoughts (@atolopko-czi @atarashansky - please advise as well):

  1. To distinguish the 2 populations to compare, you have stated there needs to be: "obs filter", "feature filter", "group A assertion that is subset of the obs filter", "group B assertion that is subset of the obs filter". I want to clarify whether this is how Census users would expect to invoke differential expression.

As an alternative, one could specify: "feature filter", "group A obs query", "group B obs query" and get rid of "obs filter". This more explicitly identifies group A and group B. I am not advocating for it. In fact, there are failure modes with the latter in that one could specify a "group A" and "group B" that are entirely disjoint in that there are no groups of obs rows between the 2 populations greater than 1.

So is the approach you laid out ("obs filter", "feature filter", "group A assertion", "group B assertion") the convention that users will be most comfortable with, rather than any other convention? For example, the current code requires the user to provide an "obs filter query" and a "treatment" attribute (e.g., sex_ontology_term_id) such that the "treatment" attribute has only 2 distinct values in the result set returned by the "obs filter query". This effectively determines "group A" and "group B". Is this way intuitive enough for the users?
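
To make the current convention concrete, here is a minimal sketch of how a treatment attribute with exactly 2 distinct values determines the two groups. It is not the actual Census code: `split_by_treatment` is a hypothetical helper, and the obs rows are plain dicts standing in for the result of an obs filter query.

```python
from collections import defaultdict


def split_by_treatment(obs_rows, treatment):
    """Partition filtered obs rows into two groups by a treatment column.

    Raises ValueError unless the column has exactly two distinct values
    in the filtered rows.
    """
    groups = defaultdict(list)
    for row in obs_rows:
        groups[row[treatment]].append(row)

    if len(groups) != 2:
        raise ValueError(
            f"{treatment!r} has {len(groups)} distinct values; expected exactly 2"
        )

    # Sort labels so that the assignment of group A and group B is deterministic.
    (label_a, rows_a), (label_b, rows_b) = sorted(groups.items())
    return {label_a: rows_a, label_b: rows_b}


# Hypothetical rows, as if returned by an obs filter on a single tissue.
rows = [
    {"soma_joinid": 0, "sex_ontology_term_id": "PATO:0000383"},
    {"soma_joinid": 1, "sex_ontology_term_id": "PATO:0000384"},
    {"soma_joinid": 2, "sex_ontology_term_id": "PATO:0000383"},
]
groups = split_by_treatment(rows, "sex_ontology_term_id")
print({label: len(members) for label, members in groups.items()})
```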

  2. Regarding "The list of genes should have associated statistics encoding significance" - Does the statistic change depending on the methodology we use, for example if we swap out memento for a t-test or some other fancy method? Should we even care about designing to support the ability to swap implementations now?
  3. Related to (2) above, in what order should the genes be returned? For example, the current memento implementation returns genes in descending order of z-score. Is that sufficient? How does this change if we have to swap one implementation for another? Should we even care about designing to support the ability to swap implementations now?
  4. To address the performance requirements, we will need to identify a priori queries to benchmark (that is, queries comparing millions of obs and queries comparing hundreds of thousands of obs). If you have such queries already, please provide them.

I think alignment on (1), (2), and (3) is enough to get started with a tech spec
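
One way to frame the statistic and ordering questions above is a result record that names its statistic, so the ordering rule does not depend on which method produced it. This is only a sketch for discussion: `GeneResult`, `rank_genes`, and the example values are hypothetical and not part of any existing Census or memento API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneResult:
    gene: str
    statistic: float      # e.g. a z-score or a t-statistic
    statistic_name: str   # records which method produced the statistic
    p_value: float


def rank_genes(results):
    """Return results in descending order of the statistic."""
    return sorted(results, key=lambda r: r.statistic, reverse=True)


# Made-up values, purely to show the ordering.
results = [
    GeneResult("GENE_A", 1.5, "z_score", 0.13),
    GeneResult("GENE_B", 4.2, "z_score", 0.00003),
    GeneResult("GENE_C", -3.1, "z_score", 0.002),
]
for r in rank_genes(results):
    print(r.gene, r.statistic_name, r.statistic)
```

Note that a plain descending sort places strongly down-regulated genes last, so whether to rank by the signed statistic or by its absolute value is part of the same open question.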

atolopko-czi commented 8 months ago

> As an alternative, one could specify: "feature filter", "group A obs query", "group B obs query" and get rid of "obs filter". This more explicitly identifies group A and group B. I am not advocating for it.

Note that you'd still have to verify that each individual query has single-valued treatment column and that the two query results produce rows with different treatment values.

> In fact, there are failure modes with the latter in that one could specify a "group A" and "group B" that are entirely disjoint in that there are no groups of obs rows between the 2 populations greater than 1.

They should be entirely disjoint, but maybe I'm misunderstanding your concern.

prathapsridharan commented 8 months ago

> Note that you'd still have to verify that each individual query has single-valued treatment column and that the two query results produce rows with different treatment values.

@atolopko-czi - So are you saying that each group is distinguished by the values in the treatment-column such that the intersection of unique values for treatment-column between groupA and groupB is the empty set? That is, set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values) is set() (empty)?

If so, then that implies that len(treatment.unique_values) across both groups could be any positive integer, say 5, but our current implementation asserts that len(treatment.unique_values) == 2 across both groups right?
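
The conditions being discussed here can be written out explicitly. This is a small sketch with a hypothetical helper, `check_groups`, which takes the treatment values observed in each group's query result. In Python a bare `{}` is an empty dict rather than an empty set, so disjointness is tested with `set.isdisjoint`.

```python
def check_groups(group_a_values, group_b_values):
    """Summarize how two groups relate on the treatment column."""
    a, b = set(group_a_values), set(group_b_values)
    return {
        # No treatment value appears in both groups.
        "disjoint": a.isdisjoint(b),
        # Each group's query result has a single treatment value.
        "single_valued": len(a) == 1 and len(b) == 1,
        # Number of distinct treatment values across both groups.
        "n_distinct": len(a | b),
    }


# Single-valued groups: exactly 2 distinct values overall.
print(check_groups(["F", "F"], ["M"]))

# Disjoint but multi-valued groups: 5 distinct values overall.
print(check_groups(["a", "b"], ["c", "d", "e"]))
```

Disjointness alone permits any number of distinct values across the two groups. Requiring each group to be single-valued as well is what brings the total to exactly 2.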

prathapsridharan commented 8 months ago

> They should be entirely disjoint, but maybe I'm misunderstanding your concern.

@atolopko-czi - Oh, I see. They are entirely disjoint precisely because set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values) == set(), correct?

prathapsridharan commented 8 months ago

@pablo-gar @atolopko-czi - Regarding performance requirements

> Can complete under 10 min for large comparisons millions of cells.

The new manuscript mentions that this case of 10^6 cells takes 13 min on a single core (non-parallelized) and 2-3 min on 6 cores (parallelized) for the vanilla method (using bootstrapping). I assume most Census users will have a multicore machine available (at least 2 cores, if not 4+).

Is the runtime requirement of 10 min for a non-parallel implementation of the census-memento? (I am calling it the census memento vs vanilla memento to distinguish our implementation from the sampling and bootstrapping implementation in the paper)
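
For reference, the arithmetic behind this question, using only the vanilla-method timings quoted above (13 min on one core, 2-3 min on 6 cores) and the 10 min requirement. These figures describe the vanilla method and may not carry over to the census memento implementation.

```python
def speedup(serial_min, parallel_min):
    """Ratio of single-core runtime to parallel runtime."""
    return serial_min / parallel_min


SERIAL_MIN = 13              # single core, from the manuscript
PARALLEL_MIN_RANGE = (2, 3)  # 6 cores, from the manuscript
CORES = 6
BUDGET_MIN = 10              # runtime requirement under discussion

for parallel_min in PARALLEL_MIN_RANGE:
    s = speedup(SERIAL_MIN, parallel_min)
    print(
        f"{parallel_min} min on {CORES} cores: speedup {s:.2f}x, "
        f"efficiency {s / CORES:.2f}, within budget: {parallel_min <= BUDGET_MIN}"
    )

print("single core within budget:", SERIAL_MIN <= BUDGET_MIN)
```

By these numbers the single-core run misses a 10 min budget while the 6-core run meets it comfortably, which is why it matters whether the requirement applies to a non-parallel implementation.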