Computational biologists can identify genes underlying the biological differences between groups of Census cells, via fast, batch-corrected atlas-level differential gene expression.

[Re-written by @pablo-gar]

Context

Census launched in 2023 with its baseline functionality for efficient data access to the largest aggregation of standardized single-cell data, providing a stable data structure and API.

Census has demonstrated its utility in accelerating workflows for “expert data analysts”, reducing >3 months of work to less than a day, in particular around data gathering and wrangling. Please see the user segments document.

In 2024 we aim to continue our support of “expert data analysts” by improving upon existing features around data access patterns, but our bigger focus will become “data and tool consumers” and high-value “tool builders” (defined as those with high reach to “data and tool consumers”).

One of the main workflows that “data and tool consumers” follow when analyzing single-cell data is differential gene expression, used to identify the underlying biological sources of differences between groups of cells.

This epic encapsulates user-stories and requirements for Census to support differential gene expression.

Stories

I am a computational biologist who wants to identify significantly differentially expressed genes between any two groups of cells in Census as defined by the Census metadata. I want to identify genes that underlie biological differences rather than technical differences due to the multi-dataset nature of Census.
I am a computational biologist who wants to identify significantly differentially expressed genes between any two groups of cells in Census as defined by the Census metadata. While I'd like to account for the batch effects present in Census, I'd also like the flexibility to select which biological covariates to account for based on the hypothesis I'm testing; for example, in some cases I would not want to account for age or ancestry in my analysis.

Product requirements

An experimental API that can perform differential gene expression between any two groups of cells from Census as defined by the Census metadata while accounting for known batch effects.

Ingests the following:
- [Experiment -- a tiledbsoma Experiment from the Census data] OR ["census object" or "census version" and experiment name].
  - Either option is valid, defer to engineers on what's best based on the backend implementation.
- An observation filter -- an optional filter to subset the experiment data based on the Census standard metadata.
- A feature filter -- an optional filter to subset the genes to test.
- An observation "group A assertion" -- an assertion that defines group A based on obs metadata. The assertion should be validated against the observation filter above (i.e. it selects a subset of the filtered cells).
- An observation "group B assertion" -- same as above for group B.
- Covariates to account for -- what covariates to account for when performing differential gene expression.

Returns the following:
- A list of all genes as defined in the feature filter above.
- The list of genes should have associated statistics encoding significance (e.g. p-value) and effect size after comparing their expression in group A vs group B.
  - The method used for the comparison should account for the effect of the covariates indicated by the user.
  - Cells in group A and B should be restricted to the observation filter defined by the user.

Has the following performance requirements:
- Can complete in under 10 min for large comparisons (hundreds of thousands of cells).
- Can complete in under 3 min for medium comparisons (tens of thousands of cells).
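
As a rough illustration of the inputs listed under "Ingests the following", here is a minimal sketch of how a request and its group validation could look. Everything in it is hypothetical: `DiffExpRequest`, `resolve_groups`, and the use of plain Python predicates in place of Census value-filter strings are assumptions made for the example, not an existing or proposed Census API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Row = dict  # one obs row: metadata column name -> value
Predicate = Callable[[Row], bool]


@dataclass
class DiffExpRequest:
    """Hypothetical container for the inputs above; not an existing Census API."""
    group_a: Predicate                                  # "group A assertion"
    group_b: Predicate                                  # "group B assertion"
    obs_filter: Optional[Predicate] = None              # optional observation filter
    var_filter: Optional[Callable[[str], bool]] = None  # optional feature filter
    covariates: list = field(default_factory=list)      # obs columns to adjust for


def resolve_groups(obs, req):
    """Apply the obs filter, then select group A and group B inside it."""
    subset = [r for r in obs if req.obs_filter is None or req.obs_filter(r)]
    a = [r for r in subset if req.group_a(r)]
    b = [r for r in subset if req.group_b(r)]
    if not a or not b:
        raise ValueError("each group needs at least one cell inside the obs filter")
    if any(req.group_b(r) for r in a):
        raise ValueError("group A and group B must not share cells")
    missing = [c for c in req.covariates if any(c not in r for r in subset)]
    if missing:
        raise ValueError(f"unknown covariates: {missing}")
    return a, b


# Toy obs table standing in for Census metadata.
obs = [
    {"soma_joinid": 0, "tissue": "lung", "sex": "female", "dataset_id": "d1"},
    {"soma_joinid": 1, "tissue": "lung", "sex": "male", "dataset_id": "d1"},
    {"soma_joinid": 2, "tissue": "lung", "sex": "male", "dataset_id": "d2"},
    {"soma_joinid": 3, "tissue": "blood", "sex": "female", "dataset_id": "d2"},
]
req = DiffExpRequest(
    group_a=lambda r: r["sex"] == "female",
    group_b=lambda r: r["sex"] == "male",
    obs_filter=lambda r: r["tissue"] == "lung",
    covariates=["dataset_id"],
)
group_a, group_b = resolve_groups(obs, req)
```

Because both groups are selected only from the filtered rows, the blood cell never enters group A even though it satisfies the group A assertion.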

Refinements needed

API design, defer to engineers.
Method to use for comparisons.
- Must account for batch effects.
- Must be validated.
Artifacts necessary for performance improvements (e.g. cubes), defer to engineers.
- If artifacts are needed on a recurrent basis we need to assess the cost for the Census product.

Differential gene expression challenges with Census-scale data

Census is a compilation of hundreds of datasets, so batch effects are strong. Differential gene expression analysis can therefore be heavily influenced by technical rather than biological signal.
As of February 2024, the stable release of Census has more than 40M unique cells, so the performance of the differential gene expression method underlying this feature is a concern.

Potential methods to perform comparisons

Memento

WIP

T-test with batch effect correction (linear regression with covariates)

WIP
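
To make the idea in this heading concrete, here is a minimal single-gene sketch. It assumes one categorical batch covariate (for example `dataset_id`) and fits the equivalent of an ordinary least squares model with a group indicator plus one dummy variable per batch, by centering both the expression and the group indicator within each batch. The function name `batch_adjusted_t` and the toy data are illustrative assumptions, not the method chosen for Census.

```python
import math
from collections import defaultdict


def batch_adjusted_t(expr, is_group_b, batch):
    """Effect size and t-statistic for group B vs group A, adjusting for batch.

    Equivalent to OLS of expr on a 0/1 group indicator plus one dummy per
    batch, computed by centering both variables within each batch.
    """
    by_batch = defaultdict(list)
    for i, b in enumerate(batch):
        by_batch[b].append(i)

    x_c = [0.0] * len(expr)  # group indicator, centered within batch
    y_c = [0.0] * len(expr)  # expression, centered within batch
    for idx in by_batch.values():
        x_mean = sum(is_group_b[i] for i in idx) / len(idx)
        y_mean = sum(expr[i] for i in idx) / len(idx)
        for i in idx:
            x_c[i] = is_group_b[i] - x_mean
            y_c[i] = expr[i] - y_mean

    sxx = sum(x * x for x in x_c)
    dof = len(expr) - len(by_batch) - 1  # one parameter per batch plus the group effect
    if sxx == 0 or dof <= 0:
        raise ValueError("group is confounded with batch, or too few cells")

    beta = sum(x * y for x, y in zip(x_c, y_c)) / sxx       # adjusted effect size
    sse = sum((y - beta * x) ** 2 for x, y in zip(x_c, y_c))
    se = math.sqrt(sse / dof / sxx)                         # standard error of beta
    return beta, beta / se, dof


# Toy data: batch d2 sits about 10 units above d1, and group B is over-represented in d2.
expr = [1.0, 1.2, 3.1, 11.1, 13.0, 13.2]
is_group_b = [0, 0, 1, 0, 1, 1]
batch = ["d1", "d1", "d1", "d2", "d2", "d2"]

mean_b = sum(e for e, g in zip(expr, is_group_b) if g) / sum(is_group_b)
mean_a = sum(e for e, g in zip(expr, is_group_b) if not g) / (len(expr) - sum(is_group_b))
naive_diff = mean_b - mean_a  # inflated by the batch offset
beta, t_stat, dof = batch_adjusted_t(expr, is_group_b, batch)
```

In this toy data the naive difference of means is more than twice the within-batch difference, while the adjusted estimate recovers the within-batch difference. A p-value would follow from a Student's t distribution with `dof` degrees of freedom; that step is left out here to keep the sketch free of third-party dependencies.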

@pablo-gar - Thanks for product requirements write up. Some thoughts (@atolopko-czi @atarashansky - please advise as well):

1. To distinguish the 2 populations to compare, you have stated there needs to be: "obs filter", "feature filter", "group A assertion that is subset of the obs filter", "group B assertion that is subset of the obs filter". I want to clarify whether this is how Census users expect to invoke differential expression.

As an alternative, one could specify: "feature filter", "group A obs query", "group B obs query" and get rid of "obs filter". This more explicitly identifies group A and group B. I am not advocating for it. In fact, there are failure modes with the latter in that one could specify a "group A" and "group B" that are entirely disjoint in that there are no groups of obs rows between the 2 populations greater than 1.

So is the approach you laid out: "obs filter", "feature filter", "group A assertion", "group B assertion" the convention that users will be most comfortable with rather than any other convention? For example, the current code requires the user to provide: "obs filter query" and a "treatment" (e.g. sex_ontology_term_id) attribute such that the "treatment" attribute has only 2 distinct values in the result set returned by the "obs filter query" - this effectively determines "group A" and "group B". Is this way intuitive enough for the users?
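
For reference, the treatment-based convention described above can be sketched as follows. This is an illustrative reconstruction from the description in this comment, not the actual code; `split_by_treatment` is a hypothetical name and the rows are toy data.

```python
def split_by_treatment(obs_rows, treatment):
    """Split already-filtered obs rows into two groups by a treatment column.

    Raises ValueError unless the column takes exactly two distinct values.
    """
    values = sorted({row[treatment] for row in obs_rows})
    if len(values) != 2:
        raise ValueError(
            f"{treatment!r} must have exactly 2 distinct values, found {len(values)}"
        )
    group_a = [row for row in obs_rows if row[treatment] == values[0]]
    group_b = [row for row in obs_rows if row[treatment] == values[1]]
    return group_a, group_b


# Rows as they might look after applying the obs filter query (toy data).
filtered = [
    {"soma_joinid": 0, "sex_ontology_term_id": "PATO:0000383"},
    {"soma_joinid": 1, "sex_ontology_term_id": "PATO:0000384"},
    {"soma_joinid": 2, "sex_ontology_term_id": "PATO:0000384"},
]
females, males = split_by_treatment(filtered, "sex_ontology_term_id")
```

With this convention the user never names group A or group B explicitly; which group is which falls out of the sort order of the treatment values, which may or may not be intuitive.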

2. Regarding "The list of genes should have associated statistics encoding significance" - Does the statistic change depending on the methodology we use? For example, if we swap out memento for a t-test or some other fancy method? Should we even care about designing to support the ability to swap implementations now?
3. Related to (2) above, in what order should the genes be returned? For example, the current memento implementation returns genes in descending order of z-score. Is that sufficient? How does this change if we have to swap one implementation for another? Should we even care about designing to support the ability to swap implementations now?
4. To address the performance requirements, we will need to identify a priori queries to benchmark (that is, queries comparing millions of obs and queries comparing hundreds of thousands of obs). If you have such queries already, please provide them.

I think alignment on (1), (2), and (3) is enough to get started with a tech spec

> As an alternative, one could specify: "feature filter", "group A obs query", "group B obs query" and get rid of "obs filter". This more explicitly identifies group A and group B. I am not advocating for it.

Note that you'd still have to verify that each individual query has single-valued treatment column and that the two query results produce rows with different treatment values.

> In fact, there are failure modes with the latter in that one could specify a "group A" and "group B" that are entirely disjoint in that there are no groups of obs rows between the 2 populations greater than 1.

They should be entirely disjoint, but maybe I'm misunderstanding your concern.

> Note that you'd still have to verify that each individual query has single-valued treatment column and that the two query results produce rows with different treatment values.

@atolopko-czi - So are you saying that each group is distinguished by the values in the treatment-column such that the intersection of unique values for treatment-column between groupA and groupB is the empty set? That is, set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values) is {} (empty)?

If so, then that implies that len(treatment.unique_values) across both groups could be any positive integer, say 5, but our current implementation asserts that len(treatment.unique_values) == 2 across both groups, right?

> They should be entirely disjoint, but maybe I'm misunderstanding your concern.

@atolopko-czi - Oh I see. They are entirely disjoint precisely because set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values) == {}, correct?
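
To make the two conditions under discussion concrete, here is a small sketch that separates them: whether the treatment values of the two groups are disjoint, and how many distinct treatment values appear across both groups. `check_groups` is a hypothetical helper and the tissue values are toy data. Note that in real Python the empty set is written `set()`; `{}` is an empty dict, so a literal comparison against `{}` would always be false.

```python
def check_groups(values_a, values_b):
    """Return (disjoint, n_distinct) for the treatment values of two groups."""
    a, b = set(values_a), set(values_b)
    return a.isdisjoint(b), len(a | b)


# Disjoint groups that together span five treatment values.
disjoint, n_distinct = check_groups(
    ["lung", "lung"],
    ["blood", "liver", "kidney", "heart"],
)

# The current implementation's assertion, as described in this thread.
passes_two_value_check = n_distinct == 2
```

Here the groups are disjoint, yet the two-value assertion fails, which is the gap between the two conventions raised in the question above.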

@pablo-gar @atolopko-czi - Regarding performance requirements

> Can complete under 10 min for large comparisons millions of cells.

The new manuscript mentions that this case of 10^6 cells takes 13 min on a single core (non-parallelized) and 2-3 min on 6 cores (parallelized) for the vanilla method (using bootstrapping). I assume most Census users will have a multicore machine available (at least 2 cores, if not 4+).

Is the runtime requirement of 10 min for a non-parallel implementation of the census-memento? (I am calling it the census memento vs vanilla memento to distinguish our implementation from the sampling and bootstrapping implementation in the paper)

chanzuckerberg / cellxgene-census