chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Use subset of covariates for diff expr #942

Closed atolopko-czi closed 7 months ago

atolopko-czi commented 9 months ago

As of 8f3ed5fa0fc05dad381adda79e2cf502fe9e43bc, the memento diff expr method uses all cube dimensions as covariates when performing the computation. However, it may be desirable to let the user specify only a subset of the dimensions as covariates.

To support diff expr using only a subset of dimensions, we must:

  1. Update the cube to replace the sem value (standard error of the mean) with statistics that can be aggregated to compute the sem: sum (sum of expression values) and sumsq (sum of squares of expression values).
  2. Update the diff expr computation to compute sem on the fly using sum and sumsq (how)
  3. Update the compute_all function to take a list of covariates. The computed design matrix should be created for the specified covariates.
  4. Update the query_estimators function to perform the necessary row aggregations to compute the sem for each distinct tuple of covariate/dimension values.
  5. Update the compute_all_estimators_for_gene function to remove the dense_gene_data() function call. See comment.
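
To illustrate why sum and sumsq are enough for steps 1, 2 and 4, here is a minimal sketch of aggregating cube rows over a subset of covariates and recomputing the sem afterwards. It assumes each row also carries a cell count `n`, and it uses the textbook sample-variance formula, which may differ from the estimator memento actually applies. The function name `aggregate_sem`, the row layout, and the example dimension names are hypothetical and not taken from the codebase.

```python
import math
from collections import defaultdict


def aggregate_sem(rows, covariates):
    """Aggregate cube rows over the given covariates and compute the sem per group.

    `rows` is a list of dicts holding dimension values plus the additive
    statistics "n" (cell count, assumed to be present), "sum" and "sumsq".
    Returns a dict mapping each tuple of covariate values to its sem.
    """
    # n, sum and sumsq are all additive, so merging rows is a plain sum.
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for row in rows:
        key = tuple(row[c] for c in covariates)
        acc = totals[key]
        acc[0] += row["n"]
        acc[1] += row["sum"]
        acc[2] += row["sumsq"]

    sem_by_group = {}
    for key, (n, total, total_sq) in totals.items():
        if n < 2:
            # The sample variance is undefined for fewer than two cells.
            sem_by_group[key] = float("nan")
            continue
        mean = total / n
        # Unbiased sample variance; clamp tiny negatives from rounding error.
        variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
        sem_by_group[key] = math.sqrt(variance / n)
    return sem_by_group


# Two cube rows that differ only in a dimension we do not want as a covariate.
rows = [
    {"cell_type": "T cell", "disease": "normal", "n": 3, "sum": 6.0, "sumsq": 14.0},
    {"cell_type": "T cell", "disease": "COVID-19", "n": 2, "sum": 10.0, "sumsq": 52.0},
]
sem_by_cell_type = aggregate_sem(rows, ["cell_type"])
```

A precomputed sem cannot be merged this way, because the standard error of a combined group is not a function of the per-row standard errors alone. That is the reason step 1 stores the additive statistics instead.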