Feature request: addition of data subsampling capability to cell-census API

Ivana-Jelic commented 1 year ago

Motivation

Cell Census data contains a lot of redundancy. Being able to subsample the data efficiently (randomly, or based on heterogeneity, data quality, or other characteristics and methods) for training ML models would be useful to the broad ML community. Furthermore, as Cell Census continues to grow, providing this feature will become increasingly important to truly democratize access to our data.

Definition of Done

Implement data subsampling methods based on user-specified metadata (e.g., cell_type and/or assay).

Example implementation:

# Proposed interface: sampler, strata, and sample_size are the requested new arguments.
sample = cell_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general == 'lung'",
    sampler="random",  # choose from different options, followed by relevant arguments
    strata="cell_type",
    sample_size=0.10,
)

Example queries: "10% of all T cells per available assay" or "5% of all cells from lung tissue per cell type".

Tasks

P0: implement stratified random sampling based on user-specified strata
P1: implement sampling based on data heterogeneity per stratum (for example, the 5% most representative cells of a given cell type)
P1: implement sampling based on the scSampler method (pending POC, which is in progress)
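
To make the "most representative cells" task more concrete, here is one possible reading of it, sketched under loud assumptions: "representative" is taken to mean "closest to the stratum centroid in some embedding", and the helper name most_representative, the embedding array, and the toy data are all hypothetical. This is not the scSampler method and not an existing Census API.

```python
import numpy as np
import pandas as pd


def most_representative(obs, embedding, strata, frac):
    """Return row positions of the cells closest to their stratum centroid.

    Assumption: `embedding` is a (n_cells, n_dims) array aligned row by row
    with `obs`, and "representative" means small distance to the centroid.
    """
    keep = []
    # .indices maps each stratum label to the integer row positions in obs
    for _, idx in obs.groupby(strata).indices.items():
        points = embedding[idx]
        dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
        n_keep = max(1, int(round(frac * len(idx))))
        keep.append(idx[np.argsort(dist)[:n_keep]])
    return np.sort(np.concatenate(keep))


# Toy data: two cell types, each with one obvious outlier (rows 3 and 9).
obs = pd.DataFrame({"cell_type": ["A"] * 5 + ["B"] * 5})
embedding = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [-20.0, -20.0],
])
picked = most_representative(obs, embedding, "cell_type", frac=0.4)
```

Other definitions of heterogeneity (density-based, diversity-preserving) would select different cells; the centroid rule is only the simplest to illustrate.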

Ivana-Jelic commented 1 year ago

Interim solution: subsample the filtered dataset using the pandas df.groupby(...).sample(frac=0.1, random_state=123) method. This requires working with a large subset of Cell Census locally.
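
A minimal sketch of that interim approach, assuming the filtered obs (cell metadata) table has already been loaded into a pandas DataFrame. The toy table, its values, and the helper name stratified_sample are made up for illustration; only groupby(...).sample(...) is the actual pandas call.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a filtered obs table pulled from Cell Census.
rng = np.random.default_rng(0)
obs = pd.DataFrame({
    "soma_joinid": np.arange(1000),
    "cell_type": rng.choice(["T cell", "B cell", "macrophage"], size=1000),
    "assay": rng.choice(["assay_1", "assay_2"], size=1000),
})


def stratified_sample(obs, strata, frac, seed=123):
    """Draw the same fraction of rows from every stratum, reproducibly."""
    return obs.groupby(strata).sample(frac=frac, random_state=seed)


# "10% per cell type": every cell type keeps roughly a tenth of its cells.
sample = stratified_sample(obs, "cell_type", frac=0.10)

# IDs of the sampled cells, usable to fetch only those cells afterwards.
sampled_ids = sample["soma_joinid"].to_numpy()
```

The sampled soma_joinid values could then be used to request only those cells from the Census (for example through a coordinate-based argument such as obs_coords in get_anndata, if the installed version provides it), so that the expression matrix itself never has to be downloaded in full.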

atolopko-czi commented 1 year ago

Ultimately, I suspect we might want this supported at the TileDB-SOMA level in ExperimentAxisQuery. This would naturally bubble up to Census API convenience methods, like get_anndata().

chanzuckerberg / cellxgene-census