chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Feature request: addition of data subsampling capability to cell-census API #304

Open Ivana-Jelic opened 1 year ago

Ivana-Jelic commented 1 year ago

Motivation

There is a lot of redundancy in Cell Census data. Being able to subsample it efficiently (at random, or based on heterogeneity, data quality, or other characteristics and methods) for training ML models would be useful to the broad ML community. Furthermore, as Cell Census continues to grow, providing this feature will become increasingly important for truly democratizing access to our data.

Definition of Done

Implement data subsampling methods based on user-specified metadata (e.g., cell_type and/or assay).

Example call (proposed interface; the sampler, strata, and sample_size arguments do not exist yet):

sample = cell_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    obs_value_filter = "tissue_general == 'lung'",
    sampler = "random",  # choose from different options, followed by relevant arguments
    strata = "cell_type",
    sample_size = 0.10)

Example queries: "10% per available assay of all T cells" or "5% per cell type of all cells from lung tissue".

Ivana-Jelic commented 1 year ago

Interim solution: subsample the filtered dataset with pandas, using df.groupby(...).sample(frac=0.1, random_state=123). The drawback is that this requires working with a large subset of Cell Census locally.
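
A minimal sketch of that interim approach, assuming the filtered cell metadata is already in a pandas DataFrame. The stratified_sample helper and the synthetic obs table are hypothetical stand-ins for illustration, not part of the Cell Census API; the column names mirror the metadata fields mentioned in the request.

```python
import pandas as pd


def stratified_sample(obs: pd.DataFrame, strata: str, frac: float, seed: int = 123) -> pd.DataFrame:
    """Draw the same fraction of rows from every group of `strata` (hypothetical helper)."""
    return obs.groupby(strata).sample(frac=frac, random_state=seed)


# Synthetic stand-in for the cell metadata of an already filtered query
# (in practice this would be the obs table of the data fetched from Cell Census).
obs = pd.DataFrame(
    {
        "soma_joinid": range(300),
        "cell_type": ["T cell"] * 200 + ["B cell"] * 100,
        "assay": ["10x 3' v3", "Smart-seq2"] * 150,
    }
)

# "10% per cell type": each cell type contributes 10% of its own rows.
sample = stratified_sample(obs, strata="cell_type", frac=0.10)
print(sample["cell_type"].value_counts())
```

Passing strata="assay" instead gives the "per available assay" variant of the example queries. Because sampling happens per group, rare cell types keep their proportional share instead of being crowded out by abundant ones.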

atolopko-czi commented 1 year ago

Ultimately, I suspect we might want this supported at the TileDB-SOMA level in ExperimentAxisQuery. This would naturally bubble up to Census API convenience methods, like get_anndata().