Open Ivana-Jelic opened 1 year ago
Interim solution: subsample the filtered dataset using the pandas df.groupby(...).sample(frac=0.1, random_state=123) function. This requires working with a large subset of Cell Census locally.
Ultimately, I suspect we might want this supported at the TileDB-SOMA level in ExperimentAxisQuery. This would naturally bubble up to Census API convenience methods, like get_anndata().
Motivation
There is a lot of redundancy in Cell Census data. Being able to subsample data efficiently (at random, or based on heterogeneity, data quality, or other characteristics and methods) for training ML models would be useful to the broad ML community. Furthermore, as Cell Census continues to grow, providing this feature will become increasingly important for truly democratizing access to our data.
Definition of Done
Implement data subsampling methods based on user-specified metadata (e.g., cell_type and/or assay, etc.)
Example implementation:
Example queries: "10% per available assay of all T cells" or "5% per cell type of all cells from lung tissue".
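For illustration, the first example query could be sketched with the pandas interim approach mentioned above. This is only a rough sketch, not a proposed API: it assumes the obs metadata is already loaded into a pandas DataFrame, the helper name subsample_per_group is hypothetical, and the toy values below are made up.

```python
import pandas as pd


def subsample_per_group(obs: pd.DataFrame, by, frac: float, seed: int = 123) -> pd.DataFrame:
    """Hypothetical helper: draw `frac` of the rows from each group defined by `by`."""
    # observed=True avoids empty groups when the grouping columns are categorical.
    return obs.groupby(by, observed=True).sample(frac=frac, random_state=seed)


# Toy stand-in for obs metadata (made-up values, not real Census data).
obs = pd.DataFrame(
    {
        "soma_joinid": range(120),
        "cell_type": ["T cell"] * 100 + ["B cell"] * 20,
        "assay": ["10x 3' v3"] * 60 + ["Smart-seq2"] * 40 + ["10x 3' v3"] * 20,
    }
)

# "10% per available assay of all T cells": filter first, then sample per assay.
t_cells = obs[obs["cell_type"] == "T cell"]
sampled = subsample_per_group(t_cells, by="assay", frac=0.1)

print(sampled["assay"].value_counts())
```

The second example query would follow the same pattern: filter on tissue, then pass by="cell_type" and frac=0.05. Note that this only selects obs rows; the filtered metadata still has to be held in memory, which is the limitation of the interim solution described above.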
Tasks