chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Provide a method for users to get a subsample of data #1026

Open hthomas-czi opened 4 months ago

hthomas-czi commented 4 months ago

If a user's query returns a large amount of data, a subsample lets them perform quicker analyses on a subset of that data.

User Quote

I’d like a way to get subsampled datasets.

Notes

ebezzi commented 3 months ago

Requires discovery work to determine what type of subsampling we should support (randomized, stratified with respect to certain variables, ...).

ivirshup commented 2 months ago

I would be curious to hear what the user wants beyond sc.pp.subsample. And what value can be added by including it here. For truly random subsampling, you'll probably need to download the same amount of data anyways, right?
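For reference, the in-memory random subsampling that `sc.pp.subsample` performs can be sketched with plain NumPy (the array shape and seed here are illustrative, not from Census):

```python
import numpy as np

# Stand-in for a query result already loaded into memory (obs x var).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))

# Choose K row indices uniformly at random, without replacement,
# then take those rows -- the essence of random obs subsampling.
K = 1_000
idx = rng.choice(X.shape[0], size=K, replace=False)
subsample = X[idx]
```

Note this only reduces the data *after* it has been fully downloaded and loaded, which is the point being made about I/O.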

prathapsridharan commented 2 months ago

For truly random subsampling, you'll probably need to download the same amount of data anyways, right?

Adding to what @ivirshup says here:

Let N be the total number of obs rows, K the number of random obs rows read, and T the number of tiles the rows are batched into, where reading a single row in a tile requires reading the entire tile (and therefore affects I/O efficiency). Then the expected number of tiles read when K random rows are subsampled is:

T * (1 - ((T - 1)/T)^K)

If K is some fraction of N (as is the case for subsampling) and N is reasonably large, then this equation effectively evaluates to ~T. So subsampling here, as @ivirshup points out, would not help I/O performance.

But maybe it could have a lower memory footprint, since the data held in memory would be only the subsampled portion of the query rather than the entire query result?

ivirshup commented 2 months ago

This is also something where being able to return a dask array would solve the problem. If a dask array is returned and the user subsamples via scanpy before bringing it into memory, peak memory would still be minimal without needing to implement subsampling in Census.

prathapsridharan commented 2 months ago

This is also something where being able to return a dask array would also solve the problem

@ivirshup - Which "problem" is being solved with dask array here? Are you saying that I/O efficiency would be improved? If so I would be interested to know how that could be possible with an example (we can talk in real time about this).

Or are you saying that having a dask array means we can simply use scanpy and just avoid the work of having to implement subsampling in census?

ivirshup commented 2 months ago

The problem of memory overhead. E.g. you are able to do the subsampling with potentially lower peak memory than loading the entire object into memory and then subsampling. Instead, each chunk is subsampled independently. I believe this is the same as the memory footprint you were describing.

I'm saying that if a dask array were returned, scanpy's subsampling would just operate on it, which gives us the lower memory footprint without having to implement any sub-sampling functionality in the census codebase.
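The chunk-wise strategy described above can be sketched without dask, using a plain iterator of NumPy chunks to stand in for a lazily-evaluated array (the function name, chunk sizes, and fraction are hypothetical, not Census or scanpy API):

```python
import numpy as np

def subsample_chunked(chunks, fraction, seed=0):
    """Subsample each chunk independently, so peak memory is roughly
    one chunk plus the accumulated kept rows -- never the full array."""
    rng = np.random.default_rng(seed)
    kept = []
    for chunk in chunks:  # each chunk could be a dask block or a stream batch
        k = int(round(chunk.shape[0] * fraction))
        idx = rng.choice(chunk.shape[0], size=k, replace=False)
        kept.append(chunk[idx])
    return np.concatenate(kept)

# Four chunks of 100 rows each, keep 10% of each chunk -> 40 rows total.
chunks = [np.ones((100, 5)) for _ in range(4)]
out = subsample_chunked(chunks, fraction=0.1)
```

This mirrors what scanpy's subsampling over a dask-backed array would achieve: the subsampling decision is made per block, so the full result never needs to be materialized at once.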