Open hthomas-czi opened 4 months ago
Requires discovery work to determine what type of subsampling we should support (randomized, stratified with respect to certain variables, ...).
I would be curious to hear what the user wants beyond `sc.pp.subsample`, and what value can be added by including it here. For truly random subsampling, you'll probably need to download the same amount of data anyways, right?
> For truly random subsampling, you'll probably need to download the same amount of data anyways, right?
Adding to what @ivirshup says here:
Let `N` be the total number of `obs` rows, `K` the number of random `obs` rows read, and `T` the number of tiles the rows are batched into, where reading a single row in a tile requires reading the entire tile (and therefore affects I/O efficiency). Then the expected number of tiles read when `K` random rows are subsampled is:

T * (1 - ((T - 1)/T)^K)

If `K` is some fraction of `N` (as is the case for subsampling) and `N` is reasonably large, this expression effectively evaluates to ~`T`. So subsampling here, as @ivirshup points out, would not help I/O performance.
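To make the asymptotics concrete, here is a quick numerical check of the formula above (the row and tile counts are illustrative, not taken from the census schema):

```python
def expected_tiles_read(T: int, K: int) -> float:
    """Expected number of distinct tiles touched when K rows are drawn
    uniformly at random from rows batched into T equal-sized tiles."""
    return T * (1 - ((T - 1) / T) ** K)

# Illustrative numbers: 1M rows batched into 1,000 tiles, subsample 10% of rows.
N, T = 1_000_000, 1_000
K = N // 10
print(expected_tiles_read(T, K))  # ~1000.0: essentially every tile gets read
```

Even at a 1% subsample (`K = 10_000`), the expected count is still ~999.95 of the 1,000 tiles, which is why subsampling doesn't reduce I/O.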
But maybe it could have a lower memory footprint, since the data in memory would be the subsampled data of the query and not the entire result of the query, right?
This is also something where being able to return a dask array would solve the problem. If a dask array is returned and the user subsamples via scanpy before bringing it into memory, peak memory would still be minimal without needing to implement subsampling in census.
> This is also something where being able to return a dask array would also solve the problem
@ivirshup - Which "problem" is being solved with dask array here? Are you saying that I/O efficiency would be improved? If so I would be interested to know how that could be possible with an example (we can talk in real time about this).
Or are you saying that having a dask array means we can simply use scanpy and just avoid the work of having to implement subsampling in census?
The problem of memory overhead. E.g. you are able to do the subsampling with potentially lower peak memory than loading the entire object into memory and then subsampling. Instead, each chunk is subsampled independently. I believe this is the same as the memory footprint you were describing.
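A rough sketch of that chunk-wise idea in plain NumPy (the generator below is a hypothetical stand-in for whatever lazy/dask-backed reader the census would expose; none of these names come from the census API):

```python
import numpy as np

def subsample_chunks(chunks, fraction, seed=0):
    """Subsample each chunk independently, so only one chunk plus the
    accumulated subsample sits in memory at a time. Note: per-chunk
    sampling is only approximately equivalent to a single global
    without-replacement draw."""
    rng = np.random.default_rng(seed)
    kept = []
    for chunk in chunks:
        k = int(round(fraction * chunk.shape[0]))
        idx = rng.choice(chunk.shape[0], size=k, replace=False)
        kept.append(chunk[np.sort(idx)])
    return np.concatenate(kept)

# Hypothetical lazy reader yielding a query result chunk by chunk.
def fake_chunks(n_chunks=10, rows=1_000, cols=5):
    for _ in range(n_chunks):
        yield np.ones((rows, cols))

sub = subsample_chunks(fake_chunks(), fraction=0.1)
print(sub.shape)  # (1000, 5): 10% of the 10,000 rows
```

With a dask-backed `X`, the same effect falls out of fancy-indexing the array with a random row index before calling `.compute()`.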
I'm saying that if a dask array were returned, scanpy's subsampling would just operate on that, which gives us the lower memory footprint without having to deal with any subsampling functionality in the census codebase.
If the user's query returns a large amount of data, this helps them perform quicker analyses by using a subset of that data.