A specific comment from the above Slack thread is recapitulated here because it is useful:
`K` is the number of `data_points`. A `data_point` can be viewed as either a single cell or a chunk of contiguous rows. The chunk-of-contiguous-rows conception is used here for I/O efficiency, since a read query becomes more efficient the more data it fetches per request.
The thinking is that picking `K` random sections of the entire corpus allows us to get a certain desired coverage of datasets. With the chunks (of contiguous rows) concept, you can think of the entire census as split evenly into sections/chunks. Each section/chunk is then a `data_point`.
Even when you don't actively do any chunking, you can view the entire census as a collection of sections/chunks where each chunk is 1 row long. But that would be inefficient in terms of I/O during reads from the storage medium. So make each chunk longer than 1 row (for I/O efficiency) and determine the chunk size from the memory budget, i.e., the number of rows that can actually be loaded into memory. Thus `chunk_size = memory_budget_num_rows // K`.
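For concreteness, a minimal illustration of that arithmetic, using the memory budget and `K` that come up later in this thread:

```python
# Derive the chunk size from the memory budget, per the relation above.
memory_budget_num_rows = 128_000  # rows that fit in memory at once
K = 2_000                         # number of random chunks (data_points)
chunk_size = memory_budget_num_rows // K
print(chunk_size)  # -> 64
```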
A calculation of the expected number of datasets represented when `K` random data points are selected from the census, using the real distribution of data points (observations) across `dataset_id`s, was done by @pablo-gar:
| K | expected_n_datasets |
|------:|----------:|
| 500 | 197.9385 |
| 1000 | 285.0254 |
| 1500 | 342.0426 |
| 2000 | 383.4680 |
| 2500 | 415.3354 |
| 3000 | 440.7713 |
| 3500 | 461.6108 |
| 4000 | 479.0251 |
| 4500 | 493.8070 |
| 5000 | 506.5175 |
| 5500 | 517.5673 |
| 6000 | 527.2646 |
| 6500 | 535.8462 |
| 7000 | 543.4966 |
| 7500 | 550.3621 |
| 8000 | 556.5600 |
| 8500 | 562.1854 |
| 9000 | 567.3163 |
| 9500 | 572.0169 |
| 10000 | 576.3408 |
| 10500 | 580.3331 |
| 11000 | 584.0316 |
| 11500 | 587.4689 |
| 12000 | 590.6724 |
| 12500 | 593.6661 |
| 13000 | 596.4705 |
| 13500 | 599.1037 |
| 14000 | 601.5811 |
| 14500 | 603.9167 |
| 15000 | 606.1226 |
| 15500 | 608.2095 |
| 16000 | 610.1870 |
| 16500 | 612.0636 |
| 17000 | 613.8470 |
| 17500 | 615.5440 |
| 18000 | 617.1608 |
| 18500 | 618.7030 |
| 19000 | 620.1757 |
| 19500 | 621.5835 |
| 20000 | 622.9306 |
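The exact method behind these numbers isn't shown here; the following is a minimal sketch of one way to compute such a table, assuming draws with replacement from the observed per-dataset counts (a close approximation when `K` is much smaller than the ~60M total observations). The `counts` array below is made up for illustration:

```python
import numpy as np

def expected_n_datasets(counts: np.ndarray, k: int) -> float:
    """Expected number of distinct datasets hit by k random draws.

    counts[i] is the number of observations in dataset i. By linearity
    of expectation, E[Y] = sum_i (1 - (1 - p_i)**k), where p_i is the
    fraction of all observations that belong to dataset i.
    """
    p = counts / counts.sum()
    return float(np.sum(1.0 - (1.0 - p) ** k))

# Illustrative usage with a made-up heavy-tailed distribution over 567 datasets:
rng = np.random.default_rng(0)
counts = rng.pareto(1.5, size=567) * 1e4
for k in (500, 2000, 20000):
    print(k, expected_n_datasets(counts, k))
```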
Here is a histogram of the distribution of observations across `dataset_id`s in the census (computed by @pablo-gar):
There is potentially an opportunity to implement a shuffling algorithm that closely approximates a true random-sample shuffle. The following recapitulates the Slack discussion.
The high-level description of the algorithm follows. Let's call this method `scatter_gather_shuffle`. It attempts to strike a balance between randomness and good I/O performance, reading chunks of `chunk_size` contiguous rows rather than individual data points, which would be less efficient in terms of I/O.

The algorithm is stated in a way such that, if accepted, it could potentially be packaged in a separate tensor library (like pytorch) or a dataloader library, and thus be generally useful to many types of ML training workloads.
The following algorithm assumes that the data is uniformly distributed across some buckets. However, it is possible for the algorithm to take in a real distribution (a probability mass function) or an analytic distribution (e.g., exponential, Poisson) and perform the expectation calculations based on that input distribution. Also, the description of the algorithm is set in the context of `cellxgene_census`, where the data is bucketed by `dataset_id`; however, the algorithm is generally applicable to any dataset that naturally falls into buckets.

The central problem is to determine the number of random chunks to gather across the data, and the size of each such chunk, so that concatenating and shuffling them yields a sequence of data points that is satisfactorily random.
"Satisfactorily Random" is something that the user must define here. One definition of "satisfactorily random" that is simple to encode and generally useful is if the user knows how data points are bucketed, then a "satisfactorily random" sequence of K data_points would represent some desired fraction of the buckets. To put it more concretely, the census has 60 million observations (data_points) distributed across 567 datasets (the buckets). If K observations are drawn at random from the entire corpus what is the expected number of datasets covering these K random points? I think we could work that out analytically:
Thus for `K = 500`, the expected number of datasets represented is `E[Y] = 567 * (1 - (566/567)**500) ≈ 332`. For `K = 2000`, `E[Y] ≈ 550` (almost all datasets); this is how I arrived at the 2000 random chunks. Since we want good I/O efficiency, dividing the memory budget (specified in number of rows) by the number of chunks gives us the chunk size: `128_000 // 2000 = 64`.
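A quick numeric check of those values (a throwaway snippet, not part of the proposal):

```python
# Expected dataset coverage under the uniform-buckets assumption.
D = 567  # number of datasets (buckets)
for K in (500, 2000):
    ey = D * (1 - ((D - 1) / D) ** K)
    print(K, round(ey))  # 500 -> 332, 2000 -> 550

chunk_size = 128_000 // 2_000  # memory budget (rows) // number of chunks
print(chunk_size)  # -> 64
```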
Pseudocode for the algorithm:
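The pseudocode itself isn't reproduced above, so here is a minimal Python sketch of the scheme as described (a sketch under the assumptions stated earlier, not a definitive implementation). The `read_rows` parameter is a hypothetical stand-in for whatever chunked read the underlying store provides (e.g., a SOMA/TileDB slice read):

```python
import random
from typing import Callable, Iterator, Optional, Sequence

def scatter_gather_shuffle(
    n_rows: int,                                # total rows in the corpus
    memory_budget_num_rows: int,                # rows that fit in memory
    k: int,                                     # random chunks to gather
    read_rows: Callable[[int, int], Sequence],  # (start, stop) -> rows
    rng: Optional[random.Random] = None,
) -> Iterator:
    """Yield rows in an approximately shuffled order.

    1. View the corpus as fixed-size chunks of contiguous rows.
    2. Scatter: pick k chunk indices uniformly at random.
    3. Gather: read those chunks and concatenate them in memory.
    4. Shuffle the concatenated rows and yield them one at a time.
    """
    rng = rng or random.Random()
    chunk_size = memory_budget_num_rows // k
    n_chunks = n_rows // chunk_size
    chosen = rng.sample(range(n_chunks), k=min(k, n_chunks))

    buffer = []
    for c in chosen:
        start = c * chunk_size
        buffer.extend(read_rows(start, start + chunk_size))

    rng.shuffle(buffer)
    yield from buffer
```

A full training epoch would repeat this over successive random groups of `k` chunks until the corpus is exhausted; the sketch shows a single scatter/gather pass for brevity.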