Memento cube optimizations

atolopko-czi commented 9 months ago

[x] Benchmark and record time/cost of building and storing cube
[ ] With PM, determine if additional build or storage costs need to be optimized. Determine targets.
[ ] Optimize build efficiency, as needed
[x] Optimize storage size, as needed

atolopko-czi commented 8 months ago

As of commit c311ec8105a3a2fead0ac9a6d67031ae63f4bd89, with ~22% of cells processed in a full run against local Census 2023-10-23 on an r6idn.24xlarge, the estimated total time is stable at ~14 hrs:

2023-12-21 20:07:47 69623   INFO     Pass 2: Completed 518 of 3382 batches, batches=15.3%, cells=21.6%, elapsed=3:01:12.931278, est. total time=13:57:36.498477, est. remaining time=10:56:23.567199

Ultimately, it took 20h 40m. The earlier estimates were likely inaccurate due to high variability in per-cell nnz counts, with some ranges of cells having much higher nnz:

2023-12-22 13:46:14 69623   INFO     Pass 2: Completed 3382 of 3382 batches, batches=100.0%, cells=100.0%, elapsed=20:39:39.421186, est. total time=20:39:39.421186, est. remaining time=0:00:00
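To illustrate why the estimate drifted, here is a minimal sketch of a linear extrapolation. It assumes the logged estimate is simply elapsed time divided by the fraction of cells processed, which matches the numbers in the first log line but is not confirmed from the builder code. The function name `estimate_total` is hypothetical.

```python
from datetime import timedelta


def estimate_total(elapsed: timedelta, fraction_done: float) -> timedelta:
    """Linearly extrapolate total run time from the fraction of work done."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return timedelta(seconds=elapsed.total_seconds() / fraction_done)


# Values taken from the 2023-12-21 20:07:47 log line (percentages are rounded).
elapsed = timedelta(hours=3, minutes=1, seconds=12.931278)

by_cells = estimate_total(elapsed, 0.216)    # fraction of cells processed
by_batches = estimate_total(elapsed, 0.153)  # fraction of batches processed

print(f"extrapolated by cells:   {by_cells.total_seconds() / 3600:.1f} hr")
print(f"extrapolated by batches: {by_batches.total_seconds() / 3600:.1f} hr")
```

Any linear extrapolation assumes a constant cost per unit of work. If per-cell nnz varies widely, the fraction of cells processed is a poor proxy for the fraction of work done, which is consistent with the gap between the ~14 hr estimate and the actual run time.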

Cube size:

$ du -sh /mnt/census/estimators-cube-profile
64G     /mnt/census/estimators-cube-profile

atolopko-czi commented 8 months ago

Profiler flamegraph for single obs batch, for command:

$ PROFILE_MODE=true python -O -m estimators_cube_builder --cube-uri /mnt/census/estimators-cube-profile-2 --experiment-uri /mnt/census/2023-10-23/census_data/homo_sapiens 2>&1 | tee /mnt/census/build_profile_2.log
2023-12-21 19:08:54 378819  INFO     Pass 1: Processing 36227903 cells and 60664 genes
2023-12-21 19:08:54 378819  INFO     Pass 1: Compute Approx Size Factors
2023-12-21 19:09:32 378819  INFO     Saved `obs_with_size_factor` TileDB Array
2023-12-21 19:09:32 378819  INFO     Pass 2: Compute Estimators
2023-12-21 19:09:33 378819  INFO     Pass 2: Processing 36227903 cells and 60664 genes
2023-12-21 19:10:12 378819  INFO     Pass 2: Created new estimators cube
2023-12-21 19:11:08 378819  INFO     Pass 2: Start X batch 1, cells=8254, nnz=12209544
2023-12-21 19:45:52 378819  INFO     Pass 2: End X batch 1, cells=8254, nnz=12209544
2023-12-21 19:46:32 378819  INFO     Pass 2: Writing to estimator cube.
2023-12-21 19:47:10 378819  INFO     Validating estimators cube
2023-12-21 19:53:30 378819  INFO     Validation complete

atolopko-czi commented 8 months ago

On the atol/memento/880-cube-builder-optimizations--normalize-obs-dims branch, I introduced an obs_groups array that splits the categorical dims columns out of the estimators cube. The estimators cube now has an obs_group_joinid column alongside the original estimators columns. Queries are run against the obs_groups array, and the matching obs_group_joinid values are then used to join into the estimators cube. This allows for a drastically reduced cube size, and potentially faster grouping operations during cube building. The impact on DE querying performance has not been measured, but I expect it to improve as well: the filtering query runs against a much smaller array, and the subsequent fetch by obs_group_joinid from the estimators array should be very efficient, since that is the array's primary dimension.
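The normalization can be sketched with pandas on a toy table. This is only an illustration of the idea, not the cube builder's code: the real cube is stored as TileDB arrays, and the dims `cell_type` and `tissue`, the `feature_id` column, and all values below are made up for the example.

```python
import pandas as pd

# Denormalized cube: categorical dims are repeated on every estimator row.
cube = pd.DataFrame({
    "cell_type": ["T cell", "T cell", "B cell", "B cell"],
    "tissue": ["lung", "lung", "lung", "blood"],
    "feature_id": ["ENSG01", "ENSG02", "ENSG01", "ENSG01"],
    "n_obs": [120, 95, 40, 210],
    "mean": [1.5, 0.2, 3.1, 0.7],
    "sem": [0.10, 0.05, 0.30, 0.02],
})
dims = ["cell_type", "tissue"]

# obs_groups: one row per distinct combination of dims, keyed by a join id.
obs_groups = cube[dims].drop_duplicates().reset_index(drop=True)
obs_groups["obs_group_joinid"] = obs_groups.index

# estimators: the dims columns are replaced by the single join id column.
estimators = cube.merge(obs_groups, on=dims).drop(columns=dims)

# Query: filter the small obs_groups table, then join into estimators.
selected = obs_groups[obs_groups["tissue"] == "lung"]
result = selected.merge(estimators, on="obs_group_joinid")
print(result)
```

The size saving comes from storing each combination of dims once in obs_groups rather than once per estimator row.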

Full cube build test is in progress.

atolopko-czi commented 8 months ago

Profiling for commit 06f8d87:

$ PROFILE_MODE=1 python -O -m estimators_cube_builder --cube-uri /mnt/census/estimators-cube-profile-single-06f8d87/ --experiment-uri /mnt/census/2023-10-23/census_data/homo_sapiens 2>&1 | tee /mnt/census/build_profile_single-06f8d87.log
2024-01-04 13:06:07 2581260 INFO     Pass 1: Processing 36227903 cells and 60664 genes
2024-01-04 13:06:07 2581260 INFO     Pass 1: Compute Approx Size Factors
2024-01-04 13:06:59 2581260 INFO     Saved `obs_with_size_factor` TileDB Array
2024-01-04 13:06:59 2581260 INFO     Pass 2: Compute Estimators
2024-01-04 13:07:00 2581260 INFO     Pass 2: Processing 36227903 cells and 60664 genes
2024-01-04 13:07:19 2581260 INFO     Pass 2: Computing obs groups
2024-01-04 13:08:35 2581260 INFO     Pass 2: Creating new estimators cube
2024-01-04 13:08:55 2581260 INFO     Pass 2: Starting estimators computation
2024-01-04 13:09:19 2581260 INFO     Pass 2: Start X batch 1, cells=8254, nnz=12209544
2024-01-04 13:54:05 2581260 INFO     Pass 2: End X batch 1, cells=8254, nnz=12209544
2024-01-04 13:54:05 2581260 INFO     Pass 2: Writing to estimator cube.
2024-01-04 13:54:13 2581260 INFO     Validating estimators cube
2024-01-04 14:01:00 2581260 INFO     Validation complete

atolopko-czi commented 8 months ago

As of 06f8d87, run time is essentially unchanged at 20h 25m, but cube size is down from 64G to 23G:

2024-01-04 16:23:51 2954    INFO     Pass 2: Completed 3382 of 3382 batches, batches=100.0%, cells=100.0%, elapsed=20:25:01.332278, est. total time=20:25:01.332278, est. remaining time=0:00:00

$ du -sh estimators-cube-profile-06f8d87/
23G     estimators-cube-profile-06f8d87/

The arrays are not consolidated, but the disk space change should be negligible after consolidation.

Next step would be to try 32-bit floats instead of 64-bit floats.

atolopko-czi commented 8 months ago

Have merged the current work to the epic branch. Additional speed/size optimizations will be addressed on new branches.

atolopko-czi commented 8 months ago

Rewriting the estimators array to use 32-bit floats reduced size from 23GB to 13GB:

$ du -sh /mnt/census/estimators-cube-profile-06f8d87{,.float32}/estimators/
23G     /mnt/census/estimators-cube-profile-06f8d87/estimators/
13G     /mnt/census/estimators-cube-profile-06f8d87.float32/estimators/
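As a rough check on the float32 trade-off, the NumPy sketch below compares raw in-memory size and rounding error for a synthetic column. The data is randomly generated for illustration and is not taken from the cube. On-disk TileDB sizes also depend on compression and on non-float columns, so the on-disk ratio need not be exactly one half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one float64 estimator column (positive values).
mean64 = rng.lognormal(size=1_000_000)
mean32 = mean64.astype(np.float32)

# Raw in-memory size halves when going from 8-byte to 4-byte floats.
size_ratio = mean32.nbytes / mean64.nbytes

# Worst-case relative rounding error introduced by the downcast.
rel_err = np.max(np.abs(mean32.astype(np.float64) - mean64) / mean64)

print(f"size ratio: {size_ratio}")
print(f"max relative error: {rel_err:.2e}")
```

float32 keeps roughly 7 significant decimal digits. Whether that is sufficient for the downstream DE computations is a separate question that this sketch does not answer.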

atolopko-czi commented 8 months ago

There were actually more estimators that could be removed (only n_obs, mean, and sem are used). After removing them, switching to float32, and consolidating, the estimators array is now 8.3GB:

$ du -sh /mnt/census/estimators-cube-profile-06f8d87/estimators{,.float32-min-dims}
23G     /mnt/census/estimators-cube-profile-06f8d87/estimators
8.3G    /mnt/census/estimators-cube-profile-06f8d87/estimators.float32-min-dims

atolopko-czi commented 8 months ago

As of commit 70a7705:

Cube includes only these estimators: n_obs, mean, sem
Size: 17G (float64); with float32, the cube would be < 10G
Build time: 17 hr using r6id.24xlarge (96 vCPUs, 768 GiB, $7/hr)
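For the build cost, a back-of-the-envelope calculation from the figures above, assuming the quoted $7/hr applies for the whole run and ignoring storage and data transfer charges:

```python
# Figures quoted in the comment above; the hourly rate is approximate.
build_hours = 17
hourly_rate_usd = 7

build_cost_usd = build_hours * hourly_rate_usd
print(f"~${build_cost_usd} per full cube build")
```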

chanzuckerberg / cellxgene-census

Memento cube optimizations #880