Open atolopko-czi opened 9 months ago
As of commit c311ec8105a3a2fead0ac9a6d67031ae63f4bd89, after ~22% completion of full run on local Census 2023-10-23, running on r6idn.24xlarge, estimated time is stable at ~14 hrs:
2023-12-21 20:07:47 69623 INFO Pass 2: Completed 518 of 3382 batches, batches=15.3%, cells=21.6%, elapsed=3:01:12.931278, est. total time=13:57:36.498477, est. remaining time=10:56:23.567199
Ultimately, it took ~20 hrs 40 min (the estimates were likely inaccurate due to high variability in per-cell nnz counts, with some ranges of cells having much higher nnz):
2023-12-22 13:46:14 69623 INFO Pass 2: Completed 3382 of 3382 batches, batches=100.0%, cells=100.0%, elapsed=20:39:39.421186, est. total time=20:39:39.421186, est. remaining time=0:00:00
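To illustrate why the estimate drifted, here is a minimal sketch of linear time extrapolation applied to the numbers in the first log line. The function name `estimate_total` is hypothetical and is not taken from the cube builder. The fractions are the rounded percentages from the log, so the results are approximate.

```python
from datetime import timedelta


def estimate_total(elapsed: timedelta, fraction_done: float) -> timedelta:
    """Linear extrapolation: total time = elapsed / fraction of work done."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return timedelta(seconds=elapsed.total_seconds() / fraction_done)


# Values copied from the 2023-12-21 20:07:47 log line.
elapsed = timedelta(hours=3, minutes=1, seconds=12.931278)

by_cells = estimate_total(elapsed, 0.216)    # cells=21.6%
by_batches = estimate_total(elapsed, 0.153)  # batches=15.3%

print("extrapolated from cells:  ", by_cells)
print("extrapolated from batches:", by_batches)
```

The cell-based figure comes out at roughly 14 hours, which is close to the logged estimate, so the logged value appears to be extrapolated from the fraction of cells processed. The batch-based figure comes out at roughly 19.7 hours. That is one data point only, but it is consistent with the observation that cell count is a poor proxy for work when nnz per cell varies. Weighting progress by nnz processed might track run time better; that is an untested assumption.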
Cube size:
$ du -sh /mnt/census/estimators-cube-profile
64G /mnt/census/estimators-cube-profile
Profiler flamegraph for single obs batch, for command:
PROFILE_MODE=true python -O -m estimators_cube_builder --cube-uri /mnt/census/estimators-cube-profile-2 --experiment-uri /mnt/census/2023-10-23/census_data/homo_sapiens 2>&1 | tee /mnt/census/build_profile_2.log
2023-12-21 19:08:54 378819 INFO Pass 1: Processing 36227903 cells and 60664 genes
2023-12-21 19:08:54 378819 INFO Pass 1: Compute Approx Size Factors
2023-12-21 19:09:32 378819 INFO Saved `obs_with_size_factor` TileDB Array
2023-12-21 19:09:32 378819 INFO Pass 2: Compute Estimators
2023-12-21 19:09:33 378819 INFO Pass 2: Processing 36227903 cells and 60664 genes
2023-12-21 19:10:12 378819 INFO Pass 2: Created new estimators cube
2023-12-21 19:11:08 378819 INFO Pass 2: Start X batch 1, cells=8254, nnz=12209544
2023-12-21 19:45:52 378819 INFO Pass 2: End X batch 1, cells=8254, nnz=12209544
2023-12-21 19:46:32 378819 INFO Pass 2: Writing to estimator cube.
2023-12-21 19:47:10 378819 INFO Validating estimators cube
2023-12-21 19:53:30 378819 INFO Validation complete
On the atol/memento/880-cube-builder-optimizations--normalize-obs-dims branch, I introduced an obs_groups array, splitting out the categorical dims columns from the estimators cube. The estimators cube now has an obs_group_joinid column, along with the original estimators columns. The obs_groups array is used for querying the cube, and the result is then joined to the estimators cube via obs_group_joinid. This allows for a drastically reduced cube size, and potentially faster grouping operations during cube building. The impact on DE querying performance is not yet clear, but I expect it to improve as well: the filtering query is performed against a much smaller array, and the resulting obs_group_joinid fetch from the estimators array should be very efficient, as that is the primary dimension of the array.
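The normalization described above can be sketched in plain Python. This is only an illustration of the idea, not the builder's TileDB implementation. The sample rows, the dim names `cell_type` and `tissue`, and the helpers `normalize` and `query` are all hypothetical; only `obs_groups` and `obs_group_joinid` come from the description.

```python
# Hypothetical stand-in for the denormalized cube: every row repeats its
# categorical dims next to the gene and the estimator values.
rows = [
    {"cell_type": "T cell", "tissue": "blood", "gene": "g1", "mean": 1.5},
    {"cell_type": "T cell", "tissue": "blood", "gene": "g2", "mean": 0.2},
    {"cell_type": "B cell", "tissue": "blood", "gene": "g1", "mean": 0.7},
    {"cell_type": "T cell", "tissue": "lung", "gene": "g1", "mean": 1.1},
]
DIMS = ("cell_type", "tissue")


def normalize(rows, dims):
    """Split rows into (obs_groups, estimators) linked by obs_group_joinid."""
    group_ids = {}  # tuple of dim values -> obs_group_joinid
    estimators = []
    for row in rows:
        key = tuple(row[d] for d in dims)
        joinid = group_ids.setdefault(key, len(group_ids))
        rest = {k: v for k, v in row.items() if k not in dims}
        estimators.append({"obs_group_joinid": joinid, **rest})
    obs_groups = [
        {"obs_group_joinid": joinid, **dict(zip(dims, key))}
        for key, joinid in group_ids.items()
    ]
    return obs_groups, estimators


def query(obs_groups, estimators, **filters):
    """Filter the small obs_groups table, then fetch estimators by join id."""
    ids = {
        g["obs_group_joinid"]
        for g in obs_groups
        if all(g[k] == v for k, v in filters.items())
    }
    return [e for e in estimators if e["obs_group_joinid"] in ids]


obs_groups, estimators = normalize(rows, DIMS)
t_cell_rows = query(obs_groups, estimators, cell_type="T cell")
```

Each distinct combination of dim values is stored once in `obs_groups`, and the estimator rows carry only an integer join id. The size saving grows with the number of genes per group, since the dims are no longer repeated on every estimator row.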
Full cube build test is in progress.
Profiling for commit 06f8d87:
$ PROFILE_MODE=1 python -O -m estimators_cube_builder --cube-uri /mnt/census/estimators-cube-profile-single-06f8d87/ --experiment-uri /mnt/census/2023-10-23/census_data/homo_sapiens 2>&1 | tee /mnt/census/build_profile_single-06f8d87.log
2024-01-04 13:06:07 2581260 INFO Pass 1: Processing 36227903 cells and 60664 genes
2024-01-04 13:06:07 2581260 INFO Pass 1: Compute Approx Size Factors
2024-01-04 13:06:59 2581260 INFO Saved `obs_with_size_factor` TileDB Array
2024-01-04 13:06:59 2581260 INFO Pass 2: Compute Estimators
2024-01-04 13:07:00 2581260 INFO Pass 2: Processing 36227903 cells and 60664 genes
2024-01-04 13:07:19 2581260 INFO Pass 2: Computing obs groups
2024-01-04 13:08:35 2581260 INFO Pass 2: Creating new estimators cube
2024-01-04 13:08:55 2581260 INFO Pass 2: Starting estimators computation
2024-01-04 13:09:19 2581260 INFO Pass 2: Start X batch 1, cells=8254, nnz=12209544
2024-01-04 13:54:05 2581260 INFO Pass 2: End X batch 1, cells=8254, nnz=12209544
2024-01-04 13:54:05 2581260 INFO Pass 2: Writing to estimator cube.
2024-01-04 13:54:13 2581260 INFO Validating estimators cube
2024-01-04 14:01:00 2581260 INFO Validation complete
As of 06f8d87, run time is essentially unchanged at 20h 25m, but cube size is down from 64GB to 23GB:
2024-01-04 16:23:51 2954 INFO Pass 2: Completed 3382 of 3382 batches, batches=100.0%, cells=100.0%, elapsed=20:25:01.332278, est. total time=20:25:01.332278, est. remaining time=0:00:00
$ du -sh estimators-cube-profile-06f8d87/
23G estimators-cube-profile-06f8d87/
The cube has not been consolidated, but the change in disk space after consolidation should be negligible.
Next step would be to try 32-bit floats instead of 64-bit floats.
Have merged the current work to the epic branch. Additional speed/size optimizations will be addressed on new branches.
Rewriting the estimators array to use 32-bit floats reduced its size from 23GB to 13GB:
$ du -sh /mnt/census/estimators-cube-profile-06f8d87{,.float32}/estimators/
23G /mnt/census/estimators-cube-profile-06f8d87/estimators/
13G /mnt/census/estimators-cube-profile-06f8d87.float32/estimators/
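As a rough illustration of the trade-off, the sketch below compares raw storage and rounding error for the same values held as 64-bit and 32-bit floats, using only the standard library. The sample values are made up and the comparison ignores TileDB compression and non-float columns, which may be why the on-disk reduction above is less than half.

```python
from array import array

# Made-up sample values standing in for estimator values.
values = [0.125, 3.14159265358979, 1234.5678]

as_f64 = array("d", values)  # C double, 8 bytes per value
as_f32 = array("f", values)  # C float, 4 bytes per value

bytes_f64 = as_f64.itemsize * len(as_f64)
bytes_f32 = as_f32.itemsize * len(as_f32)

# Relative error introduced by rounding each value to float32.
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(as_f64, as_f32))

print(f"float64: {bytes_f64} bytes, float32: {bytes_f32} bytes")
print(f"max relative rounding error: {max_rel_err:.2e}")
```

Whether this precision is adequate for mean and sem depends on how the values are used downstream, which would need to be checked against DE results.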
There were also more estimators that could be removed (only n_obs, mean, and sem are used). After removing them, switching to float32, and consolidating, the estimators array is now 8.3GB:
$ du -sh /mnt/census/estimators-cube-profile-06f8d87/estimators{,.float32-min-dims}
23G /mnt/census/estimators-cube-profile-06f8d87/estimators
8.3G /mnt/census/estimators-cube-profile-06f8d87/estimators.float32-min-dims
As of commit 70a7705: