chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
72 stars 19 forks source link

[python] Balance distributed partition size more evenly #1135

Open atolopko-czi opened 1 month ago

atolopko-czi commented 1 month ago

Fixes #1119

Generate the partitions by splitting on obs_joinids instead of on soma chunks.

Previously, when the number of soma chunks was not evenly divisible by the partition count ("world size"), partition sizes could become drastically imbalanced.

The new algo for shuffled, partitioned ids is:

  1. Chunk global ids into soma chunks
  2. Globally shuffle the chunks (do not shuffle the chunks internally)
  3. Create a flat list of global ids again
  4. Partition the flat list of globals id, and select the ids for the current partition
  5. Chunk into soma chunks (again) within the partition

When performing partitioning without shuffling, the soma chunks from step 1 may end up being split across two partitions. This introduces a minor read performance hit, since each original soma chunk that is split across partitions will result in 2 read operations instead of 1.

When performing partitioning with shuffling, each soma chunk from step 1 may end up being split across two soma chunks in step 5. This is because the latter chunks will not be "aligned" with the former chunks when the partition sizes are not evenly divisible by the chunk size. This introduces a 2x read performance hit in the worst case. Consider that each partition-local chunk may be composed of 2 original chunks that are not stored contiguously on disk. This could be addressed by explicitly aligning the step 5 chunks with the step 1 chunks, but this has not been implemented in the current PR.

Now, the worst case imbalance is that any two partitions will only differ in size by 1 row. This is due to the fact the numpy.array_split() evenly distributes the imbalances across the splits that it produces. Since only imbalances of size 1 can occur, it should not be necessary to drop rows or pad rows in any partition.

New unit test cases are included in test_distributed__returns_data_partition_for_rank, which now exercises the imbalanced cases as well. I've also added test_distributed__returns_data_partition_for_rank_globally_shuffled to demonstrate that global shuffling is maintained.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.12%. Comparing base (c796baf) to head (095c34c).

:exclamation: Current head 095c34c differs from pull request most recent head cd84ef2. Consider uploading reports for the commit cd84ef2 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1135 +/- ## ======================================= Coverage 91.12% 91.12% ======================================= Files 77 77 Lines 5857 5860 +3 ======================================= + Hits 5337 5340 +3 Misses 520 520 ``` | [Flag](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1135/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1135/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | `91.12% <100.00%> (+<0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.