Open atolopko-czi opened 1 month ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 91.12%. Comparing base (
c796baf
) to head (095c34c
).:exclamation: Current head 095c34c differs from pull request most recent head cd84ef2. Consider uploading reports for the commit cd84ef2 to get more accurate results
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Fixes #1119
Generate the partitions by splitting on obs_joinids instead of on soma chunks.
Previously, when the number of soma chunks was not evenly divisible by the partition count ("world size"), partition sizes could become drastically imbalanced.
The new algo for shuffled, partitioned ids is:
When performing partitioning without shuffling, the soma chunks from step 1 may end up being split across two partitions. This introduces a minor read performance hit, since each original soma chunk that is split across partitions will result in 2 read operations instead of 1.
When performing partitioning with shuffling, each soma chunk from step 1 may end up being split across two soma chunks in step 5. This is because the latter chunks will not be "aligned" with the former chunks when the partition sizes are not evenly divisible by the chunk size. This introduces a 2x read performance hit in the worst case. Consider that each partition-local chunk may be composed of 2 original chunks that are not stored contiguously on disk. This could be addressed by explicitly aligning the step 5 chunks with the step 1 chunks, but this has not been implemented in the current PR.
Now, the worst case imbalance is that any two partitions will only differ in size by 1 row. This is due to the fact the numpy.array_split() evenly distributes the imbalances across the splits that it produces. Since only imbalances of size 1 can occur, it should not be necessary to drop rows or pad rows in any partition.
New unit test cases are included in
test_distributed__returns_data_partition_for_rank
, which now exercises the imbalanced cases as well. I've also addedtest_distributed__returns_data_partition_for_rank_globally_shuffled
to demonstrate that global shuffling is maintained.