dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.55k stars 712 forks source link

Reduce task group count for partial P2P rechunks #8655

Closed hendrikmakait closed 4 weeks ago

hendrikmakait commented 1 month ago

Closes #8656

Previously, independent partials of a P2P rechunk would create their own independent task groups. This could lead to an explosion of the task group count and slow down the scheduler. This PR fixes that.

github-actions[bot] commented 1 month ago

Unit Test Results

_See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests._

    28 files   -     1      28 suites   - 1   10h 51m 38s :stopwatch: + 52m 45s  4 053 tests  -     4   3 950 :white_check_mark: +   10     97 :zzz:  -   9  6 :x:  - 1  54 399 runs  +6 165  52 292 :white_check_mark: +5 997  2 099 :zzz: +185  8 :x: ±0 

For more details on these failures, see this check.

Results for commit 2c26a72d. ± Comparison against base commit b3f8fcd0.

This pull request removes 13 and adds 9 tests. Note that renamed tests count towards both. ``` distributed.protocol.tests.test_arrow distributed.protocol.tests.test_collection distributed.protocol.tests.test_highlevelgraph distributed.protocol.tests.test_numpy distributed.protocol.tests.test_pandas distributed.shuffle.tests.test_graph distributed.shuffle.tests.test_merge distributed.shuffle.tests.test_merge_column_and_index distributed.shuffle.tests.test_metrics distributed.shuffle.tests.test_rechunk … ``` ``` distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0] distributed.shuffle.tests.test_rechunk ‑ test_partial_rechunk_taskgroups ```

:recycle: This comment has been updated with latest results.

hendrikmakait commented 1 month ago

@hendrikmakait who around do you think would be good to review more deeply here? Do you want me to take a longer look? Maybe Alex has knowledge here?

Alex might have some knowledge here, but he's out today. We have reasonable test coverage for rechunking, so a careful sanity check should suffice here. (In fact, it caught an issue I had introduced early in creating this fix.) @wence- would probably be a good fit, but I'm not sure whether he has any capacity.

I'm currently running an A/B test as well to compare performance.

mrocklin commented 1 month ago

Maybe I'm saying "let me know if I my getting engaged would meaningfully accelerate things"

hendrikmakait commented 1 month ago

Maybe I'm saying "let me know if I my getting engaged would meaningfully accelerate things"

If you have the time to review this today and approve it, that's the quickest way of getting this PR merged. From previous interactions, it sounded like Alex hasn't dug deeply into P2P before, so he probably doesn't know more than you here.

I'm still waiting for A/B test results from this run: https://github.com/coiled/benchmarks/actions/runs/9257990760/job/25467247783

hendrikmakait commented 4 weeks ago

Thanks for the review, @wence-!