dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Add Prometheus gauge for task groups #8661

Closed hendrikmakait closed 3 weeks ago

hendrikmakait commented 3 weeks ago

As mentioned in #8656, a large number of task groups can put significant strain on the scheduler. This PR adds a gauge to the scheduler's Prometheus metrics to monitor the number of task groups.
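For readers unfamiliar with gauges, here is a minimal stdlib-only sketch of the idea, not the code in this PR. It renders the number of task groups as a gauge in the Prometheus text exposition format. `FakeScheduler`, `render_task_groups_gauge`, and the metric name `example_scheduler_task_groups` are all hypothetical names chosen for illustration; the only assumption carried over from the scheduler is that task groups are held in a dict-like `task_groups` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScheduler:
    """Hypothetical stand-in for the scheduler; it only tracks task groups by name."""

    task_groups: dict = field(default_factory=dict)


def render_task_groups_gauge(scheduler, name="example_scheduler_task_groups"):
    """Render the current task-group count as a gauge in Prometheus text format.

    The metric name is an assumption made for this sketch, not the name used
    by the PR.
    """
    value = len(scheduler.task_groups)
    return (
        f"# HELP {name} Number of task groups known to the scheduler\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


# Two task groups are registered, so the gauge reports 2.
scheduler = FakeScheduler(task_groups={"inc-abc": object(), "sum-def": object()})
exposition = render_task_groups_gauge(scheduler)
print(exposition)
```

A gauge fits here because the number of task groups can go down as well as up, unlike a counter, which only ever increases.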

hendrikmakait commented 3 weeks ago

cc @ntabris

github-actions[bot] commented 3 weeks ago

Unit Test Results

_See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests._

    29 files +1, 29 suites +1, 11h 15m 20s :stopwatch: +1h 45m 11s
    4 054 tests -4: 3 952 :white_check_mark: +9, 97 :zzz: -9, 5 :x: ±0
    55 841 runs +10 209: 53 672 :white_check_mark: +9 917, 2 163 :zzz: +309, 6 :x: ±0

For more details on these failures, see this check.

Results for commit ae372cf4. ± Comparison against base commit 9fae5dac.

This pull request removes 13 and adds 9 tests. Note that renamed tests count towards both.

```
distributed.protocol.tests.test_arrow
distributed.protocol.tests.test_collection
distributed.protocol.tests.test_highlevelgraph
distributed.protocol.tests.test_numpy
distributed.protocol.tests.test_pandas
distributed.shuffle.tests.test_graph
distributed.shuffle.tests.test_merge
distributed.shuffle.tests.test_merge_column_and_index
distributed.shuffle.tests.test_metrics
distributed.shuffle.tests.test_rechunk
…
```

```
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0]
distributed.http.scheduler.tests.test_scheduler_http ‑ test_prometheus_collect_task_groups
```

:recycle: This comment has been updated with latest results.