dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Fine performance metrics meta-issue #7665

Open crusaderky opened 1 year ago

crusaderky commented 1 year ago

XREFs

In #7586, we started collecting very granular metrics on how workers are spending their time. Demo: https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666

As of that PR, we collect metrics in Worker.digests_total about the activities broken down in the schema below.

This issue is a meta-tracker of all potential follow-ups, as well as a place to discuss high level design and cost/benefit ratios holistically.

The follow-ups can be broken down into two high-level threads:

- Improve quality and usability of the collected data
- What we do with the data

plus a few finishing touches.

crusaderky commented 1 year ago

Current data schema

#7701 (currently on main; will be in release 2023.4.1) introduces a new mapping, Scheduler.cumulative_worker_metrics.

Its entries are ever-increasing key -> float amount pairs.

The keys are as follows. Note that there are more keys than the ones listed below, and while all keys listed below are tuples, some keys may be bare strings.
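As a quick way to eyeball the schema, here is a minimal sketch (not part of the PR) that pulls the mapping off the scheduler with Client.run_on_scheduler; the scheduler address is hypothetical:

```python
from distributed import Client

client = Client("tcp://scheduler-host:8786")  # hypothetical address

# Scheduler.cumulative_worker_metrics is a plain mapping, so copying it
# into a dict is enough to inspect it client-side.
metrics = client.run_on_scheduler(
    lambda dask_scheduler: dict(dask_scheduler.cumulative_worker_metrics)
)
for key, amount in sorted(metrics.items(), key=str):
    # Most keys are tuples like ("execute", <task prefix>, <activity>, <unit>),
    # but some may be bare strings, so don't unpack them blindly.
    print(key, "->", amount)
```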

Worker.execute() metrics

Format

("execute", <task prefix>, <activity>, <unit>) -> float amount

All metrics with the same unit are additive. In a hypothetical, perfect scenario where all workers run tasks back to back non-stop, they would add up to the number of threads on the cluster multiplied by the cluster uptime (although seceded tasks will mess that up; see #7675). Metrics are captured upon task termination, so in the case of long-running tasks, frequent scraping may show artificial spikes that exceed your scraping interval (see #7677).
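To make the additivity concrete, here is a rough sketch that reuses the metrics dict from the snippet above and assumes a "seconds" unit exists for execute metrics, as in the linked demo:

```python
from collections import defaultdict

# Total time spent in each execute activity, across all task prefixes.
seconds_by_activity = defaultdict(float)
for key, amount in metrics.items():
    if isinstance(key, tuple) and key[0] == "execute" and key[-1] == "seconds":
        activity = key[2]  # ("execute", <task prefix>, <activity>, <unit>)
        seconds_by_activity[activity] += amount

total = sum(seconds_by_activity.values())
# In the perfect back-to-back scenario described above, `total` would
# approach nthreads_on_cluster * cluster_uptime (secession aside).
print(f"{total:.1f}s executing: {dict(seconds_by_activity)}")
```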

Individual labels

- Deserialize run_spec
- Unspill inputs
- Run task in thread
- Spill output (will disappear after #4424)
- Delta to end-to-end runtime as seen from the worker state machine

Future additions

- Failed tasks: time wasted on non-successful tasks. These metrics are recorded instead of the time metrics listed above.

Worker.gather_dep metrics

Format

("gather-dep", <activity>, <unit>) -> float amount

All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a gather_dep call, so in the case of long-running transfers, frequent scraping may show artificial spikes.
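The same aggregation works for gather_dep; the "seconds" and "bytes" units are assumptions based on the demo. Because transfers overlap, total seconds can legitimately exceed cluster uptime:

```python
gather_seconds = gather_bytes = 0.0
for key, amount in metrics.items():
    if isinstance(key, tuple) and key[0] == "gather-dep":
        if key[-1] == "seconds":
            gather_seconds += amount
        elif key[-1] == "bytes":
            gather_bytes += amount

if gather_seconds:
    # Aggregate bytes over aggregate comm-seconds: a rough per-comm average
    # throughput, not a measure of instantaneous network bandwidth.
    print(f"~{gather_bytes / gather_seconds / 2**20:.1f} MiB/s per comm")
```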

Individual labels

- Worker.gather_dep() method
- Spill output (will disappear after #4424)
- Delta to end-to-end runtime as seen from the worker state machine

Future additions

- Failed transfers: time wasted on non-successful transfers. These metrics are recorded instead of the time metrics listed above.

Worker.get_data() metrics

Format

("get-data", <activity>, <unit>) -> float amount

All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a get_data call, so in the case of long-running transfers, frequent scraping may show artificial spikes.

Individual labels

- Unspill
- Send over the network
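A mirror-image sketch for the serving side, grouping get-data time by activity; the exact activity strings are an implementation detail, so treat them as illustrative:

```python
# ("get-data", <activity>, <unit>) -> seconds spent per activity,
# e.g. unspilling vs. sending over the network.
get_data_seconds = {
    key[1]: amount
    for key, amount in metrics.items()
    if isinstance(key, tuple) and key[0] == "get-data" and key[-1] == "seconds"
}
print(get_data_seconds)
```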

crusaderky commented 1 year ago

Summary from an offline meeting with @fjetter, @hendrikmakait, and @ntabris: