dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Fine performance metrics meta-issue #7665

Open crusaderky opened 1 year ago

crusaderky commented 1 year ago

XREFs

In #7586, we started collecting very granular metrics on how workers are spending their time. Demo: https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666

As of that PR, we collect metrics in Worker.digests_total about the activities broken down in the schema below.

This issue is a meta-tracker of all potential follow-ups, as well as a place to discuss high level design and cost/benefit ratios holistically.

The follow-ups can be broken down into two high-level threads:

- Improve quality and usability of the collected data
- What we do with the data

plus a few finishing touches.

crusaderky commented 1 year ago

Current data schema

#7701 (currently on main; will be in release 2023.4.1) introduces a new mapping, Scheduler.cumulative_worker_metrics.

Its entries are ever-increasing key -> float amount pairs.

The keys are as follows. Note that there are more keys than the ones listed below, and while all keys listed below are tuples, some keys may be bare strings.
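As a quick way to eyeball the schema, here is a minimal sketch (not part of the PR) that pulls the mapping off the scheduler with Client.run_on_scheduler; the scheduler address is hypothetical:

```python
from distributed import Client

client = Client("tcp://scheduler-host:8786")  # hypothetical address

# Scheduler.cumulative_worker_metrics is a plain mapping, so copying it
# into a dict is enough to inspect it client-side.
metrics = client.run_on_scheduler(
    lambda dask_scheduler: dict(dask_scheduler.cumulative_worker_metrics)
)
for key, amount in sorted(metrics.items(), key=str):
    # Most keys are tuples like ("execute", <task prefix>, <activity>, <unit>),
    # but some may be bare strings, so don't unpack them blindly.
    print(key, "->", amount)
```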

Worker.execute() metrics

Format

("execute", <task prefix>, <activity>, <unit>) -> float amount

All metrics with the same unit are additive. In a hypothetical, perfect scenario where all workers run tasks back to back non-stop, they would add up to the number of threads on the cluster multiplied by the cluster uptime (although seceded tasks will mess that up; see #7675). Metrics are captured upon task termination, so in the case of long-running tasks, frequent scraping may show artificial spikes that exceed your scraping interval (see #7677).
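To make the additivity concrete, here is a rough sketch that reuses the metrics dict from the snippet above and assumes a "seconds" unit exists for execute metrics, as in the linked demo:

```python
from collections import defaultdict

# Total time spent in each execute activity, across all task prefixes.
seconds_by_activity = defaultdict(float)
for key, amount in metrics.items():
    if isinstance(key, tuple) and key[0] == "execute" and key[-1] == "seconds":
        activity = key[2]  # ("execute", <task prefix>, <activity>, <unit>)
        seconds_by_activity[activity] += amount

total = sum(seconds_by_activity.values())
# In the perfect back-to-back scenario described above, `total` would
# approach nthreads_on_cluster * cluster_uptime (secession aside).
print(f"{total:.1f}s executing: {dict(seconds_by_activity)}")
```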

Individual labels

- Deserialize run_spec
- Unspill inputs
- Run task in thread
- Spill output (will disappear after #4424)
- Delta to end-to-end runtime as seen from the worker state machine

Future additions

- Failed tasks: time wasted on non-successful tasks. These metrics are recorded instead of the time metrics listed above.

Worker.gather_dep metrics

Format

("gather-dep", <activity>, <unit>) -> float amount

All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a gather_dep call, so in the case of long-running transfers, frequent scraping may show artificial spikes.
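The same aggregation works for gather_dep; the "seconds" and "bytes" units are assumptions based on the demo. Because transfers overlap, total seconds can legitimately exceed cluster uptime:

```python
gather_seconds = gather_bytes = 0.0
for key, amount in metrics.items():
    if isinstance(key, tuple) and key[0] == "gather-dep":
        if key[-1] == "seconds":
            gather_seconds += amount
        elif key[-1] == "bytes":
            gather_bytes += amount

if gather_seconds:
    # Aggregate bytes over aggregate comm-seconds: a rough per-comm average
    # throughput, not a measure of instantaneous network bandwidth.
    print(f"~{gather_bytes / gather_seconds / 2**20:.1f} MiB/s per comm")
```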

Individual labels

- Worker.gather_dep() method
- Spill output (will disappear after #4424)
- Delta to end-to-end runtime as seen from the worker state machine

Future additions

- Failed transfers: time wasted on non-successful transfers. These metrics are recorded instead of the time metrics listed above.

Worker.get_data() metrics

Format

("get-data", <activity>, <unit>) -> float amount

All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a get_data call, so in the case of long-running transfers, frequent scraping may show artificial spikes.

Individual labels

- Unspill
- Send over the network
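A mirror-image sketch for the serving side, grouping get-data time by activity; the exact activity strings are an implementation detail, so treat them as illustrative:

```python
# ("get-data", <activity>, <unit>) -> seconds spent per activity,
# e.g. unspilling vs. sending over the network.
get_data_seconds = {
    key[1]: amount
    for key, amount in metrics.items()
    if isinstance(key, tuple) and key[0] == "get-data" and key[-1] == "seconds"
}
print(get_data_seconds)
```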

crusaderky commented 1 year ago

Summary from an offline meeting with @fjetter, @hendrikmakait, and @ntabris: