Scheduler.cumulative_worker_metrics
These are ever-increasing key->float amount pairs.
Note that there may be (in fact, there are) more keys than the ones listed below, and that while all keys listed below are tuples, some keys may be bare strings.
("execute", <task prefix>, <activity>, <unit>) -> float amount
All metrics with the same unit are additive. In a hypothetical, perfect scenario where all workers run tasks back to back non-stop, they would add up to the number of threads on the cluster multiplied by the cluster uptime (although seceded tasks will mess that up; read #7675). Metrics are captured upon task termination, so in the case of long-running tasks, scraping frequently may show artificial spikes that exceed your scraping interval (read #7677).
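As a rough sanity check of that additivity claim, a hedged sketch (reusing `client` and `metrics` from the snippet above; `uptime_seconds` is a hypothetical value the operator would track):

```python
# Compare total "execute" seconds against thread capacity.
uptime_seconds = 3600.0  # hypothetical: one hour of cluster uptime

total_threads = sum(
    w["nthreads"] for w in client.scheduler_info()["workers"].values()
)
execute_seconds = sum(
    v
    for k, v in metrics.items()
    if isinstance(k, tuple) and k[0] == "execute" and k[-1] == "seconds"
)
# In the perfect back-to-back scenario, execute_seconds approaches
# total_threads * uptime_seconds (seceded tasks aside; see #7675).
print(f"{execute_seconds:,.0f}s used of {total_threads * uptime_seconds:,.0f}s capacity")
```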
("execute", <prefix>, "deserialize", "seconds")
("execute", <prefix>, "disk-read", "seconds")
("execute", <prefix>, "disk-read", "count")
("execute", <prefix>, "disk-read", "bytes")
("execute", <prefix>, "decompress", "seconds")
("execute", <prefix>, "deserialize", "seconds")
(overlaps with run_spec deserialization)("execute", <prefix>, "executor", "seconds")
("execute", <prefix>, "thread-cpu", "seconds")
("execute", <prefix>, "thread-noncpu", "seconds")
("execute", <prefix>, <arbitrary user-defined label>, <arbitrary user-defined unit>)
("execute", <prefix>, "serialize", "seconds")
("execute", <prefix>, "compress", "seconds")
("execute", <prefix>, "disk-write", "seconds")
("execute", <prefix>, "disk-write", "count")
("execute", <prefix>, "disk-write", "bytes")
("execute", "z", "other", "seconds")
("execute", <prefix>, "thread-I/O", "seconds")
(dask/dask#10084)("execute", <prefix>, "offload", "seconds")
(#7681)("execute", <prefix>, "zict-offload", "seconds")
(#7681 + #4424)("execute", <prefix>, "re-execute", "seconds")
(#7676)("execute", "n/a", "paused", "seconds")
(#7671)("execute", "n/a", "constrained", "seconds")
(#7671) ("execute", "n/a", "gather-dep", "seconds")
(#7671)("execute", "n/a", "idle", "seconds")
(#7671)Time wasted on non-successful tasks. These metrics are instead of the time metrics listed above.
("execute", <prefix>, "failed", "seconds")
("execute", <prefix>, "cancelled", "seconds")
("gather-dep", <activity>, <unit>) -> float amount
All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these metrics will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a gather_dep call, so in the case of long-running transfers, scraping frequently may show artificial spikes.
("gather-dep", "network", "seconds")
("gather-dep", "decompress", "seconds")
("gather-dep", "deserialize", "seconds")
("gather-dep", "serialize", "seconds")
("gather-dep", "compress", "seconds")
("gather-dep", "disk-write", "seconds")
("gather-dep", "disk-write", "count")
("gather-dep", "disk-write", "bytes")
("gather-dep", "other", "seconds")
("gather-dep", "offload", "seconds")
(#7681)Time wasted on non-successful transfers. These metrics are instead of the time metrics listed above.
("gather-dep", "busy", "seconds")
("gather-dep", "missing", "seconds")
("gather-dep", "failed", "seconds")
("gather-dep", "cancelled", "seconds")
("get-data", <activity>, <unit>) -> float amount
All metrics with the same unit are additive. A worker may have more than one network comm active at the same time, so these metrics will likely add up to more than the uptime of the cluster. Metrics are captured upon termination of a get_data call, so in the case of long-running transfers, scraping frequently may show artificial spikes.
("get-data", "disk-read", "seconds")
("get-data", "disk-read", "count")
("get-data", "disk-read", "bytes")
("get-data", "decompress", "seconds")
("get-data", "deserialize", "seconds")
("get-data", "serialize", "seconds")
("get-data", "compress", "seconds")
("get-data", "network", "seconds")
Summary from an offline meeting with @fjetter, @hendrikmakait, and @ntabris:
XREFs
#7217
#7565
#7601
#7586
In #7586, we started collecting very granular metrics on how workers are spending their time. Demo: https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666
As of that PR, we collect metrics in Worker.digests_total about:
Worker.execute, broken down by task prefix and activity, with special treatment for failed and cancelled tasks
Worker.gather_dep, broken down by activity, with special treatment for failed and cancelled transfers
Worker.get_data, broken down by activity
WorkerMemoryMonitor._spill, broken down by activity
This issue is a meta-tracker of all potential follow-ups, as well as a place to discuss high-level design and cost/benefit ratios holistically.
The follow-ups can be broken down into two high-level threads:
Improve quality and usability of collected data
#7666
#7677
#7938
#7671
#7672
#7675
#7676
#7678
#7681
What we do with the data
#7667
#7668
#7679
#7831
#7832
#7848
#7875
#7893
#7908
#7910
#7911
#7680
#7776
#7787
#7825
#7790
Finishing touches
#7673
#7674