Improve metrics being exposed to prometheus

jacobtomlinson commented 3 years ago

I've been futzing around with monitoring Dask with prometheus a little lately (some blog posts are in the pipeline).

I wanted to open an issue to discuss exposing more metrics via the /metrics endpoint.

Currently we expose the following:

Scheduler
- Number of clients
- Number of workers and their states (connected, saturated, idle)
- Desired number of workers
- Forgotten tasks
- Tasks and their states (released, no-worker, waiting, memory, erred, processing)
Workers
- Tasks and their states (stored, executing, waiting, ready, serving)
- Number of connections to other workers
- Latency of worker connections
- Number of threads

Given that it is common to run additional exporters such as the node-exporter I don't think we need to worry about system type metrics, but there are other things I can think of that might be nice to expose here. I would also love input from others if you can think of things.

Managed vs unmanaged worker memory (based on the work @crusaderky has been doing in #4634/#4651)
Some proxy for the task stream data. Perhaps something like percentages of time spent processing, communicating, swapping, idle. Perhaps something like this from @Timost in #4260.

fjetter commented 3 years ago

task group/prefix based metrics could also be good candidates
big +1 for time spent doing {spilling,network,compute}. The only other way to get this is doing performance reports but having this realtime would be great
Some proxy for work stealing might also proof valuable. Something like a counter for the number of tasks stolen.
In rare situations, I could imagine "last_seen" for workers to be useful (low prio)

A few more low level things, thinking about some of the rather opaque warnings we're raising

We're issuing warnings about GC collect durations (once beyond a threshold). having this as a timeseries available might be helpful debugging/undestanding where the GC is active. We also expose the tick duration so this would not be highly unusual
Another warning is the "memory limit reached but no data to spill to disk" where #4634 would be a perfect addition

fjetter commented 3 years ago

I'm wondering if there is a demand for having some kind of integration to worker/scheduler plugins as well

randerzander commented 3 years ago

For Dask GPU workloads, we're often bottlenecked by storage throughput and/or latency. This is especially true in cloud deployments.

Having Dask expose storage related task throughput (e.g. for read_csv, write_parquet, etc) would be extremely helpful.

Given that it is common to run additional exporters such as the node-exporter I don't think we need to worry about system type metrics

node-exporter for system metrics is useful if the storage system is running locally. But for reading from HDFS, S3, GCS, ADLFS, etc it's not useful, and workflow developers are left without a view into how storage perf impacts overall workflow perf (unless they manually monitor with custom code).

randerzander commented 3 years ago

Currently it's difficult to know exactly which tasks induce an OOM failure.

It's common to proceed through a workflow doing something like iteratively adding print(f'got here 1 {len(ddf)}') at various places to try inducing failure as a diagnostic. This is painful.

It would be nice to expose task-level peak memory usage (both CPU and GPU) which would help avoid the above experience.

dask / distributed

Improve metrics being exposed to prometheus #4686