dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Improve metrics being exposed to prometheus #4686

Open jacobtomlinson opened 3 years ago

jacobtomlinson commented 3 years ago

I've been futzing around with monitoring Dask with prometheus a little lately (some blog posts are in the pipeline).

I wanted to open an issue to discuss exposing more metrics via the /metrics endpoint.

Currently we expose the following:

Given that it is common to run additional exporters such as node-exporter, I don't think we need to worry about system-type metrics, but there are other things I can think of that might be nice to expose here. I would also love input from others if you can think of anything.
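For anyone following along, here is a minimal sketch of how to dump the metric families a running scheduler currently exposes. It assumes a local cluster with the dashboard on the default port 8787 and `prometheus_client` installed (which is what enables the `/metrics` endpoint):

```python
# Dump the current metric families from the scheduler's /metrics endpoint.
# Sketch only; assumes a LocalCluster with the dashboard on port 8787 and
# prometheus_client installed.
import requests
from prometheus_client.parser import text_string_to_metric_families

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1, dashboard_address=":8787")
client = Client(cluster)

resp = requests.get("http://localhost:8787/metrics")
resp.raise_for_status()

for family in text_string_to_metric_families(resp.text):
    print(family.name, "-", family.documentation)
```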

fjetter commented 3 years ago

A few more low-level things come to mind, particularly around some of the rather opaque warnings we're raising.

fjetter commented 3 years ago

I'm wondering if there is demand for some kind of integration with worker/scheduler plugins as well.
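As a rough sketch of what that could look like: a `WorkerPlugin` that counts task transitions into the `error` state with a `prometheus_client` Counter. The metric name here is made up for illustration, and whether a counter registered like this actually shows up on the worker's existing `/metrics` endpoint depends on which registry that endpoint serves.

```python
# Sketch of a plugin-based metrics integration: count errored tasks per worker.
from prometheus_client import Counter

from distributed.diagnostics.plugin import WorkerPlugin


class ErrorCounterPlugin(WorkerPlugin):
    def setup(self, worker):
        self.worker = worker
        self.errors = Counter(
            "dask_worker_task_errors",  # hypothetical metric name
            "Number of tasks that transitioned to the error state",
        )

    def transition(self, key, start, finish, **kwargs):
        if finish == "error":
            self.errors.inc()


# Registered from the client, e.g.:
# client.register_worker_plugin(ErrorCounterPlugin())
```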

randerzander commented 3 years ago

For Dask GPU workloads, we're often bottlenecked by storage throughput and/or latency. This is especially true in cloud deployments.

Having Dask expose storage-related task throughput (e.g. for read_csv, write_parquet, etc.) would be extremely helpful.

> Given that it is common to run additional exporters such as node-exporter, I don't think we need to worry about system-type metrics

node-exporter is useful for system metrics when the storage system is running locally. But when reading from HDFS, S3, GCS, ADLFS, etc., it's not useful, and workflow developers are left without a view into how storage performance impacts overall workflow performance (unless they manually monitor with custom code).
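To illustrate, this is roughly the kind of custom code people end up writing today to get a view of storage throughput. It's only a sketch: the S3 path is hypothetical and fsspec is assumed to be installed (as it usually is alongside s3fs/gcsfs/adlfs).

```python
# Manual, ad-hoc storage throughput measurement on a worker.
import time

import fsspec


def timed_read(path):
    start = time.perf_counter()
    with fsspec.open(path, mode="rb") as f:
        data = f.read()
    elapsed = time.perf_counter() - start
    mb = len(data) / 1e6
    print(f"{path}: {mb:.1f} MB in {elapsed:.2f}s ({mb / elapsed:.1f} MB/s)")
    return data


# e.g. run on the workers via client.submit(timed_read, "s3://bucket/part-0.parquet")
```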

randerzander commented 3 years ago

Currently it's difficult to know exactly which tasks induce an OOM failure.

It's common to work through a workflow by iteratively adding print(f'got here 1 {len(ddf)}') at various places to try to induce the failure as a diagnostic. This is painful.

It would be nice to expose task-level peak memory usage (both CPU and GPU), which would help avoid the above experience.
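Until something like that is exposed natively, the best stopgap I know of is manual instrumentation along these lines. It's a sketch and has real caveats: ru_maxrss is the *process* high-water mark (kilobytes on Linux), so it over-reports for tasks sharing a worker process, it assumes Unix workers, and it says nothing about GPU memory.

```python
# Decorator that reports the worker process's peak RSS after a task runs.
import resource
from functools import wraps


def report_peak_rss(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"{func.__name__}: peak RSS so far {peak_kb / 1024:.0f} MiB")
        return result

    return wrapper


# e.g. ddf = ddf.map_partitions(report_peak_rss(my_transform))
```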