dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Add Prometheus metrics that track cumulative counts of handler calls #8696

Open hendrikmakait opened 1 week ago

hendrikmakait commented 1 week ago

To make distributed more observable and allow users to see what's happening under the hood, it would be nice to count handler calls and expose those counts as Prometheus metrics. This would help us understand various scenarios, e.g., frequent who_has calls when data goes missing.
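For illustration, a minimal sketch of what such counting could look like with prometheus_client, using a single counter with a `handler` label rather than one metric per handler. The metric name and the `count_handler_calls` helper are assumptions for this example, not distributed's actual code:

```python
from prometheus_client import Counter

# Hypothetical metric; distributed's real implementation may differ.
HANDLER_CALLS = Counter(
    "dask_scheduler_handler_calls_total",
    "Cumulative number of calls per scheduler handler",
    ["handler"],
)


def count_handler_calls(handlers):
    """Wrap a ``{name: callable}`` handler table so every call increments
    the counter for that handler before delegating to the original."""

    def wrap(name, func):
        def wrapper(*args, **kwargs):
            HANDLER_CALLS.labels(handler=name).inc()
            return func(*args, **kwargs)

        return wrapper

    return {name: wrap(name, func) for name, func in handlers.items()}
```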

fjetter commented 1 week ago

We should keep an eye on cardinality here if we expose all the stream handlers. Right now, I'm counting about 80 on the scheduler, excluding extensions. So if we exposed all handlers, that would easily be ~100 counters.

I guess the most valuable thing would be to instrument the scheduler, so having this once per cluster might be fine, but having it on every worker across an entire cluster might be too much.

@ntabris any strong reactions to these numbers?

hendrikmakait commented 1 week ago

> We should keep an eye on cardinality here if we expose all the stream handlers. Right now, I'm counting about 80 on the scheduler, excluding extensions. So if we exposed all handlers, that would easily be ~100 counters.

Do we know whether it's possible (and common practice) to filter Prometheus metrics by dimension upon collection? I think it would be nice if we could expose metrics for all handlers (importantly, also worker handlers, where we'd see a further explosion due to the worker count) and let users decide which handlers are important to them.

fjetter commented 1 week ago

IIUC, scrape_config allows one to scrape only part of the metrics, but I would prefer not to make it too difficult for users to hook up with Prometheus.
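For reference, a Prometheus scrape job can drop individual series at collection time via metric_relabel_configs; a minimal sketch, with a hypothetical metric name, handler names, and target:

```yaml
scrape_configs:
  - job_name: dask-scheduler
    static_configs:
      - targets: ["scheduler:8787"]  # example target; adjust to your deployment
    metric_relabel_configs:
      # Drop the handler-call counters for handlers we don't care about;
      # everything else exposed by the endpoint is kept as-is.
      - source_labels: [__name__, handler]
        regex: "dask_scheduler_handler_calls_total;(heartbeat_worker|gather)"
        action: drop
```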

hendrikmakait commented 1 week ago

> IIUC, scrape_config allows one to scrape only part of the metrics, but I would prefer not to make it too difficult for users to hook up with Prometheus.

If the main goal is not to make it too difficult for users to hook up Dask with Prometheus, my suggestion is to either provide a scrape_config with sane defaults or implement some customization mechanism that lets users configure which metrics they want Dask to expose in the first place. The latter should also come with sane defaults.

We don't know which handlers will be useful, so I wouldn't want to artificially limit what users can see, but rather give them the ability to see everything and pick and choose themselves.
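To make the second option concrete, here is a rough sketch of what such a pick-and-choose mechanism could look like; the config key, default list, and glob semantics are all invented for illustration:

```python
import fnmatch

import dask.config

# Invented "sane defaults": only a handful of handlers exposed out of the box.
DEFAULT_HANDLERS = ["who_has", "gather", "scatter"]


def exposed_handlers(all_handlers):
    """Return the handler names whose call counters should be exposed,
    based on glob patterns from a hypothetical config key."""
    patterns = dask.config.get(
        "distributed.admin.prometheus.handlers", DEFAULT_HANDLERS
    )
    if patterns == ["*"]:  # escape hatch: expose everything
        return list(all_handlers)
    return [
        name
        for name in all_handlers
        if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    ]
```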