dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Repeated calls to `memory_color` take around 12% of CPU time of scheduler #8763

Open jonded94 opened 1 month ago

jonded94 commented 1 month ago

Describe the issue:

Some follow-up to: https://github.com/dask/distributed/issues/8761

After fixing the above issue in https://github.com/dask/distributed/pull/8762, the next biggest CPU consumer on a scheduler with many workers (>2000) is the calls to `_cluster_memory_color`, more specifically `_memory_color`.

https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/dashboard/components/scheduler.py#L391

As far as I can see, this is about coloring the memory bar of a specific worker depending on whether it is deemed "good", "almost full" or "full".

Again, a speedscope profile (recorded without the fix from PR 8762):

[screenshot: speedscope profile of the scheduler]

speedscope.json

Is this something that could be solved by binning the memory load & size (surely the coloring doesn't have to be so exact that it is based on exact byte counts) and caching the result of the coloring as well?

Surely one doesn't have to recalculate, hundreds of times per second, which color a worker process with for example 1024/4096 MiB RAM should get, especially since the result doesn't change at all.
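
To make the idea concrete, here is a minimal sketch of binning plus caching. The function names, the bin count, and the color thresholds are made up for illustration and do not mirror `_memory_color`'s actual signature or thresholds:

```python
from functools import lru_cache

NUM_BINS = 64  # 64 steps is already finer resolution than the on-screen bar needs

@lru_cache(maxsize=None)
def _binned_memory_color(frac_bin: int) -> str:
    """Color for a memory-usage fraction that has been rounded to a bin."""
    frac = frac_bin / NUM_BINS
    if frac >= 0.95:   # assumed "full" threshold
        return "red"
    if frac >= 0.80:   # assumed "almost full" threshold
        return "orange"
    return "blue"

def memory_color(nbytes: int, limit: int) -> str:
    # Round exact byte counts down to one of NUM_BINS buckets so that repeated
    # calls with near-identical usage hit the cache instead of recomputing.
    if limit <= 0:
        return "blue"
    frac_bin = min(NUM_BINS, nbytes * NUM_BINS // limit)
    return _binned_memory_color(frac_bin)
```

With usage rounded into 64 buckets, at most 65 distinct inputs ever reach the cached helper, so the per-call color lookup becomes effectively free.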

Environment:

jonded94 commented 1 month ago

Ah, so what takes time here is asking the scheduler for the current memory situation across all workers. This also happens in the next biggest call, which takes around 7.0% of CPU time: https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/dashboard/components/scheduler.py#L413

So there are at least two separate calls that end up measuring the total cluster memory situation, is that correct? Is that something one could cache at least per tick, or even compute only once every 10 ticks or so?
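
For illustration only, a rough sketch of what per-interval caching of such an aggregate could look like; the decorator, the 2-second window, and the `ws.memory.process` attribute access are assumptions, not distributed's actual code:

```python
import time
from functools import wraps

def cached_for(seconds: float):
    """Re-run the wrapped function at most once per time window.

    Simplified sketch: the cache ignores argument identity, which is fine
    when the function is always called with the same scheduler instance.
    """
    def decorator(func):
        state = {"value": None, "expires": 0.0}

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            if now >= state["expires"]:
                state["value"] = func(*args, **kwargs)
                state["expires"] = now + seconds
            return state["value"]

        return wrapper
    return decorator

@cached_for(seconds=2.0)
def total_cluster_memory(workers):
    # `workers` is assumed to be a mapping of worker-state objects exposing a
    # process-memory figure; adjust the attribute to whatever the scheduler
    # actually provides.
    return sum(ws.memory.process for ws in workers.values())
```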

jonded94 commented 1 month ago

Another tangential question: can the tick rate of the dashboard be adjusted somehow? I would be totally fine with an update every 10 s or so; I mostly use internal Grafana dashboards fed by the Prometheus metrics anyway. On top of that, we have very long-running tasks, so frequent updates don't even help.

jonded94 commented 1 month ago

There is --no-show for disabling the dashboard entirely, but this parameter is apparently unused :(

https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/cli/dask_scheduler.py#L127

Using --no-dashboard will disable metrics too, which is undesirable :/ We'd want Prometheus metrics that we can look at asynchronously.

fjetter commented 1 month ago

> Another tangential question: can the tick rate of the dashboard be adjusted somehow? I would be totally fine with an update every 10 s or so; I mostly use internal Grafana dashboards fed by the Prometheus metrics anyway. On top of that, we have very long-running tasks, so frequent updates don't even help.

This is not currently exposed. It's also a bit messy since the update interval for every widget is controlled individually; see for example https://github.com/dask/distributed/blob/8564dc79a1b9902eb0320d51b034e0623a2afe8b/distributed/dashboard/scheduler.py#L74-L116

However, some docs hard-code this, e.g. system_monitor here https://github.com/dask/distributed/blob/8564dc79a1b9902eb0320d51b034e0623a2afe8b/distributed/dashboard/components/scheduler.py#L4543-L4551
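
For context, here is a simplified, self-contained illustration of the pattern using bokeh's `Document.add_periodic_callback`; the `DummyPanel` class and the 500 ms period are placeholders rather than the component and value in the linked code:

```python
from bokeh.document import Document
from bokeh.models import Div

class DummyPanel:
    """Stand-in for a dashboard component such as the system monitor."""

    def __init__(self):
        self.root = Div(text="memory: ?")
        self.ticks = 0

    def update(self):
        self.ticks += 1
        self.root.text = f"updated {self.ticks} times"

def dummy_doc(doc: Document):
    panel = DummyPanel()
    doc.add_root(panel.root)
    # The refresh period (in ms) is baked in at registration time, once per
    # page, which is why there is no single knob that slows everything down.
    doc.add_periodic_callback(panel.update, 500)
```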

> Using --no-dashboard will disable metrics too, which is undesirable :/ We'd want Prometheus metrics that we can look at asynchronously.

No, metrics are still available; at least they are when I try. You can check by opening http://localhost:<dashboard_port>/metrics
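
For example, a quick check from Python, assuming the default dashboard port 8787 (substitute your own port):

```python
import urllib.request

# Fetch the Prometheus exposition-format text from the scheduler.
with urllib.request.urlopen("http://localhost:8787/metrics") as resp:
    text = resp.read().decode()

# Print just the metric descriptions ("# HELP" lines).
for line in text.splitlines():
    if line.startswith("# HELP"):
        print(line)
```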

jonded94 commented 1 month ago

> This is not currently exposed. It's also a bit messy since the update interval for every widget is controlled individually

Hm, it would be nice if that could be controlled somehow, but yeah, one would have to spend a bit of time here to make all of these values configurable, right? Maybe one could introduce a single configurable parameter by which all of these update intervals are multiplied/divided?
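
Something along these lines, purely as a sketch; the config key below is invented and does not exist in distributed today:

```python
import dask.config

def scaled_interval(base_ms: int) -> int:
    """Scale a widget's hard-coded refresh period by one global factor.

    The key "distributed.dashboard.refresh-multiplier" is hypothetical and
    only illustrates the suggestion above.
    """
    factor = dask.config.get("distributed.dashboard.refresh-multiplier", default=1.0)
    return int(base_ms * factor)

# A widget registration would then call, e.g.:
#     doc.add_periodic_callback(component.update, scaled_interval(100))
# instead of hard-coding the 100 ms period.
```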

> No, metrics are still available.

Ah, sorry! I just started a pipeline with --no-dashboard, and the metrics indeed seem to be there. I recall this being different for the worker dashboards; there it seemed metrics only became available once the worker dashboard was enabled.

I can keep you updated on what the scheduler health looks like and where it spends most of its compute as soon as I start a really large pipeline again.

jonded94 commented 1 month ago

Hey @fjetter! We tried a bunch of even larger-scale workloads (>3k workers), optimized our stack a bit by not submitting too many tasks at once, disabled the dashboard (--no-dashboard), and got the clusters to be quite stable :)

With these measures, we saw scheduler tick durations of ~3-30 ms and scheduler CPU load of ~5-35%, with the latter mostly on the lower end.

That is quite okay for us right now. I have to say, though, that enabling the dashboard and actually using it / having it open in a browser pegs the scheduler at 100% almost immediately and also frequently causes the dashboard to lose its connection / stop updating.

EDIT: Nevertheless, I'd still be interested in either slowing down dashboard updates through a config option or improving the performance of the compute-heavy parts of the scheduler. Let me know if you know a good place to start improving something!