dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.58k stars 720 forks source link

Fine performance metrics: Break down idle time on the Worker #7671

Open crusaderky opened 1 year ago

crusaderky commented 1 year ago

As of #7586, the sum of Worker.digests_total[("execute", *, *, "seconds")] is equal to the time spent executing tasks, multiplied by the number of threads on the worker.

There's a big chunk of extra time that is not counted, which is:

Worker.digests_total["execute", "n/a", "idle", "seconds"]  = max(0, 
    (number of threads - number of tasks in executing/cancelled/resumed state) * T
    - paused time
    - constrained time
    - gather time
))

With the above additions, the sum of Worker.digests_total[("execute", *, *, "seconds")] should accumulate to (number of threads * worker uptime) by construction when there are no tasks currently running.

The above formulas are a very quick draft and should be reviewed for correctness.

crusaderky commented 1 year ago

https://github.com/dask/distributed/pull/7938 introduces a metric on the spans that measures the total of the idle time. This is superior to just beaming the same information from the workers as it is aware of tasks running everywhere on the scheduler. This issue should be reviewed/redesigned in light of this.