dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Observed worker network bandwidth chart #5090

Closed mrocklin closed 3 years ago

mrocklin commented 3 years ago

We have a couple of charts for inter-worker bandwidth that use our own metrics. I don't trust these metrics that much. We also keep per-worker metrics. These are currently reported in the /workers tab, with read/write at the end of the table. I think that these are recorded with psutil and sent in the worker heartbeats. We might want to create another real-time bar chart like occupancy/cpu/memory, but for read/write traffic.

This is slightly more complex because there are a couple of different values to show for each worker. As options I could imagine ...

  1. Summing these up
  2. Overplotting both with colors that make it clear which one is ahead
  3. Plotting two skinnier rectangles per worker

fjetter commented 3 years ago

I would actually love to see the measured bandwidths as a timeseries plot, since they vary quite a bit over time, but I'm not sure how feasible this is.

mrocklin commented 3 years ago

It's pretty feasible. We would copy the /system charts.

Personally, I'd love both :)

mrocklin commented 3 years ago

For context we already have this data rendered in the workers chart here:

image

(see the last two columns, the top row is the sum)

It would be good to have this same information rendered in much the same way as we do for the cpu/occupancy/nbytes charts for realtime status:

image

As well as a timeseries plot, much like how we do the /system charts

image

Personally, for the realtime chart I would just copy the occupancy/cpu/... charts, swap out CPU stats for network io stats (consulting the /workers chart to see where those are coming from) and then maybe that's it.

For the timeseries chart I would probably just plot the total bandwidth, rather than one value for every worker. If we wanted to be clever we could have both a left and right axis for total and average bandwidth respectively.
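
To make the total-versus-average idea concrete, here is a minimal sketch of the bookkeeping a timeseries chart like this might need. It is not the dashboard's actual code: `summarize_bandwidth` and `BandwidthHistory` are hypothetical names, and the input is assumed to be a plain mapping from worker address to a dict with `read_bytes` and `write_bytes` values.

```python
from collections import deque


def summarize_bandwidth(worker_metrics):
    """Return (total, average) read+write traffic across workers.

    worker_metrics is a hypothetical mapping of worker address to a dict
    holding "read_bytes" and "write_bytes"; it stands in for whatever the
    scheduler keeps from worker heartbeats.
    """
    per_worker = [
        m.get("read_bytes", 0) + m.get("write_bytes", 0)
        for m in worker_metrics.values()
    ]
    total = sum(per_worker)
    average = total / len(per_worker) if per_worker else 0.0
    return total, average


class BandwidthHistory:
    """Bounded history of total and average bandwidth samples."""

    def __init__(self, maxlen=1000):
        # deque(maxlen=...) silently drops the oldest sample when full
        self.times = deque(maxlen=maxlen)
        self.total = deque(maxlen=maxlen)
        self.average = deque(maxlen=maxlen)

    def update(self, now, worker_metrics):
        total, average = summarize_bandwidth(worker_metrics)
        self.times.append(now)
        self.total.append(total)
        self.average.append(average)
```

The two series could then be drawn against a left and a right axis respectively; how that maps onto the plotting library is left out here on purpose.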

ncclementi commented 3 years ago

After looking into this a bit more, I have some questions/comments. For the first plot (the one that looks like the nbytes chart):

For example, if I have 2 workers, I would have two horizontal bars, each showing read_bytes and write_bytes (where this data comes from ws.metrics.read_bytes and ws.metrics.write_bytes).

For the timeseries chart, I'm not sure if I'm understanding this correctly: we want a timeseries per worker that has the total bandwidth, and this would be the sum of ws.metrics.read_bytes and ws.metrics.write_bytes. Is this correct?
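
As a rough illustration of the per-worker data described above, the sketch below collects read and write values into column-oriented lists, one entry per worker, which is the shape a bar chart data source typically wants. `FakeWorkerState` and `bandwidth_columns` are hypothetical names, and treating `metrics` as a dict with `read_bytes` and `write_bytes` keys is an assumption rather than a statement about the scheduler's real objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWorkerState:
    """Hypothetical stand-in for a scheduler-side worker record."""

    address: str
    metrics: dict = field(default_factory=dict)


def bandwidth_columns(workers):
    """Build column data for a per-worker read/write bar chart."""
    data = {"worker": [], "read_bytes": [], "write_bytes": [], "total": []}
    for ws in workers:
        # Missing metrics are treated as zero traffic (an assumption)
        read = ws.metrics.get("read_bytes", 0)
        write = ws.metrics.get("write_bytes", 0)
        data["worker"].append(ws.address)
        data["read_bytes"].append(read)
        data["write_bytes"].append(write)
        data["total"].append(read + write)
    return data
```

With two workers this yields two entries per column, so each worker gets one read value and one write value, plus their sum if a single combined bar is preferred.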

mrocklin commented 3 years ago

Is the idea to have read and write per worker overlapped with different colors and some alpha?

Naty and I discussed this briefly and decided to go with multi-bar plots like the following:

image

mrocklin commented 3 years ago

Although I think in our situation we should skip the whitespace. It would be good if these bars corresponded with the other horizontal bar charts so that people can compare CPU use against bandwidth.
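
One way to get two bars per worker with no whitespace, while keeping each worker in the same row as in the other horizontal bar charts, is to split each row in half. The sketch below only computes the geometry; `two_bar_layout` is a hypothetical helper, and the assumption that each worker occupies one unit-height row starting at its index is mine, not taken from the existing charts.

```python
def two_bar_layout(n_workers, row_height=1.0):
    """Return bar centers and heights for read/write bars per worker.

    Worker i is assumed to own the vertical span
    [i * row_height, (i + 1) * row_height). The read bar fills the upper
    half and the write bar the lower half, so the two bars touch and
    leave no gap between workers.
    """
    rows = []
    for i in range(n_workers):
        base = i * row_height
        rows.append(
            {
                "worker_index": i,
                "read_y": base + 0.75 * row_height,   # center of upper half
                "write_y": base + 0.25 * row_height,  # center of lower half
                "height": row_height / 2,
            }
        )
    return rows
```

Because each pair of bars exactly fills its worker's row, the rows line up with a chart that draws one full-height bar per worker.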

mrocklin commented 3 years ago

I'm going to reopen this so that we continue to track the timeseries option (which I think would be valuable)

jrbourbeau commented 3 years ago

Whoops, thanks for re-opening