coiled / feedback

A place to provide Coiled feedback

Scheduler GPU requirement #236

Open jacobtomlinson opened 1 year ago

jacobtomlinson commented 1 year ago

Just a heads up that with recent changes in distributed (https://github.com/dask/distributed/pull/7564), a GPU is now mandatory on the scheduler if the client/workers have GPUs.

However, it can be a lesser GPU provided that it has compatible CUDA compute capabilities (RAPIDS needs CCC 6.0+, other libraries may vary). So I can see folks configuring workers with A100s and schedulers with T4s to optimize cost.
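The compatibility rule above boils down to a simple comparison of CUDA compute capability versions. A minimal sketch (the helper name and the hardcoded CC values are illustrative, not part of distributed or RAPIDS):

```python
# Hypothetical helper: check whether a GPU's CUDA compute capability meets a
# library's minimum, e.g. the RAPIDS 6.0+ requirement mentioned above.

def meets_compute_capability(device_cc, minimum_cc=(6, 0)):
    """Return True if device_cc (major, minor) satisfies minimum_cc."""
    return tuple(device_cc) >= tuple(minimum_cc)

# A T4 is CC 7.5 and an A100 is CC 8.0 -- both clear the RAPIDS 6.0 bar,
# which is why a cheaper T4 scheduler can pair with A100 workers.
print(meets_compute_capability((7, 5)))   # T4
print(meets_compute_capability((8, 0)))   # A100
print(meets_compute_capability((3, 7)))   # K80 -- too old for RAPIDS
```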

Currently, the scheduler_gpu kwarg in coiled.Cluster is a boolean, so in theory you could set worker_gpu=1, scheduler_gpu=False, which will break things when trying to use that cluster going forward.

I would suggest that if worker_gpu is set then scheduler_gpu must also be True, so maybe that kwarg should be removed altogether.

It would be nice to add a new argument called scheduler_gpu_type instead so that users could set something like worker_gpu=1, worker_gpu_type="nvidia-tesla-a100", scheduler_gpu_type="nvidia-tesla-t4".
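A sketch of what that validation might look like (coiled.Cluster doesn't do this today; validate_gpu_kwargs and the T4 fallback are hypothetical, and scheduler_gpu_type is the proposed kwarg, not an existing one):

```python
# Hypothetical kwarg validation for the proposal above: a worker GPU
# implies a scheduler GPU, defaulting to a cheap T4 when no type is given.

def validate_gpu_kwargs(worker_gpu=0, scheduler_gpu=None, scheduler_gpu_type=None):
    """Return the scheduler GPU type to use, enforcing the new requirement."""
    if worker_gpu and scheduler_gpu is False:
        raise ValueError(
            "worker_gpu requires a GPU on the scheduler; set "
            "scheduler_gpu_type to a cheaper GPU (e.g. 'nvidia-tesla-t4') "
            "rather than disabling it"
        )
    if worker_gpu and scheduler_gpu_type is None:
        # Fall back to a modest GPU rather than failing at runtime.
        scheduler_gpu_type = "nvidia-tesla-t4"
    return scheduler_gpu_type

print(validate_gpu_kwargs(worker_gpu=1, scheduler_gpu_type="nvidia-tesla-t4"))
print(validate_gpu_kwargs(worker_gpu=1))  # defaults to nvidia-tesla-t4
```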

ntabris commented 1 year ago

Thanks! @fjetter also gave me a heads up about this.

Assuming you're using the cluster in normal ways (not, e.g., using scheduler as a notebook host), is there any reason that T4 wouldn't be good enough? Our scheduler_gpu kwarg always adds T4.

I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.

jacobtomlinson commented 1 year ago

> is there any reason that T4 wouldn't be good enough?

I can only speak for RAPIDS, but a T4 on the scheduler is probably going to be a good bet for the majority of users, so making it the default would be totally reasonable. The H100 and L4 are on the horizon, though, so I expect once those are generally available it would be common to pair a T4 with V100/A100 and an L4 with H100 due to the CUDA compute capability compatibility (what a mouthful).

I'm not sure whether there would be implications with pairing a T4 with an H100. But we can worry about that later.

As you say, if you set jupyter=True you might want the scheduler GPU to also be high-end, but not necessarily. I could imagine folks setting n_workers=0, jupyter=True to get a T4 for some initial lower-performance interactive exploration of subsets of data, then calling cluster.scale(n) when it's time to use the full dataset and really push things.

> I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.

Right now we have this page, which is useful but will need updating with the new stricter requirements. @fjetter, @rjzamora and I are also drafting a blog post, which will go on https://blog.dask.org announcing this change, along with some docs for distributed to go with it. I expect these should be available with the Dask 2023.4.0 release.