Open jacobtomlinson opened 1 year ago
Thanks! @fjetter also gave me a heads up about this.
Assuming you're using the cluster in normal ways (not, e.g., using the scheduler as a notebook host), is there any reason a T4 wouldn't be good enough? Our `scheduler_gpu` kwarg always adds a T4.
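For reference, something like this (keyword names as discussed in this thread; the worker count is just illustrative):

```python
import coiled

# scheduler_gpu=True attaches the default scheduler GPU (currently always
# a T4), independent of what GPU type the workers get.
cluster = coiled.Cluster(
    n_workers=4,
    worker_gpu=1,        # one GPU per worker
    scheduler_gpu=True,  # adds a T4 to the scheduler
)
```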
I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.
> is there any reason a T4 wouldn't be good enough?
I can only speak for RAPIDS, but a T4 on the scheduler is probably going to be a good bet for the majority of users, so making it the default would be totally reasonable. The H100 and L4 are on the horizon, though, so once those are generally available I expect it would be common to pair a T4 scheduler with V100/A100 workers and an L4 scheduler with H100 workers, due to CUDA compute capability compatibility (what a mouthful).

I'm not sure whether there would be implications to pairing a T4 with an H100, but we can worry about that later.
As you say, if you set `jupyter=True` you might want the scheduler GPU to also be high-end, but not necessarily. I could imagine folks setting `n_workers=0, jupyter=True` to get a T4 for some initial lower-performance interactive exploration of subsets of data, then calling `cluster.scale(n)` when it's time to use the full dataset and really push things.
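So a hypothetical workflow like this (same keywords as above; the scale target is just illustrative):

```python
import coiled

# Start with no workers and explore interactively on the scheduler's T4
# from the hosted Jupyter server.
cluster = coiled.Cluster(n_workers=0, jupyter=True, scheduler_gpu=True)

# Later, when it's time to run on the full dataset, add GPU workers.
cluster.scale(20)
```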
> I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.
Right now we have this page, which is useful but will need updating with the new harder requirements. @fjetter, @rjzamora and I are also drafting a blog post announcing this change, which will go on https://blog.dask.org, along with some docs for distributed to go with it. I expect these should be available with the Dask 2023.4.0 release.
Just a heads up that with recent changes in distributed (https://github.com/dask/distributed/pull/7564) a GPU is now mandatory on the scheduler if the client/workers have GPUs.
However, it can be a lesser GPU, provided it has a compatible CUDA compute capability (RAPIDS needs CCC 6.0+; other libraries may vary). So I can see folks configuring workers with A100s and schedulers with T4s to optimize cost.
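If it helps anyone verify compatibility, here's a rough sketch of checking the compute capability on the scheduler and each worker (assumes numba is installed everywhere; the scheduler address is a placeholder):

```python
from numba import cuda
from distributed import Client

def compute_capability():
    # Returns e.g. (7, 5) for a T4 or (8, 0) for an A100
    return cuda.get_current_device().compute_capability

client = Client("tcp://scheduler-address:8786")  # placeholder address
print("scheduler:", client.run_on_scheduler(compute_capability))
print("workers:  ", client.run(compute_capability))  # dict keyed by worker
```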
Currently, the `scheduler_gpu` kwarg in `coiled.Cluster` is a boolean, and in theory you could set `worker_gpu=1, scheduler_gpu=False`, which will break things when trying to use that cluster going forwards. I would suggest that if `worker_gpu` is set then `scheduler_gpu` must be set to `True`, so maybe that kwarg should be removed altogether. It would be nice to add a new argument called `scheduler_gpu_type` instead, so that users could set something like `worker_gpu=1, worker_gpu_type="nvidia-tesla-a100", scheduler_gpu_type="nvidia-tesla-t4"`.
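That is, the proposed usage would look something like this (`scheduler_gpu_type` doesn't exist yet; this is just the suggested interface):

```python
import coiled

# Proposed: big GPUs on the workers, a cheaper but CCC-compatible GPU
# on the scheduler.
cluster = coiled.Cluster(
    worker_gpu=1,
    worker_gpu_type="nvidia-tesla-a100",
    scheduler_gpu_type="nvidia-tesla-t4",  # proposed kwarg, not yet implemented
)
```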