Open phobson opened 2 years ago
One big part of this (I think) is that Coiled doesn't support multiple workers on a single VM/instance. See https://github.com/coiled/product/issues/7 for some discussion.
Maybe the GPU use-case bumps up the priority of multi-worker instances (or maybe not).
RAPIDS docs now have some info about partitioning GPUs with MIG (NVIDIA Multi-Instance GPU): https://docs.rapids.ai/deployment/nightly/mig.html
This seems like something we could do, but I'd like more signal that it would be worth the effort (i.e., that there's non-trivial demand for it).
For certain workloads, the optimal cluster will have multiple workers on a single GPU. This currently isn't possible in Coiled.
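To make the request concrete, here is a rough sketch of what "multiple workers on a single GPU" could mean at the process level. It assumes each worker is a separate process and that GPU visibility is controlled with the `CUDA_VISIBLE_DEVICES` environment variable. `worker_gpu_env` is a hypothetical helper, not part of Coiled or Dask; it only computes the environment each worker process would be started with and does not launch anything.

```python
def worker_gpu_env(num_gpus: int, workers_per_gpu: int) -> list[dict[str, str]]:
    """Build one environment mapping per worker process.

    Hypothetical helper for illustration only. Workers assigned to the
    same GPU receive the same CUDA_VISIBLE_DEVICES value, so they share
    that device.
    """
    if num_gpus < 1 or workers_per_gpu < 1:
        raise ValueError("num_gpus and workers_per_gpu must both be >= 1")

    envs = []
    for gpu_index in range(num_gpus):
        # Every worker in this inner loop sees only GPU `gpu_index`.
        for _ in range(workers_per_gpu):
            envs.append({"CUDA_VISIBLE_DEVICES": str(gpu_index)})
    return envs


# Two workers sharing the single GPU on one instance.
envs = worker_gpu_env(num_gpus=1, workers_per_gpu=2)
```

Sharing a device this way gives no memory isolation between workers, so whether it helps depends on the workload.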