NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Dedicated GPUs for time-slicing on multi-GPU setups #628

Open · joe-schwartz-certara opened this issue 6 months ago

joe-schwartz-certara commented 6 months ago

I'm wondering if there is a simple way to configure a single GPU on a multi-GPU system to be dedicated to time-slicing. For example, my use case is that I have some services that are critical and some that are not, and I want to use time-slicing for the non-critical services while leaving dedicated GPUs for the critical ones.

It seems like this plugin is close to allowing that, and I was expecting something like:

    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
          - name: nvidia.com/gpu
            devices: ["0"]
            replicas: 10

to select ten time-slicing replicas for the 0-th GPU as indexed via nvidia-smi. Then I would request resources for pods via either nvidia.com/gpu.shared (for non-dedicated usage of the 0-th GPU) or nvidia.com/gpu (for dedicated GPU usage). Is this kind of fine-grained control planned for the future, or is there something simple I can do to route only some of the hardware through the sharing part of the plugin?
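For reference, since `renameByDefault: true` advertises the replicated devices under a `.shared` suffix, the split I'm after would look roughly like this on the pod side (a sketch, assuming a `devices` filter like the one above were actually honored):

    # Non-critical service: lands on one of the ten time-sliced replicas of GPU 0
    resources:
      limits:
        nvidia.com/gpu.shared: 1

    # Critical service: gets one of the remaining dedicated GPUs
    resources:
      limits:
        nvidia.com/gpu: 1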

frittentheke commented 3 months ago

I have the exact same question. Looking at the code that implements the timeSlicing (https://github.com/NVIDIA/k8s-device-plugin/commit/a7c5dcf6091495da2b83d4dd5a6125b620f04d3f), it's possible to define devices via GPU index, GPU UUID, or even MIG UUID.

But apparently device selection is currently "disabled" via https://github.com/NVIDIA/k8s-device-plugin/blob/35ad18080eded1889dc1eaee1132debddfd6757c/api/config/v1/config.go#L89

This restriction has been there from the beginning; see https://github.com/NVIDIA/k8s-device-plugin/blame/b9fe486d8b7c581e1b144ea31f0d6f6173668601/cmd/gpu-feature-discovery/main.go#L276, from when the code was copied over from https://github.com/NVIDIA/gpu-feature-discovery/blob/152fa93619e973043d936f19bf20bb465c1ab289/cmd/gpu-feature-discovery/main.go#L276

@elezar @ArangoGutierrez @tariq1890 since you contributed this code, may I kindly ask you to elaborate on whether adding the capability to do timeSlicing / create replicas for only a subset of GPUs or MIG instances is planned?

I myself would love to partition all my GPUs via MIG, but enable timeSlicing only on the MIG instances of the first two. Not being able to filter whole GPUs is even worse, since it forces all GPUs in a machine to either do time-slicing or not (short of node-specific configs).
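To make it concrete, what I would like to be able to express is something along these lines (a sketch only; the `devices` filter is exactly the part the config validation currently rejects, and the MIG profile and replica count are just examples):

    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
          # Replicate only the MIG slices that live on the first two GPUs;
          # MIG instances on all other GPUs would stay dedicated.
          - name: nvidia.com/mig-1g.10gb
            devices: ["0", "1"]
            replicas: 4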

joe-schwartz-certara commented 2 months ago

@frittentheke I am still bouncing around ideas on how to do the kind of fine-grained GPU access control that you and I both need. I discovered that you can override the environment variable assignment from the plugin by just setting:

        - name: NVIDIA_VISIBLE_DEVICES
          value: <comma-separated list of the exact GPU UUIDs you want the pod to use>

And if you use the same UUID(s) in two different pod specs, the applications will share the selected GPU with no problems. The lack of problems with this oversubscription approach, without any time-slicing, is probably due to the nature of the applications I'm running (they both claim all the VRAM they will need as soon as they start up), but I'm still worried that this deployment strategy has unknown issues, since I'm basically ignoring the plugin entirely.
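To spell it out, the "workaround" is really just pinning two workloads to the same physical device by hand, something like the following in the two pod specs (the UUID is a placeholder you would take from `nvidia-smi -L`, and neither pod requests `nvidia.com/gpu`, which is exactly why the plugin is bypassed; it also assumes the NVIDIA container runtime is the node's default so the variable is honored at all):

    # First pod spec
    containers:
      - name: service-a
        image: my-inference-image   # placeholder
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: GPU-11111111-2222-3333-4444-555555555555

    # Second pod spec, pointing at the same UUID
    containers:
      - name: service-b
        image: my-other-inference-image   # placeholder
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: GPU-11111111-2222-3333-4444-555555555555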

As has been mentioned before, https://docs.google.com/document/d/1BNWqgx_SmZDi-va_V31v3DnuVwYnF2EmN7D-O_fB6Oo/edit#heading=h.bxuci8gx6hna describes a feature that does exactly what we want. But we have to wait...

I will also comment that another, hacky workaround is to use the "whole"-GPU MIG partition (e.g. on an 80GB A100, nvidia.com/mig-7g.80gb is the whole GPU), put only some of the node's GPUs into that "whole" partition, and then select only the MIG-partitioned resource for time-slicing. I still foresee problems if you need even finer control, e.g. where you have applications a, b, c, and d, and a+b can share a GPU, as can c+d, but c+a cannot (a scenario where a and c have large GPU requirements but b and d are small). The way Kubernetes scheduling works, you cannot ensure that your resources get allocated as a+b and c+d instead of some other combination.
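For anyone who wants to try that route, the config side of the workaround is just time-slicing the whole-GPU MIG resource instead of `nvidia.com/gpu` (a sketch; the A100-80GB profile and replica count are only examples):

    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
          # Only the GPUs that were put into MIG mode with the 7g.80gb profile
          # expose this resource; the non-MIG GPUs keep being scheduled as
          # dedicated nvidia.com/gpu devices.
          - name: nvidia.com/mig-7g.80gb
            replicas: 10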

frittentheke commented 2 months ago

I suppose the relatively new Dynamic Resource Allocation (https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) is what will eventually solve this issue of letting workloads claim dedicated resources. NVIDIA is apparently working on a DRA driver for their GPUs: https://github.com/NVIDIA/k8s-dra-driver
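For the general shape of it, a workload would claim a GPU via a ResourceClaim instead of an extended resource, roughly like this (a rough sketch only: DRA is still alpha, and the API group/version and the `gpu.nvidia.com` class name are assumptions that have been changing between Kubernetes releases and driver versions):

    apiVersion: resource.k8s.io/v1alpha2
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        resourceClassName: gpu.nvidia.com
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: dra-test
    spec:
      resourceClaims:
        - name: gpu
          source:
            resourceClaimTemplateName: single-gpu
      containers:
        - name: app
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
          command: ["nvidia-smi", "-L"]
          resources:
            claims:
              - name: gpu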