NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

MPS with Kubernetes on NVIDIA GPU #443

Open selinnilesy opened 8 months ago

selinnilesy commented 8 months ago

What would be the best way to enable the Multi-Process Service (MPS) on NVIDIA GPUs for Kubernetes pods (for example, a sample YAML file for deploying K8s Pods on a TITAN RTX)?

I am wondering whether NVIDIA has recently added support for this.

Thanks in advance.

tunahanertekin commented 8 months ago

I don't know if the official device plugin will support MPS (the conclusion of this blog says it will), but I was able to test MPS using Nebuly's fork.

selinnilesy commented 8 months ago

After installing the plugin from this fork, the nebuly-nvidia device plugin pod gets stuck in the Init state, with a log message stating the same thing.

For reference, here is a description of my healthy environment (which works fine with the default device plugin, but not with MPS), captured right before the installation: node.txt pods.txt

klueska commented 8 months ago

We do not currently have plans to support it in the traditional device plugin, but we plan on supporting it with the new resource management API called DRA, as shown in this demo:

https://drive.google.com/file/d/1sU1ZhY4zNKBtXAeVx3sozHT0PuDz4-Qf/view?usp=drive_link

selinnilesy commented 8 months ago

Thank you Kevin. I have installed the NVIDIA DRA driver too, but I am afraid you don't support the CRDs for deploying MPS sharing yet.

dominicshanshan commented 7 months ago

@klueska, could you provide a detailed product page or GitHub repository for DRA, or the product roadmap?

prattcmp commented 5 months ago

Would like to see this sooner rather than later. Seems pretty fundamental.

ettelr commented 5 months ago

Yes, we would also like to hear about this MPS-Kubernetes integration ASAP.

elezar commented 5 months ago

This is related to the discussion in #467. We are actively investigating adding MPS support to the plugin.

ettelr commented 5 months ago

@elezar So is the roadmap to add it to the official device plugin, or to support this as part of DRA as @klueska wrote above? (Or will both of them reach a production-ready stage?)

elezar commented 5 months ago

@ettelr we are working on adding support to the device plugin directly. It will also be included as part of DRA once that is released as a production-ready option.

klueska commented 4 months ago

We just released an RC for the next version of the k8s-device-plugin with support for MPS: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.15.0-rc.1?tab=readme-ov-file#with-cuda-mps

We would appreciate it if people could try this out and give us any feedback before the final release in a few weeks.
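For readers trying the RC, here is a minimal sketch of the kind of plugin config the linked "With CUDA MPS" README section describes, pieced together from the snippets shared later in this thread (the replica count here is arbitrary; the README remains the authoritative reference):

```yaml
# Sketch of a device-plugin config that turns on MPS sharing.
# Field names follow the examples quoted later in this thread; check the
# v0.15.0-rc.1 README for the exact, supported schema.
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu   # resource to be shared via MPS
      replicas: 10           # advertise each physical GPU as 10 MPS replicas
```

Each advertised replica then maps to an equal fraction of the GPU's memory and compute, as the excerpt quoted further down in this thread describes.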

jayground8 commented 3 months ago

@klueska I tested MPS support with v0.15.0-rc.2. The release notes say "Explicitly set sharing.mps.failRequestsGreaterThanOne = true". Is it possible to set sharing.mps.failRequestsGreaterThanOne to false? It didn't work when I tried with a config file like the one below.

```yaml
sharing:
  mps:
    failRequestsGreaterThanOne: false
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

If failRequestsGreaterThanOne is set to false, does it mean that a pod can use n fractions of the resources when its GPU request is set to n?

The README also says: "Furthermore, each of these resources -- either nvidia.com/gpu or nvidia.com/gpu.shared -- would have access to the same fraction (1/10) of the total memory and compute resources of the GPU."

Also, is there a way to enable MPS only on specific GPU server nodes and not others?
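As an aside for readers following along, consuming one of these MPS replicas from a pod looks like any other extended-resource request. A rough sketch, assuming the shared replicas keep the nvidia.com/gpu name (i.e. renameByDefault is off); the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-example            # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-workload
    image: nvidia/cuda:12.3.1-base-ubuntu22.04   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1      # one MPS replica, i.e. one 1/4 slice with replicas: 4
```

With failRequestsGreaterThanOne set to true, requests for more than 1 here are failed by the plugin, which is exactly the constraint being questioned above.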

ettelr commented 2 months ago

> @klueska I tested MPS support with v0.15.0-rc.2. The release notes say "Explicitly set sharing.mps.failRequestsGreaterThanOne = true". Is it possible to set sharing.mps.failRequestsGreaterThanOne to false? It didn't work when I tried with a config file like the one below.
>
> ```yaml
> sharing:
>   mps:
>     failRequestsGreaterThanOne: false
>     resources:
>     - name: nvidia.com/gpu
>       replicas: 4
> ```
>
> If failRequestsGreaterThanOne is set to false, does it mean that a pod can use n fractions of the resources when its GPU request is set to n?
>
> The README also says: "Furthermore, each of these resources -- either nvidia.com/gpu or nvidia.com/gpu.shared -- would have access to the same fraction (1/10) of the total memory and compute resources of the GPU."
>
> Also, is there a way to enable MPS only on specific GPU server nodes and not others?

I am also interested in this one. Once I set the device plugin sharing config, do I have a way to request a different amount of memory for each pod? Is it by calculating the number of shared replicas I need and specifying more than one in the requests? Any other idea on how to achieve this?

anxolerd commented 2 months ago

I am using the stable version 0.15.0 and I am also interested in MPS with failRequestsGreaterThanOne=false, so that I can allocate different amounts of GPU memory to different processes.

ettelr commented 2 months ago

Alternatively, enabling requests for more than one shared device, or letting users set the slice size and having the device plugin create it dynamically, would be super useful. Setting a static number of equal slices in the configuration makes this not scalable enough. Is there anything like this on the roadmap?

klueska commented 2 months ago

We plan to include support for MIG devices in the next release, but we do not have immediate plans to relax the failRequestsGreaterThanOne=true constraint.

The challenge is that the device plugin is not the one that does the actual allocation of devices to the pod (the kubelet does), meaning there is no way to ensure that if you are given 2 MPS "replicas" they come from the same underlying GPU.

We thought about leveraging the GetPreferredAllocation() call to support this, but then you run into issues with fragmentation. For example, you request 2 replicas, but the only two replicas available are from separate GPUs. The scheduler won't recognize that and will happily schedule your pod to the node. Once on the node though, the allocation will fail and your pod will be stuck.

All of that said, it would be possible to make this work in cases where you only have a single GPU on the node, and we may consider doing that in the next release.

However, our focus is really on providing comprehensive support for MPS using DRA: https://youtu.be/1QfShSQLsbs?si=LJ_f8UfRjiqcl2Pd&t=1080

Adding MPS support to the existing device plugin is really just a stop gap until DRA is stable / usable.
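To make the fragmentation problem above concrete, here is a hypothetical node state (an illustration only, not real kubectl output):

```yaml
# Hypothetical illustration of the fragmentation issue described above.
# The scheduler only sees the aggregate count of free replicas on the node,
# not which physical GPU each replica belongs to.
allocatable:
  nvidia.com/gpu: 2    # two free MPS replicas in total...
# ...but split across two physical GPUs:
#   GPU 0: 1 free replica
#   GPU 1: 1 free replica
# A pod requesting nvidia.com/gpu: 2 would still be scheduled to this node,
# yet no single GPU can supply both replicas, so allocation fails on the node
# and the pod gets stuck.
```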

ettelr commented 2 months ago

> We plan to include support for MIG devices in the next release, but we do not have immediate plans to relax the failRequestsGreaterThanOne=true constraint.
>
> The challenge is that the device plugin is not the one that does the actual allocation of devices to the pod (the kubelet does), meaning there is no way to ensure that if you are given 2 MPS "replicas" they come from the same underlying GPU.
>
> We thought about leveraging the GetPreferredAllocation() call to support this, but then you run into issues with fragmentation. For example, you request 2 replicas, but the only two replicas available are from separate GPUs. The scheduler won't recognize that and will happily schedule your pod to the node. Once on the node though, the allocation will fail and your pod will be stuck.
>
> All of that said, it would be possible to make this work in cases where you only have a single GPU on the node, and we may consider doing that in the next release.
>
> However, our focus is really on providing comprehensive support for MPS using DRA: https://youtu.be/1QfShSQLsbs?si=LJ_f8UfRjiqcl2Pd&t=1080
>
> Adding MPS support to the existing device plugin is really just a stop gap until DRA is stable / usable.

And what about creating the slices dynamically, with different sizes according to the pod's requests? Some convention like requesting 1/2 or 1/4 of a GPU.

elezar commented 2 months ago

> And what about creating the slices dynamically, with different sizes according to the pod's requests? Some convention like requesting 1/2 or 1/4 of a GPU.

Extended resources (as exposed by the Device Plugin API) only represent countable units and don't allow fractional allocations. With that said, we have discussed more flexible partitioning under MPS, but this would not be dynamic in the case of the device plugin.
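Concretely, the integer-only nature of extended resources means a pod can ask for a whole number of advertised replicas but never a fraction of one (a sketch; resource names as used earlier in this thread):

```yaml
# Extended-resource requests must be whole numbers.
resources:
  limits:
    nvidia.com/gpu.shared: 1     # valid: one advertised replica
    # nvidia.com/gpu.shared: 0.5 # invalid: fractional extended resources are rejected by the API
```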

ettelr commented 2 months ago

> > And what about creating the slices dynamically, with different sizes according to the pod's requests? Some convention like requesting 1/2 or 1/4 of a GPU.
>
> Extended resources (as exposed by the Device Plugin API) only represent countable units and don't allow fractional allocations. With that said, we have discussed more flexible partitioning under MPS, but this would not be dynamic in the case of the device plugin.

GPU memory is countable; we just want to be able to request a different amount of GPU memory for each workload, since we have a big system with a lot of GPUs and different types of workloads requesting different amounts of GPU memory.

klueska commented 2 months ago

I'm open to suggestions on how to support this, but I don't know how to do it (generally) with the existing device-plugin API and the default kubernetes scheduler. The scheduler simply doesn't have enough information to know what portion of the capacity on each node comes from the same underlying GPU (vs. multiple GPUs).

One thing we've considered is introducing a config option that would advertise each individual GPU as its own resource (i.e. nvidia.com/gpu-0, nvidia.com/gpu-1, etc.). With something like that in place, we'd be able to provide what you are looking for, but it would put an extra burden on the user to have to request a specific GPU index rather than a generic nvidia.com/gpu resource.
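To make that idea concrete, here is a purely hypothetical sketch of what a per-GPU resource layout could look like; none of these resource names or fields are supported today, this only illustrates the trade-off @klueska describes:

```yaml
# HYPOTHETICAL ONLY: illustrates the "one resource name per GPU index" idea
# discussed above; this is NOT a supported device-plugin configuration.
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu-0   # GPU 0 advertised under its own resource name
      replicas: 8
    - name: nvidia.com/gpu-1   # GPU 1 advertised under its own resource name
      replicas: 8
```

Because each GPU is its own resource, requesting, say, nvidia.com/gpu-0: 2 would guarantee that both replicas come from the same physical GPU, but the user now has to pick a GPU index instead of making a generic nvidia.com/gpu request, which is the extra burden mentioned above.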

ettelr commented 2 months ago

> I'm open to suggestions on how to support this, but I don't know how to do it (generally) with the existing device-plugin API and the default kubernetes scheduler. The scheduler simply doesn't have enough information to know what portion of the capacity on each node comes from the same underlying GPU (vs. multiple GPUs).
>
> One thing we've considered is introducing a config option that would advertise each individual GPU as its own resource (i.e. nvidia.com/gpu-0, nvidia.com/gpu-1, etc.). With something like that in place, we'd be able to provide what you are looking for, but it would put an extra burden on the user to have to request a specific GPU index rather than a generic nvidia.com/gpu resource.

To give an example: if I have 2 nodes, each with 2 x 80GB GPUs, then instead of a fixed number of gpu.shared, each node could advertise in its capacity:
- nvidia.com/gpu-10gb: 8
- nvidia.com/gpu-20gb: 4
- nvidia.com/gpu-40gb: 2
- nvidia.com/gpu-80gb: 1

The user could pre-configure all the options they want, and of course the overlapping counts would need to be taken into consideration somehow. Is this feasible?

klueska commented 2 months ago

You can't have overlapping resources using the traditional device plugin. It's the whole reason that MIG requires static partitioning, for example.

ettelr commented 2 months ago

> You can't have overlapping resources using the traditional device plugin. It's the whole reason that MIG requires static partitioning, for example.

OK, we could start from static partitioning here as well: just being able to preconfigure a few different sizes for MPS, until you have a better idea.

ettelr commented 1 month ago

What about using a whole GPU side by side with the shared ones? Assuming I turn on renameByDefault, is there a future option where we can use whole GPUs while MPS sharing is on?

anxolerd commented 1 month ago

@klueska, static partitioning like with MIG is good enough for a start. It would be great if you could implement that.

It would also be nice if we had the possibility to use real (whole) cards alongside shared ones on the same machine. A static configuration would also be a good enough start for that.