NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Support sharing GPUs #169

Closed: ktarplee closed this issue 2 months ago

ktarplee commented 4 years ago

It would be useful to allow containers/pods to share GPUs (similar to a shared workstation) when desired.

I have a fork of this device plugin that implements the above functionality. One has to label nodes as either exclusive access or shared access to the GPUs. For shared access you must specify the number of replicas of the GPUs to create. For example, if you have 4 physical GPUs on a node and want to allow each GPU to be allocated twice, one would set the replicas to 2 so there are effectively 8 GPUs for k8s to schedule.
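
To make the replication idea concrete, here is a minimal sketch in Go against the kubelet device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1); the `<uuid>::<n>` ID scheme and the function name are hypothetical illustrations, not code from the fork:

```go
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// replicateDevices advertises each physical GPU UUID `replicas` times, so
// 4 physical GPUs with replicas=2 appear as 8 schedulable devices in k8s.
// The "<uuid>::<n>" ID scheme is a hypothetical way to keep IDs unique.
func replicateDevices(uuids []string, replicas int) []*pluginapi.Device {
	var devs []*pluginapi.Device
	for _, uuid := range uuids {
		for n := 0; n < replicas; n++ {
			devs = append(devs, &pluginapi.Device{
				ID:     fmt.Sprintf("%s::%d", uuid, n),
				Health: pluginapi.Healthy,
			})
		}
	}
	return devs
}

func main() {
	devs := replicateDevices([]string{"GPU-aaaa", "GPU-bbbb", "GPU-cccc", "GPU-dddd"}, 2)
	fmt.Println(len(devs)) // 8 devices advertised for 4 physical GPUs
}
```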

Is this something that would be of interest to this project as a pull request?

nvjmayo commented 4 years ago

I think you'd want some quotas and maybe QoS to use this feature in a production environment. Otherwise it seems like a poorly behaved application could hog GPU memory and deny other pods; affected pods would likely start up but fail to run to completion.

Without support from the graphics driver for sharing gracefully, I'm a bit worried that the potential limitations won't be immediately obvious to end users, and that the real use case is pretty narrow. (Maybe I'm wrong and you've dealt with quotas or have GPU memory pools?)

I'd love to take a look at your work or notes, or discuss it on the Kubernetes Slack.

ktarplee commented 4 years ago

It does not implement QoS or memory pools. It is more akin to how users use a shared server with GPUs, but it does limit the oversubscription. So there is a risk that you will clobber someone else; however, that is unlikely when users are only sporadically using the GPU (statistical multiplexing), for example when multiple pods share a GPU for model serving with infrequent requests. Besides, users opt in to the shared GPUs by requesting nvidia.com/sharedgpu: 1 (instead of nvidia.com/gpu, where they would get exclusive access).

When the number of replicas for the GPUs is set to 1, the approach is equivalent to what you have currently (i.e. exclusive access). Supporting both nvidia.com/gpu (exclusive) and nvidia.com/sharedgpu (shared) on the same cluster is what I am doing right now.
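
The other half of that scheme, sketched under the same hypothetical `<uuid>::<n>` ID convention: at allocation time the replica suffix is stripped so the container is pointed at the underlying physical GPU, and with replicas set to 1 the mapping is the identity, which is why it degenerates to today's exclusive behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// physicalUUID maps an allocated replica ID back to the physical GPU it
// shares by stripping the hypothetical "::<n>" suffix.
func physicalUUID(deviceID string) string {
	if i := strings.Index(deviceID, "::"); i >= 0 {
		return deviceID[:i]
	}
	return deviceID // no suffix: the exclusive (replicas=1) case
}

func main() {
	fmt.Println(physicalUUID("GPU-aaaa::1")) // GPU-aaaa (shared replica)
	fmt.Println(physicalUUID("GPU-bbbb"))    // GPU-bbbb (exclusive)
}
```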

klueska commented 4 years ago

Is this related to: https://github.com/awslabs/aws-virtual-gpu-device-plugin

zw0610 commented 4 years ago

The tech blog from Nvidia featuring MIG in Ampere mentioned:

a new resource type in Kubernetes via the NVIDIA Device Plugin

Could someone tell me what the new resource type is and where the related code is?

klueska commented 4 years ago

Could someone tell me what the new resource type is and where the related code is?

This is unrelated to the current issue (as the current issue relates to GPU sharing on pre A100 GPUs).

However, we (NVIDIA) will be releasing details of MIG support on Kubernetes soon. We have a POC of K8s working with MIG, but we want to involve the community for feedback before settling on a final design. I will share some documents (as well as code for the POC) on this early next week.

abuccts commented 4 years ago

However, we (NVIDIA) will be releasing details of MIG support on Kubernetes soon. We have a POC of K8s working with MIG, but we want to involve the community for feedback before settling on a final design. I will share some documents (as well as code for the POC) on this early next week.

Hi @klueska, any update on this?

We have a customized k8s scheduler extender for topology-aware GPU scheduling and are curious about the new interface/design for k8s MIG on A100. The current device plugin sets an environment variable for the NVIDIA container runtime, which is what actually mounts the GPU, so I assume there will need to be some changes to support MIG on k8s?
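
For reference, the environment-variable handoff mentioned above looks roughly like this in device plugin terms: the Allocate response carries NVIDIA_VISIBLE_DEVICES, and the NVIDIA container runtime uses that variable to inject the corresponding GPUs into the container. A simplified sketch, not the plugin's actual code:

```go
package main

import (
	"fmt"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildAllocateResponse returns the allocated GPU IDs to the kubelet as the
// NVIDIA_VISIBLE_DEVICES environment variable; the NVIDIA container runtime
// then makes those GPUs visible inside the container.
func buildAllocateResponse(req *pluginapi.AllocateRequest) *pluginapi.AllocateResponse {
	resp := &pluginapi.AllocateResponse{}
	for _, creq := range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(creq.DevicesIDs, ","),
			},
		})
	}
	return resp
}

func main() {
	req := &pluginapi.AllocateRequest{
		ContainerRequests: []*pluginapi.ContainerAllocateRequest{
			{DevicesIDs: []string{"GPU-aaaa", "GPU-bbbb"}},
		},
	}
	resp := buildAllocateResponse(req)
	fmt.Println(resp.ContainerResponses[0].Envs["NVIDIA_VISIBLE_DEVICES"]) // GPU-aaaa,GPU-bbbb
}
```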

klueska commented 4 years ago

We were planning on waiting until the CUDA 11 release came out to share these documents (because nothing is actually runnable for MIG without CUDA 11). However, we decided to make them public early so that people can get a head start on looking at them and giving feedback.

Here they are:

- Supporting Multi-Instance GPUs (MIG) in the NVIDIA Container Toolkit
- Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes
- Supporting Multi-Instance GPUs (MIG) in Kubernetes (Proof of Concept)
- Steps to Enable MIG Support in Kubernetes (Proof of Concept)

Any and all feedback is welcome.

ktarplee commented 4 years ago

I finally got some time to finish reading @klueska's plan in the links above. The plan is to support only A100 GPUs with CUDA 11, using pre-determined fixed slices (memory and compute) of GPUs. The different sized slices would be allocated as different resource types in k8s. This sounds like a good plan for fixed slices without sharing memory and compute.

The modifications I made to the nvidia device plugin solve a slightly different problem. There are cases when you do not need guaranteed GPU availability. For example, when you are interactively experimenting in ML (think Jupyter Hub) you actually want access to all the GPU memory and all the compute resources but you might only need it for a few seconds or a few minutes and then you free them up (stop computing and free the memory). Another user on the same system can kick off their test and all is good. We can even have multiple users running their tests at the same time so long as they do not exceed the total amount of memory on the GPU.

I think there are strong use cases for both and in fact both solutions can co-exist with each other. For example you can have a parameter to create replicas of the GPU slices to allow sharing at the GPU slice level if desired.

klueska commented 4 years ago

A beta release of the plugin containing MIG support has now been released: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.7.0-rc.1

As part of this, we added support for deploying the plugin via helm, including the ability to set the MIG strategy you wish to use in your deployment. Details about the various MIG strategies and how they work can be found in the document Supporting Multi-Instance GPUs (MIG) in Kubernetes.

A beta version of MIG Support for gpu-feature-discovery should be available very soon.

klueska commented 4 years ago

@ktarplee I agree that what you propose could be useful in a testing environment or in an environment where users know exactly what they are getting into when requesting access to a sharedgpu. With the recent reorganization of the plugin and the ability to deploy it via helm charts, I am actually more open to adding specialized flags such as this now. Could you write up your proposal in more detail and/or point me at your existing fork where you have this implemented already?

ktarplee commented 4 years ago

@klueska The patch was approved for public release by the U.S. Air Force, so we should be able to release it soon (by attaching it to this issue). Would you prefer I make a pull request targeting the branch (v0.7.0-rc.1) or master?

zvonkok commented 4 years ago

@klueska @ktarplee I don't know if you are aware of this, but I wanted to share something Alibaba has implemented, in case we go down this route:

https://www.alibabacloud.com/blog/gpu-sharing-scheduler-extender-now-supports-fine-grained-kubernetes-clusters_594926

https://github.com/AliyunContainerService/gpushare-device-plugin

https://github.com/AliyunContainerService/gpushare-scheduler-extender

ktarplee commented 4 years ago

Attached is the patch (to be applied to master) that was approved for public release. It was developed by ACT3.

nvidia.diff

ktarplee commented 4 years ago

@zvonkok Thanks for sharing the Alibaba approach. After reading the docs, it seems your approach requires extending the scheduler and a custom device plugin. Comparing your approach to the MIG approach, it seems yours only schedules and limits GPU memory, but not GPU CUDA cores. Is that correct? MIG partitions GPU memory and CUDA cores into chunks determined at deployment time.

@zvonkok Does your approach allow a GPU to be partitioned arbitrarily at scheduling time (not deployment time)? It appears that the ALIYUN_COM_GPU_MEM_POD env var (passed to the container) is used to set the GPU memory. I presume that is ignored by the standard NVIDIA runtime (via nvidia-docker2). Do you require a custom NVIDIA runtime for your GPU memory limit to be enforced?

ktarplee commented 4 years ago

Just FYI, I recently became aware of a similar approach to sharing GPUs (developed independently of my implementation).

RDarrylR commented 4 years ago

I finally got some time to finish reading @klueska's plan in the links above. The plan is to support only A100 GPUs with CUDA 11, using pre-determined fixed slices (memory and compute) of GPUs. The different sized slices would be allocated as different resource types in k8s. This sounds like a good plan for fixed slices without sharing memory and compute.

The modifications I made to the nvidia device plugin solve a slightly different problem. There are cases when you do not need guaranteed GPU availability. For example, when you are interactively experimenting in ML (think Jupyter Hub) you actually want access to all the GPU memory and all the compute resources but you might only need it for a few seconds or a few minutes and then you free them up (stop computing and free the memory). Another user on the same system can kick off their test and all is good. We can even have multiple users running their tests at the same time so long as they do not exceed the total amount of memory on the GPU.

I think there are strong use cases for both and in fact both solutions can co-exist with each other. For example you can have a parameter to create replicas of the GPU slices to allow sharing at the GPU slice level if desired.

This is what we are looking for as well. I just saw a talk at KubeCon Europe by Samed Güner from SAP about an attempt at this. It involved over-advertising the GPU as well. He also mentioned the following work in this area:

https://github.com/Deepomatic/shared-gpu-nvidia-k8s-device-plugin
https://github.com/tkestack/gpu-manager
https://github.com/NTHU-LSALAB/KubeShare

In our case we will not have the budget to buy A100s or newer (the only cards where MIG will be supported, I believe) and need a solution for older cards (like the Tesla T4s that we have).

optimuspaul commented 3 years ago

We also don't have the budget for A100s or even any Teslas... we have a bunch of GTX and RTX cards that need to be shared on our cluster.

pen-pal commented 3 years ago

Hi, any update on this?

ktarplee commented 3 years ago

@M-A-N-I-S-H-K At ACT3 we have a private fork of this project that adds GPU sharing. We just updated it to also support MIG (by pulling in the upstream changes from this project). It can now share whole GPUs or parts of GPUs (MIG slices of a GPU) with up to a maximum number of pods (the replication factor). It can also rename the devices. For example, we use nvidia.com/gpu for whole GPUs and nvidia.com/sharedgpu for shared GPUs. In the case of MIG you get resource names such as nvidia.com/mig-3g.20gb. In some cases you want the MIG name to be mapped to something else, such as nvidia.com/gpu-small, and you can then also map nodes with, say, K80s to the same extended resource, nvidia.com/gpu-small.
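
A minimal sketch of the renaming idea, assuming a hypothetical per-node mapping (this is not the fork's actual configuration format): discovered resource names are looked up in a table before being advertised, so a MIG profile and an older card can both surface as the same extended resource.

```go
package main

import "fmt"

// renameResource maps a discovered resource name to the name that should be
// advertised to Kubernetes. Unknown names pass through unchanged.
func renameResource(mapping map[string]string, discovered string) string {
	if renamed, ok := mapping[discovered]; ok {
		return renamed
	}
	return discovered
}

func main() {
	// Hypothetical per-node mapping, for illustration only.
	mapping := map[string]string{
		"nvidia.com/mig-3g.20gb": "nvidia.com/gpu-small",
		"nvidia.com/gpu":         "nvidia.com/gpu-small", // e.g. a K80 node
	}
	fmt.Println(renameResource(mapping, "nvidia.com/mig-3g.20gb")) // nvidia.com/gpu-small
}
```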

We intend to publicly release that code soon (it has already been approved for public release in patch form above, nvidia.diff). We are currently adding an allocation policy so that the least-used raw device is allocated to a new pod instead of just a random one. This will help avoid sharing until it is necessary (i.e., until there are more pods requesting GPUs than there are physical GPUs).
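
A rough sketch of such a least-used policy, reusing the hypothetical "<uuid>::<n>" replica IDs from earlier in the thread: among the candidate replicas, prefer one whose physical GPU currently has the fewest active allocations, so sharing only happens once every physical GPU is in use.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// leastUsedReplica picks, from the candidate replica IDs offered for a pod,
// the one whose underlying physical GPU has the fewest active allocations.
func leastUsedReplica(candidates []string, activeAllocations map[string]int) string {
	best, bestCount := "", math.MaxInt
	for _, id := range candidates {
		uuid := id
		if i := strings.Index(id, "::"); i >= 0 {
			uuid = id[:i] // strip the hypothetical replica suffix
		}
		if n := activeAllocations[uuid]; n < bestCount {
			best, bestCount = id, n
		}
	}
	return best
}

func main() {
	active := map[string]int{"GPU-aaaa": 1, "GPU-bbbb": 0}
	fmt.Println(leastUsedReplica([]string{"GPU-aaaa::0", "GPU-aaaa::1", "GPU-bbbb::0"}, active)) // GPU-bbbb::0
}
```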

I should also mention that I am happy to make a pull request from our work to get it back into this project.

ktarplee commented 3 years ago

Recently I also became aware of another possible way to share GPUs by literally replicating the devices with symlinks in the /dev directory. Here is the link. I have not tried this yet.

bryanjonas commented 3 years ago

Recently I also became aware of another possible way to share GPUs by literally replicating the devices with symlinks in the /dev directory. Here is the link. I have not tried this yet.

I had no luck with this tactic when using the official NVIDIA k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin). Obviously I had to modify the gpu-sharing-daemonset.yaml file to suit my bare-metal installation. I can see the 16 devices created on the GPU node, but the k8s-device-plugin must recognize the GPUs in a different way, because only one shows up in kubectl.
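
That outcome is consistent with how the official plugin enumerates GPUs: it queries NVML rather than scanning /dev, so extra symlinked device nodes do not change the advertised count. A quick standalone check of what NVML reports, using the NVIDIA/go-nvml bindings (not part of the plugin itself):

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// Enumerate GPUs the way NVML-based tooling sees them; symlinked
	// /dev entries do not show up as extra devices here.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	fmt.Printf("NVML reports %d physical GPU(s)\n", count)
}
```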

ktarplee commented 3 years ago

I just submitted pull request #239 to add GPU sharing. We have been using this approach without issues for 9+ months at my organization. We have some nodes that share GPUs (this does not require A100 GPUs) and others that only allow exclusive access to GPUs (for heavy GPU workloads).

ktarplee commented 3 years ago

I just submitted the GitLab MR for this feature.

rjanovski commented 3 years ago

Hi @ktarplee

It looks like NVIDIA will not take ownership of this code.

I was thinking maybe it could be reworked as an independent device plugin on top of the official one.

I.e., after installing the official k8s-device-plugin, we then install the shared-device-plugin, which can be just a thin wrapper over NVIDIA's official code. Pods still request nvidia.com/sharedgpu, and the implementation just delegates to the official plugin.

Do you think it can be done?

ktarplee commented 3 years ago

@rjanovski A few weeks ago we realized that we could do exactly what you described, essentially the Kubernetes sidecar/adapter pattern. We do plan on implementing this approach shortly. The benefits are:

rjanovski commented 3 years ago

Cool! So a sidecar was a better fit than the device-plugin framework? Simpler to use? A generic solution seems fine, although I'd settle for just NVIDIA GPUs first :) Let me know when it's ready, maybe I can help test.

eyalhir74 commented 3 years ago

@ktarplee @rjanovski - probably a naive/beginner question. What if I want each GPU to count as two? I don't care about scheduling/load balancing etc. I know what I'm doing (CUDA/GPU wise) and just want the plugin to see each GPU as two GPUs. Is there a simple solution for that? I was under the impression that I could just play with the .go files here and manage to do this, however I still couldn't get two pods to run on the same GPU. Any ideas/suggestions?

Thanks!

rjanovski commented 3 years ago

@eyalhir74 This vGPU plugin seemed the most promising in my testing; just make sure to configure it with virtual GPU memory as well to avoid memory issues. Scheduling, however, is not well supported: it may schedule 2 pods on a single GPU even if vacant GPUs are available. To overcome this you may need to also provide a custom scheduler to Kubernetes (see the Aliyun solution for an example).

ktarplee commented 3 years ago

@eyalhir74 This does exactly what you said. It will let Kubernetes assign a GPU to two pods, and it will actually try to schedule pods on different GPUs if possible: if you ask for two shared GPUs, you will get two physical GPUs when possible. We are working on an improvement to this approach that does not require any modifications to the device plugin, since NVIDIA does not want to accept the merge request.

xiaoxubeii commented 2 years ago

@ktarplee @rjanovski - probably a naive/beginner question. What if I want each GPU to count as two? I don't care about scheduling/load balancing etc. I know what I'm doing (CUDA/GPU wise) and just want the plugin to see each GPU as two GPUs. Is there a simple solution for that? I was under the impression that I could just play with the .go files here and manage to do this, however I still couldn't get two pods to run on the same GPU. Any ideas/suggestions?

Thanks!

We provide customers with GPU sharing in production through our self-developed open-source GPU framework, Nano GPU, which provides shared scheduling and allocation of GPU cards under Kubernetes. However, it requires installing an additional scheduler extender and device plugin. BTW, it is difficult to fully implement container-level GPU sharing while depending only on native nvidia-docker.

amybachir commented 2 years ago

@klueska Did the MIG support make it into an official release? I'm looking for a trusted and supported GPU sharing solution for production workloads. Thanks!

jmpolom commented 2 years ago

cc

amybachir commented 2 years ago

I was looking into AWS EC2 instances which support MIG mode and I was very disappointed. The only instance type they have available is p4d.24xlarge. I definitely don't need each instance to be this large, which makes the MIG support in this plugin useless for us. Have you all seen any other alternatives? (screenshot from 2022-04-28 omitted)

jmpolom commented 2 years ago

@amybachir is it possible you could get by with fewer, larger instances? Or do you not have enough workload to saturate even a single p4d.24xlarge?

amybachir commented 2 years ago

@jmpolom Unfortunately we don't have enough workload to utilize a whole p4d.24xlarge, which defeats the purpose of sharing GPUs. We also have autoscaling set up, so our workloads scale up and down based on traffic, and we can't have a large node like this with only a fraction of it being utilized.

amybachir commented 2 years ago

@ktarplee I want to try out GPU sharing from your fork as a temporary solution until NVIDIA officially releases an update. I want to use T4s for GPU sharing, which is not currently supported.

I noticed you have another MR open in a GitLab repo; where is the most recent work?

rjanovski commented 2 years ago

UPDATE: NVIDIA now supports time-sharing GPUs on Kubernetes: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/

grgalex commented 1 year ago

@RDarrylR @ktarplee @rjanovski @amybachir

I have just released nvshare, a transparent GPU sharing mechanism without memory size constraints, based on my diploma thesis.

You can check it out at https://github.com/grgalex/nvshare

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 2 months ago

This issue was automatically closed due to inactivity.