Open k0nstantinv opened 3 years ago
I would advise reading up on the device plugin framework, which should help you understand the motivation, use cases, and advantages: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md
Containers have all they need to work with the GPU from docker/nvidia-container-runtime, don't they?
Yes, you are correct. The nvidia-container-toolkit stack, which includes libnvidia-container, nvidia-container-runtime, etc., is all you need to run GPU workloads in containers. The NVIDIA device plugin is specific to Kubernetes.
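For illustration, running a GPU workload with just the toolkit stack (no Kubernetes involved) might look like this; the image tag and command are examples, not taken from this thread:

```shell
# Run nvidia-smi inside a CUDA base image. The --gpus flag requires
# Docker 19.03+ with nvidia-container-toolkit installed.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

# Roughly equivalent on setups that register the "nvidia" runtime instead:
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:11.0-base nvidia-smi
```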
I need a GPU resource on the node that will be used by the scheduler
Yes, the device plugin makes the Kubernetes scheduler aware of the GPU resources in your cluster. In your example, you did this manually. The major advantage of the device plugin is that it automates this process for all nodes and allows you to scale your cluster up and down seamlessly.
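As a sketch of what that automation buys you: with the device plugin deployed, pods simply request the `nvidia.com/gpu` resource (the resource name the plugin advertises) and the scheduler handles placement; the pod name, image tag, and quantity below are examples:

```shell
# With the NVIDIA device plugin running, a pod requests GPUs like this.
# Applied via a heredoc so the whole example stays in shell.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example        # hypothetical name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # example image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1      # advertised by the device plugin, not patched by hand
EOF
```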
@cdesiniotis thanks a lot! I've read it recently. I don't understand how the device plugin provides libraries to the pod container when they have already been provided by the nvidia-container-toolkit stack. Where are vars like NVIDIA_VISIBLE_DEVICES declared? How is an app able to read and understand such a var? Do I need a special base image for that? Is it for restricting some GPU capabilities from an app? I've surfed through tens of closed and open issues and still can't understand; it is confusing me a lot.
I would advise reading up on the device plugin framework, which should help you understand the motivation, use cases, advantages: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md
Current design proposal link is https://github.com/kubernetes/design-proposals-archive/blob/acc25e14ca83dfda4f66d8cb1f1b491f26e78ffe/resource-management/device-plugin.md
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Hi! I've read all the k8s docs, I've read all the local docs about the plugin itself, I understand what nvidia-container-runtime is, and I've tried deploying this device plugin as well as the device plugin from GCP. I have no questions about how to deploy it, etc. But...
I just can't understand why I need it. Maybe I've misunderstood something. Let me show you.
I have a 1.14 cluster: a bare-metal node with a Tesla K40c and NVIDIA/CUDA drivers installed.
Here is my nvidia-smi output.
Docker 19.03 along with nvidia-container-runtime is installed and configured.
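For context, "installed and configured" here usually means registering the NVIDIA runtime in Docker's daemon.json; a typical setup looks roughly like this (the binary path is the common default, adjust for your install):

```shell
# Register nvidia-container-runtime with Docker and make it the default runtime,
# then restart the daemon to pick up the change.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker
```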
My setup works
Let me explain what I don't understand. I need a GPU resource on the node that will be used by the scheduler, right? OK, I can PATCH my node with an extended resource as described here
Get the memory count from nvidia-smi and push it right into the node status, something like:
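The commands elided here would look roughly like the following; the resource name, quantity, and node name are examples, not the thread's actual values:

```shell
# Query total GPU memory from nvidia-smi.
nvidia-smi --query-gpu=memory.total --format=csv,noheader

# Advertise a custom extended resource on the node by PATCHing its status
# through the API server (node name and resource name are placeholders;
# "~1" is the JSON Pointer escape for "/").
kubectl proxy --port=8001 &
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1gpu-memory", "value": "11441"}]' \
  http://localhost:8001/api/v1/nodes/<node-name>/status
```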
That's it! My node has the resource.
Here is the pod YAML.
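The pod spec (elided above) would look something like this; the extended resource name is an example and must match whatever was patched onto the node, and the image tag is a placeholder for the digits image mentioned below:

```shell
# A pod consuming the manually advertised extended resource,
# scheduled without any device plugin involved.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: digits-gpu             # hypothetical name
spec:
  containers:
  - name: digits
    image: nvidia/digits:6.0   # example tag
    resources:
      limits:
        example.com/gpu-memory: 11441   # the manually patched resource
EOF
```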
I'm going to deploy it all without the device plugin.
I can see the GPU has been detected, although it says that the model is not supported in that version of the digits image.
The docs say
So I can run any number of pods with GPU-enabled containers in k8s without the device plugin. Containers have all they need to work with the GPU from docker/nvidia-container-runtime, don't they? How could the device plugin help me? What advantages could it give? I'd appreciate any help, advice, links to learn from, or explanations you can give. I just want to make it clear for myself.