Closed: dwschulze closed this issue 2 months ago
Running kubectl describe nodes shows that I have 0 GPUs allocatable, which would explain why my pods don't get out of the pending state. The plugin is not recognizing my GPUs.
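(A quick way to check the per-node GPU count the scheduler sees is a custom-columns query like the one below; this is just a sketch and may need adjusting for your kubectl version:)
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"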
You seem to have followed all of the instructions correctly, and your driver seems to be working properly, since you can run nvidia-smi on the host.
Can you verify your nvidia-docker2 installation by running the following:
docker run nvidia/cuda nvidia-smi
Did you mean nvidia-docker run nvidia/cuda nvidia-smi?
$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
$ nvidia-docker run nvidia/cuda nvidia-smi
Mon Jun 8 16:24:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59 Driver Version: 440.59 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:09:00.0 Off | Off |
| 26% 18C P8 9W / 250W | 0MiB / 24449MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
What about the driver version, Kubernetes version, and plugin version? I'm using the latest driver (440.59) and Kubernetes 1.18.3, and whatever version of the plugin the create daemonset command installs. Other people have reported version compatibility problems.
No, I meant just:
docker run nvidia/cuda nvidia-smi
Assuming you followed the instructions in the QuickStart to make nvidia the default runtime for docker, you shouldn't need the nvidia-docker wrapper script (docker alone will work). In fact, setting nvidia as the default runtime is required for Kubernetes support to work.
You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:
{ "default-runtime": "nvidia", <---- This is the important line "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
The directory /etc/docker got overwritten when I was reinstalling things this morning. I made the changes you showed in /etc/docker/daemon.json, restarted kubelet on the nodes, and created the daemonset again on the master. Running kubectl describe nodes still shows 0 for Capacity and Allocatable for GPUs. I tried the docker command on the node again and this is what I get:
$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled
I'm running the latest versions of the driver and kubernetes. Is this maybe a versioning problem?
Did you also restart docker after making that change to daemon.json?
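For reference, a minimal sketch of applying the change on a systemd-based host (docker only re-reads daemon.json on restart) and re-testing:
$ sudo systemctl restart docker
$ docker run nvidia/cuda nvidia-smi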
Shouldn't be a versioning problem.
Until you get the following to work, nothing will work under Kubernetes:
$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled
Restarting docker allows the docker command to run, and kubectl describe nodes now shows 1 GPU Allocatable. When I try to run the examples from this page:
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html
(I have to clone the git repo and branch -- the URLs on that page are bad), all pods are in the pending state. Nothing will run. This is where I was last week.
What should I try next?
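One way to see why the scheduler keeps a pod pending is to read its events; a sketch, where <pod-name> is one of the pending pods:
$ kubectl describe pod <pod-name>
The Events section at the bottom gives the scheduler's reason (for example an "Insufficient nvidia.com/gpu" message).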
Which examples from there are you running?
Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:
kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml
I only glanced at it briefly though, so I could have missed something if there were more.
That one fails to deploy. I've cloned the repo and checked out the examples branch, and when I run deployment.yml from my file system it creates a deployment with 32 replicas that all stay in the pending state. I'm not sure what is supposed to happen when you request more replicas than you have available nodes, but I would expect one of those 32 pods to run. The application itself just sleeps for 100 seconds, and I assume that is to give you a chance to run kubectl exec -it gpu-pod nvidia-smi, where you would use the pod name of a running pod from the replica set.
For me all 32 just stay in the pending state. Same if I run the pod.yml example.
That is why I wonder if I’ve got some kind of version mismatch. There do seem to be some (undocumented) version requirements between the driver, plugin, and Kubernetes.
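For the version question, a couple of hedged checks that usually narrow it down: what the node actually advertises versus what the deployment requests, and whether the plugin registered with the kubelet. The namespace and label below are assumptions based on the standard nvidia-device-plugin.yml, which deploys into kube-system:
$ kubectl describe nodes | grep -B 2 -A 6 'nvidia.com/gpu'
$ kubectl logs -n kube-system -l name=nvidia-device-plugin-ds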
Two things I've noticed.
Creating/deleting the plugin daemonset creates / deletes the file /var/lib/kubelet/device-plugins/nvidia.sock on the nodes.
The nvidia-device-plugin.yml contains
containers:
- image: nvidia/k8s-device-plugin:1.0.0-beta6
There is no nvidia/k8s-device-plugin docker image on my master node. There is also no docker container running on my master node with nvid in the name. I haven't used a daemonset before. Should I be able to see a docker image or container with that name?
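The daemonset schedules one plugin pod per eligible node rather than a container you start by hand, so one way to see where it actually landed (a sketch; assumes the standard manifest, which creates it in the kube-system namespace):
$ kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin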
There shouldn't be any plugins on your master -- it only runs on worker nodes.
Oh, they're on the nodes:
nvidia/k8s-device-plugin 1.0.0-beta6 c0fa7866a301 6 weeks ago 64.2MB
Also, which version of the CUDA developer libraries do I need to work with the source code?
I was just responding to your statement of:
There is no nvidia/k8s-device-plugin docker image on my master node.
Regarding:
Also, which version of the CUDA developer libraries do I need to work with the source code?
I'm not sure what you are asking here. What is "the source code"?
I got the source code by cloning this GitHub repo. The build instructions say:
Without Docker
Build
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
So they expect you to have the CUDA libraries installed, but they don't say which version of CUDA.
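A quick way to see which CUDA toolkit is installed under /usr/local/cuda on the build machine (a sketch):
$ ls -l /usr/local/cuda        # the symlink target usually encodes the version, e.g. cuda-10.2
$ /usr/local/cuda/bin/nvcc --version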
Similar issue here: I'm using EKS (Kubernetes v1.16.8), and it's creating the nodegroup with the correct AMI, that is, the Amazon EKS-optimized accelerated AMI as described here. The instructions there use the 1.0 beta version of your DS, but I've tried deploying it with both Helm 3 and kubectl ... v0.6.0 .... For whatever reason the scheduler is not recognizing the GPU on the host.
FWIW, I have to run sudo docker on the host to get the above command to work, but when I do (e.g. sudo docker run <image-tag> python /tmp/test.py) it works as expected. Maybe there's an issue with the AMI?
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 32C P0 25W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
tensor([[0.4713, 0.7497, 0.5766],
[0.3508, 0.8708, 0.7834]], device='cuda:0')
As suggested in the AWS docs, when I run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" I get <None> for GPU. Do I need something in my node groups to make sure the nvidia.com/gpu label is present?
If I try to run a pod (spec below) I get: Warning FailedScheduling 7s default-scheduler 0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient nvidia.com/gpu
The phrase Insufficient nvidia.com/gpu seems important here, but I'm not sure what combination of labels or annotations is needed to fix it.
kind: Pod
# some stuff redacted...
metadata:
name: pytorch-test
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: <label-name>
operator: In
values:
- <affinity-tag>
tolerations:
- key: "nontraining"
operator: "Exists"
effect: "NoSchedule"
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
restartPolicy: OnFailure
containers:
- name: pytorch-nvidia
image: <image-tag>
command: ["python", "/tmp/test.py"]
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
When I describe the node it shows the following labels and annotations:
Labels: alpha.eksctl.io/cluster-name=<cluster-name>
alpha.eksctl.io/instance-id=<id>
alpha.eksctl.io/nodegroup-name=<nodegroup-name>
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=p3.2xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2b
<label>=<label>
kubernetes.io/arch=amd64
kubernetes.io/hostname=<host-name>
kubernetes.io/os=linux
nvidia.com/gpu=true
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
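Two hedged checks that help narrow this down: whether the device-plugin pod is actually running on that node, and whether the node carries a taint the plugin daemonset does not tolerate. The namespace below is an assumption based on the standard manifest, and <node-name> is a placeholder:
$ kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
$ kubectl describe node <node-name> | grep -i taints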
Quick follow up for anyone who happens upon this thread... for me the issue was that I had a taint on the node group that prevented the daemonset from being scheduled there. To resolve I just added a toleration to that spec like below, and the DS works like a charm with the default AWS accelerated AMI.
spec:
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
- key: <custom-label>
operator: Exists
effect: NoSchedule
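After adding the toleration, the allocatable count should show up again; a quick re-check with the same custom-columns query as above:
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"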
Any update on this?
Facing the same issue. Any update on this?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
I've followed the QuickStart instructions down to the Docs section on my two nodes. I took the .yaml shown in the Running GPU Jobs section and ran it from the master with
kubectl apply -f nvidia.cuda.yaml
I had modified the .yaml to set nvidia.com/gpu: 1 because I only have one GPU on each of my nodes. However, my pod stays in the pending state. I've verified that CUDA binaries run on both nodes.
Are there other steps I need to take to get this plugin to execute GPU jobs? Is it necessary to execute any of the steps below the Docs section, such as the With Docker / Build section, because Option 2 fails? It's not clear what the steps below the Docs section are for.
I'm running on Ubuntu 18.04 with kubernetes 1.18.2.
$ nvidia-smi
Sun May 31 10:54:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:09:00.0 Off | Off |
| 26% 23C P8 8W / 250W | 0MiB / 24449MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+