NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Following the QuickStart but my pod is stuck in pending state #176

Closed dwschulze closed 2 months ago

dwschulze commented 4 years ago

I've followed the QuickStart instructions down to the Docs section on my two nodes. I took the .yaml shown in the Running GPU Jobs section and ran it from the master with

kubectl apply -f nvidia.cuda.yaml

I had modified the .yaml to set nvidia.com/gpu: 1 because I only have one GPU on each of my nodes. However, my pod stays in the Pending state:

$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   0/2     Pending   0          24h
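
For reference, roughly what I'm applying (a sketch, not the exact file; the two-container layout follows the Running GPU Jobs example, and the image tags below are placeholders rather than the exact ones from the docs):

# Sketch of nvidia.cuda.yaml applied via a heredoc; each container's GPU limit lowered to 1
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.2-base    # placeholder tag
      resources:
        limits:
          nvidia.com/gpu: 1
    - name: digits-container
      image: nvidia/digits:6.0        # placeholder tag
      resources:
        limits:
          nvidia.com/gpu: 1
EOF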

I've verified that cuda binaries run on both nodes.

Are there other steps I need to take to get this plugin to execute GPU jobs? Is it necessary to execute any of the steps below the Docs section, such as the With Docker Build section? (Option 2 fails for me.)

It's not clear what the steps below the Docs section are for.

I'm running on Ubuntu 18.04 with kubernetes 1.18.2.

$ nvidia-smi
Sun May 31 10:54:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:09:00.0 Off |                  Off |
| 26%   23C    P8     8W / 250W |      0MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

dwschulze commented 3 years ago

Running kubectl describe nodes shows that I have 0 gpus allocatable, which would explain why my pods don't get out of pending state. The plugin is not recognizing my gpus.

klueska commented 3 years ago

You seem to have followed all of the instructions correctly, and your driver seems to be working properly, since you can run nvidia-smi on the host.

Can you verify your nvidia-docker2 installation by running the following:

docker run nvidia/cuda nvidia-smi
dwschulze commented 3 years ago

Did you mean nvidia-docker run nvidia/cuda nvidia-smi ?

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
$ nvidia-docker run nvidia/cuda nvidia-smi
Mon Jun  8 16:24:29 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:09:00.0 Off |                  Off |
| 26%   18C    P8     9W / 250W |      0MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

What about driver version, kubernetes version, and plugin version? I'm using the latest driver (440.59) and kubernetes 1.18.3, and whatever version of the plugin the create daemonset command installs. Other people have reported version compatibility problems.

klueska commented 3 years ago

No, I meant just:

docker run nvidia/cuda nvidia-smi

Assuming you followed the instructions in the QuickStart to make nvidia the default runtime for docker, then you shouldn't need to use the nvidia-docker wrapper script (docker alone will work). In fact, plain docker working this way is required for Kubernetes support to work.

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",   <---- This is the important line
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
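
After editing daemon.json you also need to restart docker and re-create the plugin daemonset so it can register the GPU with kubelet. A minimal sketch, assuming a systemd-managed docker; the manifest URL below matches the 1.0.0-beta6 tag mentioned later in this thread, so adjust it to whatever version you deployed:

# Restart docker so it picks up the new default runtime, then verify it
sudo systemctl restart docker
docker info | grep -i 'default runtime'

# Re-create the plugin daemonset so it can advertise the GPU to kubelet
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml --ignore-not-found
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml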
dwschulze commented 3 years ago

The directory /etc/docker got overwritten when I was reinstalling things this morning. I made the changes you showed in /etc/docker/daemon.json, restarted kubelet on the nodes and created the daemonset again on the master. Running kubectl describe nodes still shows 0 for Capacity and Allocatable for gpus. I tried the docker command on the node again and this is what I get:

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled

I'm running the latest versions of the driver and kubernetes. Is this maybe a versioning problem?


klueska commented 3 years ago

Did you also restart docker after making that change to daemon.json?

klueska commented 3 years ago

Shouldn't be a versioning problem.

Until you get the following to work, nothing will work under Kubernetes:

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled
dwschulze commented 3 years ago

Restarting docker allows the docker command to run, and kubectl describe nodes now shows 1 GPU as Allocatable. When I try to run the examples from this page:

https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html

(I have to clone the git repo and check out the branch; the URLs on that page are broken.) All pods stay in the Pending state. Nothing will run. This is where I was last week.

What should I try next?
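
For anyone diagnosing the same symptom, a sketch of the checks that narrow this down (pod and node names below are placeholders for your own):

# Why is the pod Pending? The Events section shows the scheduler's reason,
# e.g. "Insufficient nvidia.com/gpu".
kubectl describe pod gpu-pod

# Is the device plugin pod actually running on each worker node?
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

# Does the node advertise the GPU resource?
kubectl describe node <node-name> | grep -A6 -i 'allocatable'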


klueska commented 3 years ago

Which examples from there are you running?

Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:

kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml

I only glanced at it briefly though, so I could have missed something if there were more.

dwschulze commented 3 years ago

That one fails to deploy. I've cloned the repo and checked out the examples branch, and when I run deployment.yml from my file system it creates a deployment with 32 replicas that all stay in the Pending state. I'm not sure what is supposed to happen when you request more replicas than you have available nodes, but I would expect at least one of those 32 pods to run. The application itself just sleeps for 100 seconds, and I assume that is to give you a chance to run kubectl exec -it gpu-pod nvidia-smi

where you would use the pod name of a running pod from the replica set.

For me all 32 just stay in the pending state. Same if I run the pod.yml example.

That is why I wonder if I’ve got some kind of version mismatch. There do seem to be some (undocumented) version requirements between the driver, plugin, and Kubernetes.
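
A quick way to separate a capacity problem from a version problem (a sketch; the deployment and pod names below are whatever the example manifest actually creates):

kubectl get deployments                                # find the deployment's name
kubectl scale deployment <deployment-name> --replicas=1
kubectl get pods                                       # watch whether the single replica schedules
kubectl describe pod <pod-name>                        # Events show the scheduler's reason if it stays Pending

If even a single replica stays Pending, the problem is the node's advertised GPU capacity rather than the replica count.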


dwschulze commented 3 years ago

Two things I've noticed.

Creating/deleting the plugin daemonset creates / deletes the file /var/lib/kubelet/device-plugins/nvidia.sock on the nodes.

The nvidia-device-plugin.yml contains

      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6

There is no nvidia/k8s-device-plugin docker image on my master node. There is also no docker container running on my master node with nvid in the name. I haven't used a daemonset before. Should I be able to see a docker image or container with that name?

klueska commented 3 years ago

There shouldn't be any plugins on your master -- it only runs on worker nodes.
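
A quick way to confirm where the plugin pods landed (a sketch; the label selector below matches the static nvidia-device-plugin.yml manifest and may differ if you deployed it another way):

# The daemonset and its pods live in kube-system; the NODE column shows
# that they are scheduled on the workers, not on the master.
kubectl get daemonset -n kube-system | grep nvidia
kubectl get pods -n kube-system -o wide -l name=nvidia-device-plugin-ds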

dwschulze commented 3 years ago

There shouldn't be any plugins on your master -- it only runs on worker nodes.

Oh, they're on the nodes:

nvidia/k8s-device-plugin 1.0.0-beta6 c0fa7866a301 6 weeks ago 64.2MB

Also, which version of the CUDA developer libraries do I need to work with the source code?

klueska commented 3 years ago

I was just responding to your statement of:

There is no nvidia/k8s-device-plugin docker image on my master node.

Regarding:

Also, which version of the CUDA developer libraries do I need to work with the source code?

I'm not sure what you are asking here. What is "the source code"?

dwschulze commented 3 years ago

I got the source code by cloning this github page. The build instructions say:

Without Docker

Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

So they expect you to have cuda libraries installed, but they don’t say which version of cuda.
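
For what it's worth, a sketch of how to see which toolkit that build line would actually pick up on a given machine (assuming the standard /usr/local/cuda symlink layout):

# /usr/local/cuda is normally a symlink to the installed toolkit version
ls -l /usr/local/cuda

# The toolkit's own compiler reports its version
/usr/local/cuda/bin/nvcc --version

# The driver reports the maximum CUDA version it supports (10.2 for the 440.x driver here)
nvidia-smi | head -n 4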


rlantz-cfa commented 3 years ago

Similar issue here: I'm using EKS (Kubernetes v 1.16.8), and it's creating the nodegroup with the correct AMI. That is, the Amazon EKS-optimized accelerated AMI as described here

The instructions there use the 1.0 beta version of your daemonset, but I've tried deploying it with both Helm 3 and kubectl ... v0.6.0 .... For whatever reason the scheduler is not recognizing the GPU on the host.

FWIW, I have to run sudo docker on the host to get the above command to work, but when I do (e.g. sudo docker run <image-tag> python /tmp/test.py) it works as expected. Maybe there's an issue with the AMI?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

tensor([[0.4713, 0.7497, 0.5766],
        [0.3508, 0.8708, 0.7834]], device='cuda:0') 

As suggested in the AWS docs, when I run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" I get <None> for GPU. Do I need something in my node groups to make sure the nvidia.com/gpu label is present?

If I try to run a pod (spec below) I get: Warning FailedScheduling 7s default-scheduler 0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient nvidia.com/gpu

The phrase Insufficient nvidia.com/gpu seems important here, but I'm not sure what combination of label or annotation to add to fix it.

kind: Pod
# some stuff redacted...
metadata:
  name: pytorch-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: <label-name>
            operator: In
            values: 
              - <affinity-tag>
  tolerations:
  - key: "nontraining"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
  - name: pytorch-nvidia
    image: <image-tag>
    command: ["python", "/tmp/test.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU

When I describe the node it shows the following labels and annotations:

Labels:             alpha.eksctl.io/cluster-name=<cluster-name>
                    alpha.eksctl.io/instance-id=<id>
                    alpha.eksctl.io/nodegroup-name=<nodegroup-name>
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=p3.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    <label>=<label>
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=<host-name>
                    kubernetes.io/os=linux
                    nvidia.com/gpu=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
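
The nvidia.com/gpu=true label by itself doesn't make the resource allocatable; that only happens once the device-plugin pod registers with kubelet on that node. A sketch of checking whether the daemonset actually placed a pod there (daemonset name and namespace may differ depending on how it was deployed):

# DESIRED vs. READY shows whether the daemonset could schedule pods at all
kubectl get daemonset -n kube-system | grep nvidia

# The NODE column shows which nodes got a plugin pod; a missing GPU node here
# usually means a taint or nodeSelector is keeping the daemonset off it
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

# Taints on the GPU node that the daemonset would need to tolerate
kubectl describe node <gpu-node-name> | grep -i taint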
rlantz-cfa commented 3 years ago

Quick follow-up for anyone who happens upon this thread: for me the issue was that I had a taint on the node group that prevented the daemonset from being scheduled there. To resolve it I just added tolerations to the daemonset spec like below, and the DS works like a charm with the default AWS accelerated AMI.

spec:
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: <custom-label>
    operator: Exists
    effect: NoSchedule
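
After adding the tolerations (the <custom-label> key above is whatever taint your node group applies), the plugin pod should land on the GPU node and the resource should appear; a quick check:

# The plugin pod should now be running on the GPU node
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

# And the node should report an allocatable GPU
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"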
tuvovan commented 3 years ago

Any update on this?

RakeshRaj97 commented 1 year ago

Facing same issue. Any update on this?

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 2 months ago

This issue was automatically closed due to inactivity.