NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

GPU Operator giving RuntimeClass error on k8s 1.26 using gpu-operator version v23.6, and cluster-policy not creating nvidia RuntimeClass #575

Open shnigam2 opened 1 year ago

shnigam2 commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

2. Issue or feature description

gpu-operator-node-feature-discovery-worker pods are going into CrashLoopBackOff and the logs show:

exec /usr/bin/nfd-worker: exec format error

3. Steps to reproduce the issue

4. Information to attach (optional if deemed irrelevant)

 - [ ] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`

k logs gpu-operator-node-feature-discovery-worker-9jxwb -n gpu-operator
exec /usr/bin/nfd-worker: exec format error

 - [ ] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
 - [ ] containerd logs `journalctl -u containerd > containerd.log`

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh


**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh) script for debug data collected.

This bundle can be submitted to us via email: **operator_feedback@nvidia.com**
shivamerla commented 1 year ago

From the error, it looks like images for the wrong arch are being pulled. Here is the manifest list for the original images. Are other pods (non-gpu-operator) running on these worker nodes?

$ docker regctl manifest get registry.k8s.io/nfd/node-feature-discovery:v0.12.1
Name:        registry.k8s.io/nfd/node-feature-discovery:v0.12.1
MediaType:   application/vnd.docker.distribution.manifest.list.v2+json
Digest:      sha256:445ed7b7c8580825c23a6f3835c1f13718fcf72b393f51e852aa5bdda04657e7

Manifests:   

  Name:      registry.k8s.io/nfd/node-feature-discovery:v0.12.1@sha256:d1ceeb01176115bd34c80cbd9fea3fee858ce99ef85a948f0c99bafe7d90e24d
  Digest:    sha256:d1ceeb01176115bd34c80cbd9fea3fee858ce99ef85a948f0c99bafe7d90e24d
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/amd64

  Name:      registry.k8s.io/nfd/node-feature-discovery:v0.12.1@sha256:9bf668f13883fdb6eb444a2f0de2b44cbab59559ff1593b32ab118f41027b77f
  Digest:    sha256:9bf668f13883fdb6eb444a2f0de2b44cbab59559ff1593b32ab118f41027b77f
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/arm64
shnigam2 commented 1 year ago

Hi @shivamerla thanks for replying,

We fixed the arch issue by pulling the image directly on the affected node and using it. All pods went into the Running state, but the gpu-operator pod was still reporting the RuntimeClass error for cluster-policy, so we upgraded the helm chart to try a newer version. Now gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q has started going into CrashLoopBackOff. Do we need to add or update more values in values.yaml to overcome this error? Please find more logs attached below.

k get po -n gpu-operator                               
NAME                                                          READY   STATUS             RESTARTS      AGE
gpu-operator-54759576bd-bn4j2                                 1/1     Running            0             11m
gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q   0/1     CrashLoopBackOff   7 (18s ago)   11m
gpu-operator-node-feature-discovery-master-f8785bd48-rg4j6    1/1     Running            0             8h
gpu-operator-node-feature-discovery-worker-2qmd9              1/1     Running            0             11m
gpu-operator-node-feature-discovery-worker-5gkbl              1/1     Running            0             10m
gpu-operator-node-feature-discovery-worker-9fgrk              1/1     Running            0             10m
gpu-operator-node-feature-discovery-worker-d6ljx              1/1     Running            0             10m
gpu-operator-node-feature-discovery-worker-grv8x              1/1     Running            0             11m
gpu-operator-node-feature-discovery-worker-gvlkg              1/1     Running            0             10m
gpu-operator-node-feature-discovery-worker-twdbp              1/1     Running            0             10m
gpu-operator-node-feature-discovery-worker-wzknt 

Recent changes were:

shnigam2 commented 1 year ago

@shivamerla Now we are able to get all pods (gpu-operator, nfd-worker & nfd-master) into the Running state, but gpu-operator is still giving the RuntimeClass error. Can you please suggest why cluster-policy is now not creating the RuntimeClass object and the other objects like the CRDs & CR? Do I need to pass additional values in values.yaml for this?
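
For anyone debugging the same RuntimeClass issue, a minimal sketch of how to check what the ClusterPolicy has created, assuming the default resource name cluster-policy and the gpu-operator namespace used above:

# ClusterPolicy is cluster-scoped; check that it exists and what state it reports
kubectl get clusterpolicy
kubectl describe clusterpolicy cluster-policy

# check whether the nvidia RuntimeClass was created
kubectl get runtimeclass nvidia

# operator log (deployment name assumed from the pod listing above)
kubectl logs -n gpu-operator deploy/gpu-operator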

shnigam2 commented 1 year ago

@shivamerla Here are the values we are using for gpu-operator version v23.6.0. Please let us know if anything needs to be added explicitly to get cluster-policy to create the nvidia RuntimeClass, CR & other CRDs.

source:
        path: deployments/gpu-operator
        repoURL: https://github.com/NVIDIA/gpu-operator.git
        targetRevision: v23.6.0
        helm:
          releaseName: gpu-operator
          values: |-
            validator:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              imagePullSecrets:
                - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            daemonsets:
              priorityClassName: system-node-critical
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            operator:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: gpu-operator
              # If version is not specified, then default is to use chart.AppVersion
              # version: v1.7.1
              imagePullSecrets: [jfrog-auth]
              defaultRuntime: containerd
              tolerations:
              - key: "node-role.kubernetes.io/master"
                operator: "Equal"
                value: ""
                effect: "NoSchedule"
            driver:
              enabled: true
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: nvidia-kmods-driver-flatcar
              version: sha256:c3cd6455b1b853744235c00ed4144d03c5466996dc098bb40f669f25ccb79b34
              imagePullSecrets:
              - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            toolkit:
              enabled: true
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: container-toolkit
              version: v1.13.0-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            devicePlugin:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              imagePullSecrets:
                - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgm:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: 3.1.7-1-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgmExporter:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: dcgm-exporter
              imagePullSecrets:
              - frog-auth
              version: 3.1.7-3.1.4-ubuntu20.04
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            gfd:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: gpu-feature-discovery
              version: v0.8.0-ubi8
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            migManager:
              enabled: true
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: k8s-mig-manager
              version: v0.5.2-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            node-feature-discovery:
              image:
                repository: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery
              imagePullSecrets:
              - name: jfrog-auth
              master:
                tolerations:
                - key: "node-role.kubernetes.io/master"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
              worker:
                tolerations:
                - key: "gpu.kubernetes.io/gpu-exists"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                nodeSelector:
                  beta.kubernetes.io/os: linux
gravops commented 1 year ago

Hello @shivamerla, we downgraded the GPU Operator version to 22.9.0 and are now getting an issue with the following pods:

NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-gg2dz                                 0/1     Init:0/1   0          26m
nvidia-container-toolkit-daemonset-gdm5t                    0/1     Init:0/1   0          19m
nvidia-dcgm-exporter-sk7qz                                  0/1     Init:0/1   0          26m
nvidia-device-plugin-daemonset-5m9v5                        0/1     Init:0/1   0          26m
nvidia-operator-validator-kmbn2                             0/1     Init:0/4   0          26m

All these pods are stuck in Init. Upon checking, we are getting this error in the toolkit pod events:

Warning  FailedCreatePodSandBox  4m57s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b12f1475b82859f6d5e83d7675498e4d7c0cdd967be17241b3052a4deb8ecddb/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
Warning  FailedCreatePodSandBox  6s (x251 over 4m55s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/f4dbd791a02599a14fc611e3855e3a013faa22ce458c404855ff2fd94a11e945/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
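
One way to narrow this down, sketched under the assumption of shell access to the node (paths taken from the error message above), is to check whether the runtime binary containerd is pointing at actually exists on the host, and which nvidia runtime path the containerd config carries:

# does the binary containerd is trying to exec actually exist?
ls -l /opt/nvidia-runtime/toolkit/nvidia-container-runtime

# which path is configured for the nvidia runtime in containerd?
grep -A4 'runtimes.nvidia' /etc/containerd/config.toml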

gravops commented 1 year ago

Checked on the node, and the nvidia service is failing with the error below:

ip-10-222-100-98 ~ # journalctl -u nvidia
Sep 08 04:18:43 ip-10-222-100-98 systemd[1]: Started NVIDIA Configure Service.
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[1893]: Downloading Flatcar Container Linux Developer Container for version: 3374.2.4
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[2027]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[2027]:                                  Dload  Upload   Total   Spent    Left  Speed
Sep 08 04:18:45 ip-10-222-100-98 setup-nvidia[2027]: [316B blob data]
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[1893]: Downloading NVIDIA 510.73.05 Driver
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]:                                  Dload  Upload   Total   Spent    Left  Speed
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: [158B blob data]
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: curl: (22) The requested URL returned error: 404
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Main process exited, code=exited, status=22/n/a
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Failed with result 'exit-code'.
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Consumed 1min 11.161s CPU time.
cdesiniotis commented 1 year ago

@shnigam2 in the issue description you say you are using:

GPU Operator Version: gpu-operator:devel-ubi8 v23.3.1

The gpu-operator:devel-ubi8 image is an unmaintained image that should not be used. Can you confirm how you are installing GPU Operator and where you are downloading the helm chart from? Please refer to the installation instructions in the official documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator

shnigam2 commented 1 year ago

@cdesiniotis We have now installed via helm chart v22.9.2 and all pods except nvidia-container-toolkit are in the Running state. The toolkit pod remains in the CreateContainerError state with the error below. Please let us know how to fix this read-only file system error, as we are using Flatcar 3374.2.4 OS.

 (combined from similar events): Error: failed to generate container "47636c5ec8a637b740a19449870da0c5de3eb509a4e645b65f1ec9590e73f13f" spec: failed to generate spec: failed to mkdir "/usr/local/nvidia": mkdir /usr/local/nvidia: read-only file system
shivamerla commented 1 year ago

@shnigam2 you can specify a custom directory for the container-toolkit installation using the --set toolkit.installDir=<directory> option. Since the default /usr/local/nvidia seems to be read-only in your case, you need to provide a custom directory which is writable on the host.
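
As a sketch of what that looks like (the /opt/nvidia path below is only an example of a writable host directory):

# via the helm CLI
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set toolkit.installDir=/opt/nvidia

# or in values.yaml
toolkit:
  enabled: true
  installDir: /opt/nvidia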

gravops commented 1 year ago

@shivamerla after changing the installDir we are getting the below issue for the toolkit pod.

  Warning  FailedCreatePodSandBox  116s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/6d828237f3cb1abdde025fe9daeb3db486f6390b1d0cb07dd0e0a8b4ce450f9f/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  87s (x16 over 114s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b194d3e7820faa44266f4609581d6645dcb99e4c2ff97af85e90ad5e31bd03cb/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown

Also, other pods which were working earlier are now in the Init state:

NAME                                                        READY   STATUS              RESTARTS   AGE
gpu-feature-discovery-284p6                                 0/1     Init:0/1            0          3m59s
nvidia-container-toolkit-daemonset-n4h6p                    0/1     Init:0/1            0          3m59s
nvidia-dcgm-exporter-z8chl                                  0/1     Init:0/1            0          4m
nvidia-device-plugin-daemonset-gxtcv                        0/1     Init:0/1            0          4m2s
nvidia-driver-daemonset-q65q7                               0/1     Init:0/1            0          4m3s
nvidia-operator-validator-6vwsm                             0/1     Init:0/4            0          4m4s
shnigam2 commented 1 year ago

@shivamerla @cdesiniotis Please find the helm values we are using for the v22.9.2 gpu-operator release:

source:
        path: deployments/gpu-operator
        repoURL: https://github.com/NVIDIA/gpu-operator.git
        targetRevision: v22.9.2
        helm:
          releaseName: gpu-operator
          values: |-
            validator:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              imagePullSecrets:
                - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            daemonsets:
              priorityClassName: system-node-critical
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            operator:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: gpu-operator
              # If version is not specified, then default is to use chart.AppVersion
              # version: v1.7.1
              imagePullSecrets: [jfrog-auth]
              defaultRuntime: containerd
              tolerations:
              - key: "node-role.kubernetes.io/master"
                operator: "Equal"
                value: ""
                effect: "NoSchedule"
            driver:
              enabled: true
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: nvidia-kmods-driver-flatcar
              version: sha256:c3cd6455b1b853744235c00ed4144d03c5466996dc098bb40f669f25ccb79b34
              imagePullSecrets:
              - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            toolkit:
              env:
              - name: CONTAINERD_CONFIG
                value: /etc/containerd/config.toml
              - name: CONTAINERD_SOCKET
                value: /run/containerd/containerd.sock
              enabled: true
              installDir: "/opt/nvidia"
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: container-toolkit
              version: v1.11.0-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            devicePlugin:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              imagePullSecrets:
                - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgm:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: 3.1.3-1-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgmExporter:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: dcgm-exporter
              imagePullSecrets:
              - frog-auth
              version: 3.1.3-3.1.2-ubuntu20.04
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            gfd:
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: gpu-feature-discovery
              version: v0.7.0-ubi8
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            migManager:
              enabled: true
              repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
              image: k8s-mig-manager
              version: v0.5.0-ubuntu20.04
              imagePullSecrets:
              - jfrog-auth
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            node-feature-discovery:
              image:
                repository: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery
              imagePullSecrets:
              - name: jfrog-auth
              worker:
                tolerations:
                - key: "gpu.kubernetes.io/gpu-exists"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                nodeSelector:
                  beta.kubernetes.io/os: linux

Please find /etc/containerd/config.toml & /etc/containerd/config-cgroupfs.toml below:

cat /etc/containerd/config.toml
version = 2

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# set containerd as a subreaper on linux when it is not running as PID 1
subreaper = true
# set containerd's OOM score
oom_score = -999
disabled_plugins = []

# grpc configuration
[grpc]
address = "/run/containerd/containerd.sock"
# socket uid
uid = 0
# socket gid
gid = 0

[plugins."containerd.runtime.v1.linux"]
# shim binary name/path
shim = "containerd-shim"
# runtime binary name/path
runtime = "runc"
# do not use a shim when starting containers, saves on memory but
# live restore is not supported
no_shim = false

[plugins."io.containerd.grpc.v1.cri"]
# enable SELinux labeling
enable_selinux = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
# setting runc.options unsets parent settings
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
cat /etc/containerd/config-cgroupfs.toml
version = 2

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# set containerd as a subreaper on linux when it is not running as PID 1
subreaper = true
# set containerd's OOM score
oom_score = -999
disabled_plugins = []

# grpc configuration
[grpc]
address = "/run/containerd/containerd.sock"
# socket uid
uid = 0
# socket gid
gid = 0

[plugins."containerd.runtime.v1.linux"]
# shim binary name/path
shim = "containerd-shim"
# runtime binary name/path
runtime = "runc"
# do not use a shim when starting containers, saves on memory but
# live restore is not supported
no_shim = false

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
# setting runc.options unsets parent settings
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = false

Please let us know the required config.toml and the parameters we need to pass. We are using Flatcar 3374.2.4, and after changing the installDir we have the below state:

Shobhit_Nigam-GUVA@M-F3V9K9R27N solutions-bkp % k get po -n gpu-operator         
NAME                                                        READY   STATUS             RESTARTS       AGE
gpu-feature-discovery-nbbpg                                 0/1     Init:0/1           0              11m
gpu-operator-5444684585-62lgs                               1/1     Running            0              34m
gpu-operator-node-feature-discovery-master-c5c8756d-n955p   1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-7bqhw            1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-84j8k            1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-f4jzw            1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-njgzv            1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-qgcw2            1/1     Running            0              13m
gpu-operator-node-feature-discovery-worker-r7stg            1/1     Running            0              48m
gpu-operator-node-feature-discovery-worker-t78r8            1/1     Running            0              48m
nvidia-container-toolkit-daemonset-kp8kv                    0/1     CrashLoopBackOff   6 (2m4s ago)   7m54s
nvidia-dcgm-exporter-27krj                                  0/1     Init:0/1           0              11m
nvidia-device-plugin-daemonset-8lxzb                        0/1     Init:0/1           0              11m
nvidia-driver-daemonset-mvwgc                               1/1     Running            0              12m
nvidia-operator-validator-4fxl8                             0/1     Init:0/4           0              11m
Shobhit_Nigam-GUVA@M-F3V9K9R27N solutions-bkp % 
k describe po nvidia-container-toolkit-daemonset-kp8kv -n gpu-operator 
Name:                 nvidia-container-toolkit-daemonset-kp8kv
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 ip-10-222-101-240.ec2.internal/10.222.101.240
Start Time:           Sat, 09 Sep 2023 09:16:45 +0530
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=5c4c677c5c
                      pod-template-generation=4
Annotations:          cni.projectcalico.org/containerID: 98792da7df56372c71ed6dc720b9e9549157936ba3ce79e0c65bb08a569bc717
                      cni.projectcalico.org/podIP: 100.119.148.202/32
                      cni.projectcalico.org/podIPs: 100.119.148.202/32
Status:               Running
IP:                   100.119.148.202
IPs:
  IP:           100.119.148.202
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://e5a5598201e33e994965f7d57d38e77984f1defdb72b07b1d324268282a6d685
    Image:         registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v22.9.0
    Image ID:      registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:90fd8bb01d8089f900d35a699e0137599ac9de9f37e374eeb702fc90314af5bf
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 09 Sep 2023 09:16:46 +0530
      Finished:     Sat, 09 Sep 2023 09:16:46 +0530
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5m82 (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://5d89194c2f09b55094ee9658dab2f609c9f7275375ca86d793a86ee6394cd6c6
    Image:         registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.11.0-ubuntu20.04
    Image ID:      registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:7d26e7ece832f32f80727ff4cafb2aa2f72c79af16655709603bb4bb1efc6f6a
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      /opt/nvidia-runtime
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    126
      Started:      Sat, 09 Sep 2023 09:22:35 +0530
      Finished:     Sat, 09 Sep 2023 09:22:35 +0530
    Ready:          False
    Restart Count:  6
    Environment:
      RUNTIME_ARGS:              --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/config.toml
      CONTAINERD_CONFIG:         /etc/containerd/config.toml
      CONTAINERD_SOCKET:         /run/containerd/containerd.sock
      RUNTIME:                   containerd
      CONTAINERD_RUNTIME_CLASS:  nvidia
    Mounts:
      /host from host-root (ro)
      /opt/nvidia from toolkit-install-dir (rw)
      /opt/nvidia-runtime from nvidia-local (rw)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5m82 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nvidia-local:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/nvidia-runtime
    HostPathType:  
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-z5m82:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              registry.cloud/gpu=true
                             nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 gpu.kubernetes.io/gpu-exists:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  8m19s                   default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-kp8kv to ip-10-222-101-240.ec2.internal
  Normal   Pulled     8m19s                   kubelet            Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created    8m19s                   kubelet            Created container driver-validation
  Normal   Started    8m19s                   kubelet            Started container driver-validation
  Normal   Started    7m36s (x4 over 8m19s)   kubelet            Started container nvidia-container-toolkit-ctr
  Normal   Pulled     6m44s (x5 over 8m19s)   kubelet            Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.11.0-ubuntu20.04" already present on machine
  Normal   Created    6m44s (x5 over 8m19s)   kubelet            Created container nvidia-container-toolkit-ctr
  Warning  BackOff    3m14s (x26 over 8m17s)  kubelet            Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-kp8kv_gpu-operator(b3062286-8f0d-42e2-9d85-b86059b43429)

Log of the toolkit pod:

 k logs  nvidia-container-toolkit-daemonset-kp8kv   -n  gpu-operator       
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
bash: /opt/nvidia-runtime: Is a directory

Whereas the other nvidia daemonsets are in Init with the below error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
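
For context, that sandbox error means containerd has no nvidia runtime entry yet; once the toolkit container completes successfully it normally writes an entry along these lines into the containerd config (a sketch only, with the BinaryName path depending on the configured installDir):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/opt/nvidia/toolkit/nvidia-container-runtime"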
shnigam2 commented 1 year ago

@shivamerla @cdesiniotis The reference to the gpu-operator:devel-ubi8 v23.3.1 image is here:

https://github.com/NVIDIA/gpu-operator/blob/v23.3.1/deployments/gpu-operator/Chart.yaml

cdesiniotis commented 1 year ago

@shnigam2 I'd recommend installing our released helm charts from our official helm repository:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator

If you need to install the chart from our github repository, then you need to override version and appVersion in Chart.yaml to the desired release version (e.g. v23.3.1).
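
A minimal sketch of that Chart.yaml change when installing from a git checkout (the version shown simply mirrors the release mentioned above):

# deployments/gpu-operator/Chart.yaml (excerpt)
version: v23.3.1
appVersion: v23.3.1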

shnigam2 commented 1 year ago

@cdesiniotis We have used the v23.3.1 image for both the operator and validator and changed the below env parameter and mounts to overcome the /usr/local/nvidia read-only issue:

        env:
        - name: ROOT
          value: /nvidia
        - mountPath: /nvidia
          name: toolkit-install-dir

      - hostPath:
          path: /nvidia
          type: ""
        name: toolkit-install-dir

But now we are getting the below error from the toolkit pod:

k logs nvidia-container-toolkit-daemonset-klv92   -n  gpu-operator
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
/bin/bash: /opt/nvidia-runtime: Is a directory
k describe po nvidia-container-toolkit-daemonset-klv92   -n  gpu-operator
Name:                 nvidia-container-toolkit-daemonset-klv92
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 ip-10-222-101-191.ec2.internal/10.222.101.191
Start Time:           Mon, 11 Sep 2023 23:49:35 +0530
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=75d6c4bcb
                      helm.sh/chart=gpu-operator-v1.0.0-devel
                      pod-template-generation=4
Annotations:          cni.projectcalico.org/containerID: cbb7dbc4189165ab4e1211b632f6840ef7c4c4735a78ff7e53f7c6cc8c6dfeec
                      cni.projectcalico.org/podIP: 100.112.203.160/32
                      cni.projectcalico.org/podIPs: 100.112.203.160/32
Status:               Running
IP:                   100.112.203.160
IPs:
  IP:           100.112.203.160
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://1f299fb8a2f654e48c2e85e3659d4d334ff0e143a15038de25b37364b1b619f1
    Image:         registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v23.3.1
    Image ID:      registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:4e0df156606c4d3b73bd4a3d9a9ead30e056bd9b23271b4a87b3634201c820b4
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 11 Sep 2023 23:49:36 +0530
      Finished:     Mon, 11 Sep 2023 23:49:36 +0530
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sztkd (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://9d4cdff547e5d63962b83c12d50e8e07b7e2cabbd7e5db6a2decfea849626a2b
    Image:         registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04
    Image ID:      registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:91e028c8177b4896b7d79f08c64f3a84cb66a0f5a3f32b844d909ebbbd7e0369
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /opt/nvidia-runtime
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    126
      Started:      Mon, 11 Sep 2023 23:52:39 +0530
      Finished:     Mon, 11 Sep 2023 23:52:39 +0530
    Ready:          False
    Restart Count:  5
    Environment:
      ROOT:                                             /nvidia
      RUNTIME_ARGS:                                     --config /runtime/config-dir/config.toml --socket /runtime/sock-dir/containerd.sock
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      RUNTIME:                                          containerd
      CONTAINERD_RUNTIME_CLASS:                         nvidia
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /host from host-root (ro)
      /nvidia from toolkit-install-dir (rw)
      /opt/nvidia-runtime from nvidia-local (rw)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sztkd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nvidia-local:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/nvidia-runtime
    HostPathType:  
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-sztkd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              registry.cloud/gpu=true
                             nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 gpu.kubernetes.io/gpu-exists:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m22s                  default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-klv92 to ip-10-222-101-191.ec2.internal
  Normal   Pulled     4m21s                  kubelet            Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v23.3.1" already present on machine
  Normal   Created    4m21s                  kubelet            Created container driver-validation
  Normal   Started    4m21s                  kubelet            Started container driver-validation
  Normal   Started    3m37s (x4 over 4m21s)  kubelet            Started container nvidia-container-toolkit-ctr
  Warning  BackOff    3m1s (x8 over 4m19s)   kubelet            Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-klv92_gpu-operator(93115974-b27a-4cd8-bccb-7f9c6b428d22)
  Normal   Pulled     2m48s (x5 over 4m21s)  kubelet            Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04" already present on machine
  Normal   Created    2m48s (x5 over 4m21s)  kubelet            Created container nvidia-container-toolkit-ctr

Please let us know how to fix this /bin/bash: /opt/nvidia-runtime: Is a directory error. Please find the pod status after applying the above:

k get po -n gpu-operator                                          
NAME                                                         READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-s975q                                  0/1     Init:0/1           0               17m
gpu-operator-68c76fff5c-txpln                                1/1     Running            1 (19m ago)     4h44m
gpu-operator-node-feature-discovery-master-f8785bd48-flklt   1/1     Running            0               5h27m
gpu-operator-node-feature-discovery-worker-2n4m9             1/1     Running            0               5h27m
gpu-operator-node-feature-discovery-worker-7crgr             1/1     Running            0               5h18m
gpu-operator-node-feature-discovery-worker-g8ws5             1/1     Running            0               5h26m
gpu-operator-node-feature-discovery-worker-h72sm             1/1     Running            0               5h26m
gpu-operator-node-feature-discovery-worker-lh6n7             1/1     Running            0               5h26m
gpu-operator-node-feature-discovery-worker-mp54f             1/1     Running            0               5h26m
gpu-operator-node-feature-discovery-worker-qx4kl             1/1     Running            0               5h26m
nvidia-container-toolkit-daemonset-klv92                     0/1     CrashLoopBackOff   5 (2m36s ago)   5m40s
nvidia-dcgm-exporter-pfjfg                                   0/1     Init:0/1           0               17m
nvidia-device-plugin-daemonset-tqdcz                         0/1     Init:0/1           0               17m
nvidia-driver-daemonset-25g89                                1/1     Running            0               17m
nvidia-operator-validator-8fb49                              0/1     Init:0/4           0               17m