Open shnigam2 opened 1 year ago
From the error, it looks like images for the wrong arch are being pulled. Here is the manifest list for the original images. Are other pods (non gpu-operator) running on these worker nodes?
$ docker regctl manifest get registry.k8s.io/nfd/node-feature-discovery:v0.12.1
Name: registry.k8s.io/nfd/node-feature-discovery:v0.12.1
MediaType: application/vnd.docker.distribution.manifest.list.v2+json
Digest: sha256:445ed7b7c8580825c23a6f3835c1f13718fcf72b393f51e852aa5bdda04657e7
Manifests:
Name: registry.k8s.io/nfd/node-feature-discovery:v0.12.1@sha256:d1ceeb01176115bd34c80cbd9fea3fee858ce99ef85a948f0c99bafe7d90e24d
Digest: sha256:d1ceeb01176115bd34c80cbd9fea3fee858ce99ef85a948f0c99bafe7d90e24d
MediaType: application/vnd.docker.distribution.manifest.v2+json
Platform: linux/amd64
Name: registry.k8s.io/nfd/node-feature-discovery:v0.12.1@sha256:9bf668f13883fdb6eb444a2f0de2b44cbab59559ff1593b32ab118f41027b77f
Digest: sha256:9bf668f13883fdb6eb444a2f0de2b44cbab59559ff1593b32ab118f41027b77f
MediaType: application/vnd.docker.distribution.manifest.v2+json
Platform: linux/arm64
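(A quick way to confirm whether any worker nodes report a different CPU architecture than the image being pulled, assuming kubectl access to the cluster; the label column added by -L should match one of the platforms in the manifest list above:)
$ kubectl get nodes -L kubernetes.io/arch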
Hi @shivamerla, thanks for replying.
We fixed the arch issue by pulling the image on the exact node and using it there. All pods went to Running state, but the gpu-operator pod was still reporting the RuntimeClass error for cluster-policy, so we upgraded the Helm chart to try a newer version. Now gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q is going into CrashLoopBackOff. Do we need to add or update more values in values.yaml to overcome this error? Please find more logs for the same attached below.
k get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-54759576bd-bn4j2 1/1 Running 0 11m
gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q 0/1 CrashLoopBackOff 7 (18s ago) 11m
gpu-operator-node-feature-discovery-master-f8785bd48-rg4j6 1/1 Running 0 8h
gpu-operator-node-feature-discovery-worker-2qmd9 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-5gkbl 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-9fgrk 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-d6ljx 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-grv8x 1/1 Running 0 11m
gpu-operator-node-feature-discovery-worker-gvlkg 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-twdbp 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-wzknt
After the recent changes, the following error appeared:
1.6939336311307774e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
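(For context: the node.k8s.io/v1beta1 RuntimeClass API was removed in Kubernetes 1.25 in favour of node.k8s.io/v1, so this error usually means the operator build in use is still requesting the v1beta1 API while the cluster no longer serves it. A quick check of what the cluster actually serves:)
$ kubectl api-versions | grep node.k8s.io
$ kubectl explain runtimeclass | head -n 3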
The ClusterPolicy is also not creating the NVIDIA objects such as Role, ClusterRole, etc. Output from describing gpu-operator-node-feature-discovery-master:
k describe po gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q -n gpu-operator
Name: gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q
Namespace: gpu-operator
Priority: 0
Service Account: node-feature-discovery
Node: ip-10-222-100-210.ec2.internal/10.222.100.210
Start Time: Tue, 05 Sep 2023 22:35:02 +0530
Labels: app.kubernetes.io/instance=gpu-operator
app.kubernetes.io/name=node-feature-discovery
pod-template-hash=6bc95d5666
role=master
Annotations: cni.projectcalico.org/containerID: bff2b12407bdf973bfe59302918511dd8d0d28a9b609e8b257567e4137169ca0
cni.projectcalico.org/podIP: 100.121.65.217/32
cni.projectcalico.org/podIPs: 100.121.65.217/32
Status: Running
IP: 100.121.65.217
IPs:
IP: 100.121.65.217
Controlled By: ReplicaSet/gpu-operator-node-feature-discovery-master-6bc95d5666
Containers:
master:
Container ID: containerd://2c6ba69f115adf91d1c2414a845a341e655a86b456276e9b671c9eebdd256081
Image: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.13.1
Image ID: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery@sha256:39352c24c30eb4594157d1da41ec4510879cdacffa782ecfc258a87843c90701
Port: 8080/TCP
Host Port: 0/TCP
Command:
nfd-master
Args:
-port=8080
-enable-nodefeature-api
-featurerules-controller=true
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 05 Sep 2023 22:51:05 +0530
Finished: Tue, 05 Sep 2023 22:51:05 +0530
Ready: False
Restart Count: 8
Liveness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ptdmn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ptdmn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 16m default-scheduler Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q to ip-10-222-100-210.ec2.internal
Normal Pulled 15m (x4 over 16m) kubelet Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.13.1" already present on machine
Normal Created 15m (x4 over 16m) kubelet Created container master
Normal Started 15m (x4 over 16m) kubelet Started container master
Warning BackOff 103s (x81 over 16m) kubelet Back-off restarting failed container master in pod gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q_gpu-operator(de9c58c9-f720-49c8-aaaf-d9127b42f8c8)
k logs gpu-operator-node-feature-discovery-master-6bc95d5666-dlf2q -n gpu-operator
exec /usr/bin/nfd-master: exec format error
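(For reference, an exec format error means the binary inside the pulled image does not match the node's CPU architecture. Assuming the mirrored tag still carries a multi-arch manifest list, the available platforms can be compared against the node architecture with:)
$ docker regctl manifest get registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.13.1
$ kubectl get nodes -L kubernetes.io/arch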
@shivamerla We are now able to get all the pods (gpu-operator, nfd-worker and nfd-master) into Running state, but gpu-operator is still giving the RuntimeClass error. Can you please suggest why cluster-policy is not creating the RuntimeClass object and the other objects such as the CRDs and CRs? Do I need to pass additional values in values.yaml for this?
@shivamerla Here are the values we are using for gpu-operator version v23.6.0. Please let us know if anything needs to be added explicitly for cluster-policy to create the NVIDIA RuntimeClass, CRs and other CRDs.
source:
path: deployments/gpu-operator
repoURL: https://github.com/NVIDIA/gpu-operator.git
targetRevision: v23.6.0
helm:
releaseName: gpu-operator
values: |-
validator:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
daemonsets:
priorityClassName: system-node-critical
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
operator:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: gpu-operator
# If version is not specified, then default is to use chart.AppVersion
# version: v1.7.1
imagePullSecrets: [jfrog-auth]
defaultRuntime: containerd
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Equal"
value: ""
effect: "NoSchedule"
driver:
enabled: true
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: nvidia-kmods-driver-flatcar
version: sha256:c3cd6455b1b853744235c00ed4144d03c5466996dc098bb40f669f25ccb79b34
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
toolkit:
enabled: true
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: container-toolkit
version: v1.13.0-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
devicePlugin:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
dcgm:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: 3.1.7-1-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
dcgmExporter:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: dcgm-exporter
imagePullSecrets:
- frog-auth
version: 3.1.7-3.1.4-ubuntu20.04
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
gfd:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: gpu-feature-discovery
version: v0.8.0-ubi8
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
migManager:
enabled: true
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: k8s-mig-manager
version: v0.5.2-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
node-feature-discovery:
image:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery
imagePullSecrets:
- name: jfrog-auth
master:
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Equal"
value: ""
effect: "NoSchedule"
worker:
tolerations:
- key: "gpu.kubernetes.io/gpu-exists"
operator: "Equal"
value: ""
effect: "NoSchedule"
nodeSelector:
beta.kubernetes.io/os: linux
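(Once the ClusterPolicy reconciles successfully, the operator should create the nvidia RuntimeClass along with the roles and service accounts; a quick way to verify, assuming the default resource names:)
$ kubectl get clusterpolicy
$ kubectl get runtimeclass nvidia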
Hello @shivamerla, we downgraded the GPU Operator to version 22.9.0 and are now seeing the following issue:
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-gg2dz 0/1 Init:0/1 0 26m
nvidia-container-toolkit-daemonset-gdm5t 0/1 Init:0/1 0 19m
nvidia-dcgm-exporter-sk7qz 0/1 Init:0/1 0 26m
nvidia-device-plugin-daemonset-5m9v5 0/1 Init:0/1 0 26m
nvidia-operator-validator-kmbn2 0/1 Init:0/4 0 26m
All these pods are stuck in Init. Checking the events, we see this error on the toolkit pod:
Warning FailedCreatePodSandBox 4m57s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b12f1475b82859f6d5e83d7675498e4d7c0cdd967be17241b3052a4deb8ecddb/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 6s (x251 over 4m55s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/f4dbd791a02599a14fc611e3855e3a013faa22ce458c404855ff2fd94a11e945/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
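(The fork/exec message suggests containerd is already configured to launch containers with /opt/nvidia-runtime/toolkit/nvidia-container-runtime, but that binary has not been installed on the host yet because the toolkit pod itself cannot start. A check directly on the node, assuming that install path:)
$ ls -l /opt/nvidia-runtime/toolkit/
$ grep -A3 'runtimes.nvidia' /etc/containerd/config.toml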
On the node itself, the nvidia service is failing with the error below:
ip-10-222-100-98 ~ # journalctl -u nvidia
Sep 08 04:18:43 ip-10-222-100-98 systemd[1]: Started NVIDIA Configure Service.
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[1893]: Downloading Flatcar Container Linux Developer Container for version: 3374.2.4
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[2027]: % Total % Received % Xferd Average Speed Time Time Time Current
Sep 08 04:18:43 ip-10-222-100-98 setup-nvidia[2027]: Dload Upload Total Spent Left Speed
Sep 08 04:18:45 ip-10-222-100-98 setup-nvidia[2027]: [316B blob data]
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[1893]: Downloading NVIDIA 510.73.05 Driver
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: % Total % Received % Xferd Average Speed Time Time Time Current
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: Dload Upload Total Spent Left Speed
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: [158B blob data]
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal setup-nvidia[3539]: curl: (22) The requested URL returned error: 404
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Main process exited, code=exited, status=22/n/a
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Failed with result 'exit-code'.
Sep 08 04:19:32 ip-10-222-100-98.ec2.internal systemd[1]: nvidia.service: Consumed 1min 11.161s CPU time.
@shnigam2 in the issue description you say you are using:
GPU Operator Version: gpu-operator:devel-ubi8 v23.3.1
The gpu-operator:devel-ubi8 image is unmaintained and should not be used. Can you confirm how you are installing GPU Operator and where you are downloading the helm chart from? Please refer to the installation instructions in the official documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator
@cdesiniotis We have now installed through helm chart v22.9.2 and all pods except nvidia-container-toolkit are in Running state. The toolkit pod remains in CreateContainerError state with the error below. Please let us know how to fix this read-only filesystem error, as we are using Flatcar 3374.2.4 OS.
(combined from similar events): Error: failed to generate container "47636c5ec8a637b740a19449870da0c5de3eb509a4e645b65f1ec9590e73f13f" spec: failed to generate spec: failed to mkdir "/usr/local/nvidia": mkdir /usr/local/nvidia: read-only file system
@shnigam2 you can specify a custom directory for the container-toolkit installation using the --set toolkit.installDir=<directory> option. Since the default /usr/local/nvidia seems to be read-only in your case, you need to provide a custom directory which is writable on the host.
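In the Helm values this would look roughly as follows; the directory shown is only an example of a writable host path:
toolkit:
  installDir: "/opt/nvidia"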
@shivamerla After changing the installDir we are getting the issue below for the toolkit pod.
Warning FailedCreatePodSandBox 116s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/6d828237f3cb1abdde025fe9daeb3db486f6390b1d0cb07dd0e0a8b4ce450f9f/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 87s (x16 over 114s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b194d3e7820faa44266f4609581d6645dcb99e4c2ff97af85e90ad5e31bd03cb/log.json: no such file or directory): fork/exec /opt/nvidia-runtime/toolkit/nvidia-container-runtime: no such file or directory: unknown
Also, other pods which were working earlier are now stuck in Init state:
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-284p6 0/1 Init:0/1 0 3m59s
nvidia-container-toolkit-daemonset-n4h6p 0/1 Init:0/1 0 3m59s
nvidia-dcgm-exporter-z8chl 0/1 Init:0/1 0 4m
nvidia-device-plugin-daemonset-gxtcv 0/1 Init:0/1 0 4m2s
nvidia-driver-daemonset-q65q7 0/1 Init:0/1 0 4m3s
nvidia-operator-validator-6vwsm 0/1 Init:0/4 0 4m4s
@shivamerla @cdesiniotis Please find the helm values we are using for the v22.9.2 gpu-operator release:
source:
path: deployments/gpu-operator
repoURL: https://github.com/NVIDIA/gpu-operator.git
targetRevision: v22.9.2
helm:
releaseName: gpu-operator
values: |-
validator:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
daemonsets:
priorityClassName: system-node-critical
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
operator:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: gpu-operator
# If version is not specified, then default is to use chart.AppVersion
# version: v1.7.1
imagePullSecrets: [jfrog-auth]
defaultRuntime: containerd
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Equal"
value: ""
effect: "NoSchedule"
driver:
enabled: true
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: nvidia-kmods-driver-flatcar
version: sha256:c3cd6455b1b853744235c00ed4144d03c5466996dc098bb40f669f25ccb79b34
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
toolkit:
env:
- name: CONTAINERD_CONFIG
value: /etc/containerd/config.toml
- name: CONTAINERD_SOCKET
value: /run/containerd/containerd.sock
enabled: true
installDir: "/opt/nvidia"
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: container-toolkit
version: v1.11.0-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
devicePlugin:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
dcgm:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: 3.1.3-1-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
dcgmExporter:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: dcgm-exporter
imagePullSecrets:
- frog-auth
version: 3.1.3-3.1.2-ubuntu20.04
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
gfd:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: gpu-feature-discovery
version: v0.7.0-ubi8
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
migManager:
enabled: true
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia
image: k8s-mig-manager
version: v0.5.0-ubuntu20.04
imagePullSecrets:
- jfrog-auth
tolerations:
- key: gpu.kubernetes.io/gpu-exists
operator: Exists
effect: NoSchedule
node-feature-discovery:
image:
repository: registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery
imagePullSecrets:
- name: jfrog-auth
worker:
tolerations:
- key: "gpu.kubernetes.io/gpu-exists"
operator: "Equal"
value: ""
effect: "NoSchedule"
nodeSelector:
beta.kubernetes.io/os: linux
Please find /etc/containerd/config.toml and /etc/containerd/config-cgroupfs.toml below:
cat /etc/containerd/config.toml
version = 2
# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# set containerd as a subreaper on linux when it is not running as PID 1
subreaper = true
# set containerd's OOM score
oom_score = -999
disabled_plugins = []
# grpc configuration
[grpc]
address = "/run/containerd/containerd.sock"
# socket uid
uid = 0
# socket gid
gid = 0
[plugins."containerd.runtime.v1.linux"]
# shim binary name/path
shim = "containerd-shim"
# runtime binary name/path
runtime = "runc"
# do not use a shim when starting containers, saves on memory but
# live restore is not supported
no_shim = false
[plugins."io.containerd.grpc.v1.cri"]
# enable SELinux labeling
enable_selinux = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
# setting runc.options unsets parent settings
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
cat /etc/containerd/config-cgroupfs.toml
version = 2
# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# set containerd as a subreaper on linux when it is not running as PID 1
subreaper = true
# set containerd's OOM score
oom_score = -999
disabled_plugins = []
# grpc configuration
[grpc]
address = "/run/containerd/containerd.sock"
# socket uid
uid = 0
# socket gid
gid = 0
[plugins."containerd.runtime.v1.linux"]
# shim binary name/path
shim = "containerd-shim"
# runtime binary name/path
runtime = "runc"
# do not use a shim when starting containers, saves on memory but
# live restore is not supported
no_shim = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
# setting runc.options unsets parent settings
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = false
Please let us know the required config.toml and the parameters we need to pass. We are using Flatcar 3374.2.4, and after changing the installDir we have the state below:
Shobhit_Nigam-GUVA@M-F3V9K9R27N solutions-bkp % k get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-nbbpg 0/1 Init:0/1 0 11m
gpu-operator-5444684585-62lgs 1/1 Running 0 34m
gpu-operator-node-feature-discovery-master-c5c8756d-n955p 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-7bqhw 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-84j8k 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-f4jzw 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-njgzv 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-qgcw2 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-r7stg 1/1 Running 0 48m
gpu-operator-node-feature-discovery-worker-t78r8 1/1 Running 0 48m
nvidia-container-toolkit-daemonset-kp8kv 0/1 CrashLoopBackOff 6 (2m4s ago) 7m54s
nvidia-dcgm-exporter-27krj 0/1 Init:0/1 0 11m
nvidia-device-plugin-daemonset-8lxzb 0/1 Init:0/1 0 11m
nvidia-driver-daemonset-mvwgc 1/1 Running 0 12m
nvidia-operator-validator-4fxl8 0/1 Init:0/4 0 11m
k describe po nvidia-container-toolkit-daemonset-kp8kv -n gpu-operator
Name: nvidia-container-toolkit-daemonset-kp8kv
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: ip-10-222-101-240.ec2.internal/10.222.101.240
Start Time: Sat, 09 Sep 2023 09:16:45 +0530
Labels: app=nvidia-container-toolkit-daemonset
controller-revision-hash=5c4c677c5c
pod-template-generation=4
Annotations: cni.projectcalico.org/containerID: 98792da7df56372c71ed6dc720b9e9549157936ba3ce79e0c65bb08a569bc717
cni.projectcalico.org/podIP: 100.119.148.202/32
cni.projectcalico.org/podIPs: 100.119.148.202/32
Status: Running
IP: 100.119.148.202
IPs:
IP: 100.119.148.202
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://e5a5598201e33e994965f7d57d38e77984f1defdb72b07b1d324268282a6d685
Image: registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v22.9.0
Image ID: registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:90fd8bb01d8089f900d35a699e0137599ac9de9f37e374eeb702fc90314af5bf
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 09 Sep 2023 09:16:46 +0530
Finished: Sat, 09 Sep 2023 09:16:46 +0530
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5m82 (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: containerd://5d89194c2f09b55094ee9658dab2f609c9f7275375ca86d793a86ee6394cd6c6
Image: registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.11.0-ubuntu20.04
Image ID: registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:7d26e7ece832f32f80727ff4cafb2aa2f72c79af16655709603bb4bb1efc6f6a
Port: <none>
Host Port: <none>
Command:
bash
-c
Args:
/opt/nvidia-runtime
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 126
Started: Sat, 09 Sep 2023 09:22:35 +0530
Finished: Sat, 09 Sep 2023 09:22:35 +0530
Ready: False
Restart Count: 6
Environment:
RUNTIME_ARGS: --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/config.toml
CONTAINERD_CONFIG: /etc/containerd/config.toml
CONTAINERD_SOCKET: /run/containerd/containerd.sock
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
Mounts:
/host from host-root (ro)
/opt/nvidia from toolkit-install-dir (rw)
/opt/nvidia-runtime from nvidia-local (rw)
/run/nvidia from nvidia-run-path (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5m82 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-local:
Type: HostPath (bare host directory volume)
Path: /opt/nvidia-runtime
HostPathType:
nvidia-run-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /opt/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
containerd-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd
HostPathType:
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
kube-api-access-z5m82:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: registry.cloud/gpu=true
nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: gpu.kubernetes.io/gpu-exists:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m19s default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-kp8kv to ip-10-222-101-240.ec2.internal
Normal Pulled 8m19s kubelet Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v22.9.0" already present on machine
Normal Created 8m19s kubelet Created container driver-validation
Normal Started 8m19s kubelet Started container driver-validation
Normal Started 7m36s (x4 over 8m19s) kubelet Started container nvidia-container-toolkit-ctr
Normal Pulled 6m44s (x5 over 8m19s) kubelet Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.11.0-ubuntu20.04" already present on machine
Normal Created 6m44s (x5 over 8m19s) kubelet Created container nvidia-container-toolkit-ctr
Warning BackOff 3m14s (x26 over 8m17s) kubelet Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-kp8kv_gpu-operator(b3062286-8f0d-42e2-9d85-b86059b43429)
Log of the toolkit pod:
k logs nvidia-container-toolkit-daemonset-kp8kv -n gpu-operator
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
bash: /opt/nvidia-runtime: Is a directory
Whereas the other NVIDIA daemonsets are stuck in Init with the error below:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
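(That message means containerd on those nodes has no runtime named nvidia registered yet. When the container-toolkit pod completes successfully it appends a section roughly like the one below to the containerd config and restarts containerd; the binary path follows toolkit.installDir, so this is a sketch rather than the exact generated output:)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/opt/nvidia/toolkit/nvidia-container-runtime"
Until a section like this exists, any pod scheduled with runtimeClassName: nvidia will keep failing with the same message.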
@shivamerla @cdesiniotis The reference for the image gpu-operator:devel-ubi8 v23.3.1 is here:
https://github.com/NVIDIA/gpu-operator/blob/v23.3.1/deployments/gpu-operator/Chart.yaml
@shnigam2 I'd recommend installing our released helm charts from our official helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
If you need to install the chart from our github repository, then you need to override version and appVersion in Chart.yaml to the desired release version (e.g. v23.3.1).
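A sketch of the two fields to edit in deployments/gpu-operator/Chart.yaml for that approach:
version: v23.3.1
appVersion: v23.3.1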
@cdesiniotis We have used the v23.3.1 image for both operator and validator and changed the parameters below to overcome the /usr/local/nvidia read-only issue:
env:
- name: ROOT
value: /nvidia
- mountPath: /nvidia
name: toolkit-install-dir
- hostPath:
path: /nvidia
type: ""
name: toolkit-install-dir
But now we are getting the error below from the toolkit pod:
k logs nvidia-container-toolkit-daemonset-klv92 -n gpu-operator
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
/bin/bash: /opt/nvidia-runtime: Is a directory
k describe po nvidia-container-toolkit-daemonset-klv92 -n gpu-operator
Name: nvidia-container-toolkit-daemonset-klv92
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: ip-10-222-101-191.ec2.internal/10.222.101.191
Start Time: Mon, 11 Sep 2023 23:49:35 +0530
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=75d6c4bcb
helm.sh/chart=gpu-operator-v1.0.0-devel
pod-template-generation=4
Annotations: cni.projectcalico.org/containerID: cbb7dbc4189165ab4e1211b632f6840ef7c4c4735a78ff7e53f7c6cc8c6dfeec
cni.projectcalico.org/podIP: 100.112.203.160/32
cni.projectcalico.org/podIPs: 100.112.203.160/32
Status: Running
IP: 100.112.203.160
IPs:
IP: 100.112.203.160
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://1f299fb8a2f654e48c2e85e3659d4d334ff0e143a15038de25b37364b1b619f1
Image: registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v23.3.1
Image ID: registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:4e0df156606c4d3b73bd4a3d9a9ead30e056bd9b23271b4a87b3634201c820b4
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 11 Sep 2023 23:49:36 +0530
Finished: Mon, 11 Sep 2023 23:49:36 +0530
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sztkd (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: containerd://9d4cdff547e5d63962b83c12d50e8e07b7e2cabbd7e5db6a2decfea849626a2b
Image: registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04
Image ID: registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:91e028c8177b4896b7d79f08c64f3a84cb66a0f5a3f32b844d909ebbbd7e0369
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/opt/nvidia-runtime
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 126
Started: Mon, 11 Sep 2023 23:52:39 +0530
Finished: Mon, 11 Sep 2023 23:52:39 +0530
Ready: False
Restart Count: 5
Environment:
ROOT: /nvidia
RUNTIME_ARGS: --config /runtime/config-dir/config.toml --socket /runtime/sock-dir/containerd.sock
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/host from host-root (ro)
/nvidia from toolkit-install-dir (rw)
/opt/nvidia-runtime from nvidia-local (rw)
/run/nvidia from nvidia-run-path (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sztkd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-local:
Type: HostPath (bare host directory volume)
Path: /opt/nvidia-runtime
HostPathType:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
nvidia-run-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
kube-api-access-sztkd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: registry.cloud/gpu=true
nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: gpu.kubernetes.io/gpu-exists:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m22s default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-klv92 to ip-10-222-101-191.ec2.internal
Normal Pulled 4m21s kubelet Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:v23.3.1" already present on machine
Normal Created 4m21s kubelet Created container driver-validation
Normal Started 4m21s kubelet Started container driver-validation
Normal Started 3m37s (x4 over 4m21s) kubelet Started container nvidia-container-toolkit-ctr
Warning BackOff 3m1s (x8 over 4m19s) kubelet Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-klv92_gpu-operator(93115974-b27a-4cd8-bccb-7f9c6b428d22)
Normal Pulled 2m48s (x5 over 4m21s) kubelet Container image "registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04" already present on machine
Normal Created 2m48s (x5 over 4m21s) kubelet Created container nvidia-container-toolkit-ctr
Please let us know how to fix this "/bin/bash: /opt/nvidia-runtime: Is a directory" error. Please find the pod status after following the above:
k get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-s975q 0/1 Init:0/1 0 17m
gpu-operator-68c76fff5c-txpln 1/1 Running 1 (19m ago) 4h44m
gpu-operator-node-feature-discovery-master-f8785bd48-flklt 1/1 Running 0 5h27m
gpu-operator-node-feature-discovery-worker-2n4m9 1/1 Running 0 5h27m
gpu-operator-node-feature-discovery-worker-7crgr 1/1 Running 0 5h18m
gpu-operator-node-feature-discovery-worker-g8ws5 1/1 Running 0 5h26m
gpu-operator-node-feature-discovery-worker-h72sm 1/1 Running 0 5h26m
gpu-operator-node-feature-discovery-worker-lh6n7 1/1 Running 0 5h26m
gpu-operator-node-feature-discovery-worker-mp54f 1/1 Running 0 5h26m
gpu-operator-node-feature-discovery-worker-qx4kl 1/1 Running 0 5h26m
nvidia-container-toolkit-daemonset-klv92 0/1 CrashLoopBackOff 5 (2m36s ago) 5m40s
nvidia-dcgm-exporter-pfjfg 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-tqdcz 0/1 Init:0/1 0 17m
nvidia-driver-daemonset-25g89 1/1 Running 0 17m
nvidia-operator-validator-8fb49 0/1 Init:0/4 0 17m
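(With bash -c, the string after -c is treated as the command to run, so handing it the directory path /opt/nvidia-runtime produces exactly the "Is a directory" failure seen above. A way to see what command and args the rendered daemonset actually ended up with, assuming the default daemonset name:)
$ kubectl -n gpu-operator get ds nvidia-container-toolkit-daemonset \
    -o jsonpath='{.spec.template.spec.containers[0].command} {.spec.template.spec.containers[0].args}{"\n"}'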
1. Quick Debug Information
2. Issue or feature description
gpu-operator-node-feature-discovery-worker pods are going into CrashLoopBackOff; the logs are shown below.
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
[ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k logs gpu-operator-node-feature-discovery-worker-9jxwb -n gpu-operator
exec /usr/bin/nfd-worker: exec format error
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh