Quick Debug Information
OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Amazon Linux 2
Kernel Version: 5.10.217-205.860.amzn2
Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd 1.7.11-1.amzn2.0.1
K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS, Kubernetes 1.29
GPU Operator Version: v24.3.0 (Helm chart)
Current AMI for the GPU node: ami-0fa80d89ddbc29d5d
Issue or feature description
The GPU Operator pods themselves come up fine, but some validators, such as nvidia-cuda-validator, end up in RunContainerError. A Kubernetes job spun up on the GPU node fails with the following event:
```
Warning  Failed  3s  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
```
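The message points at a glibc mismatch between the host and the pinned toolkit build. A minimal sketch of how to check this, assuming SSH/SSM access to the GPU node and that binutils is installed (the library path is copied from the error above):

```sh
# Host glibc version; Amazon Linux 2 ships glibc 2.26, which is older
# than the GLIBC_2.27 the toolkit binary is asking for.
ldd --version | head -n1

# GLIBC symbol versions required by the library named in the error
# (objdump comes from binutils; path taken from the message above).
objdump -T /usr/local/nvidia/toolkit/libnvidia-container.so.1 \
  | grep -o 'GLIBC_[0-9.]*' | sort -Vu
```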
Steps to reproduce the issue
Install the GPU Operator using the Helm chart, with values set to the following:
```hcl
gpu-operator = {
  description      = "A Helm chart for NVIDIA GPU operator"
  namespace        = "gpu-operator"
  create_namespace = true
  chart            = "gpu-operator"
  chart_version    = "v24.3.0"
  repository       = "https://nvidia.github.io/gpu-operator"
  values = [
    <<-EOT
      toolkit:
        version: v1.13.1-centos7
      operator:
        defaultRuntime: containerd
    EOT
  ]
}
```
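For reproduction outside Terraform, a plain Helm CLI equivalent of the block above might look like this (the `nvidia` repo alias is an assumption; the flags mirror the values above):

```sh
# Add the GPU Operator chart repo and install the same pinned versions.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v24.3.0 \
  --set toolkit.version=v1.13.1-centos7 \
  --set operator.defaultRuntime=containerd
```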
Spin up a Kubernetes job that schedules onto a p3.2xlarge node running the following AMI: amazon-eks-node-al2023-x86_64-standard-1.29-v20240605. A sketch of such a job follows.
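A minimal sketch of such a job (the job name and image are placeholders, not from the original report); requesting one GPU is enough to land it on the p3.2xlarge node, where container creation fails at the nvidia hook as shown above:

```sh
# Hypothetical minimal reproducer: any container requesting a GPU
# fails the same way at the nvidia-container-cli hook.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```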
Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`
  All running except one validator, which is in RunContainerError (nvidia-cuda-validator).
- [x] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
  All ready; one validator goes into RunContainerError once the GPU node is up and running.
  ![image](https://github.com/awslabs/data-on-eks/assets/131185632/68341f43-9055-47af-89ab-02e879d8ec42)
- [x] If a pod/ds is in an error or pending state: `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
  Some pods were observed in an error state, namely the CUDA validator:
```
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulled   27s (x5 over 2m49s)  kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0" already present on machine
  Normal   Created  27s (x5 over 2m49s)  kubelet  Created container cuda-validation
  Warning  Failed   9s (x5 over 2m48s)   kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
  Warning  BackOff  8s (x8 over 2m46s)   kubelet  Back-off restarting failed container cuda-validation in pod nvidia-cuda-validator-wvf2f_gpu-operator(187e0108-fb5e-4096-a0da-a33eeb120693)
```
- [x] If a pod/ds is in an error or pending state: `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
  Some daemonsets were observed in an error state.
- [ ] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
- [ ] containerd logs: `journalctl -u containerd > containerd.log`
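When collecting that log, a quick filter like the sketch below (assuming the node is reachable and the unit is named `containerd`) surfaces the failing-hook lines quoted in the events above:

```sh
# Pull only the lines mentioning the failing hook / GLIBC symbol
# from the containerd journal on the affected GPU node.
journalctl -u containerd --no-pager | grep -iE 'glibc|nvidia-container-cli|libnvidia-container'
```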