awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed #557

Open · pythonking6 opened this issue 2 weeks ago

pythonking6 commented 2 weeks ago
1. Quick Debug Information

   - OS/Version (e.g. RHEL8.6, Ubuntu22.04): Amazon Linux 2
   - Kernel Version: 5.10.217-205.860.amzn2
   - Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd 1.7.11-1.amzn2.0.1
   - K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS running Kubernetes version 1.29
   - GPU Operator Version: v24.3.0 of the Helm chart

Current AMI for the GPU node: ami-0fa80d89ddbc29d5d

2. Issue or feature description

   The GPU Operator pods themselves are fine, but some validators, such as nvidia-cuda-validator, are stuck in RunContainerError. A Kubernetes job that is spun up fails with the following event:

```
Warning  Failed  3s  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
```
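For context, the mismatch the error points at can be checked directly on the affected node. A quick sketch, assuming shell access to the GPU node and binutils installed (commands are illustrative):

```sh
# Host glibc version: Amazon Linux 2 ships glibc 2.26, older than the required 2.27
ldd --version | head -n 1

# Highest GLIBC symbol version required by the installed toolkit library
objdump -T /usr/local/nvidia/toolkit/libnvidia-container.so.1 \
  | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n 1
```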

3. Steps to reproduce the issue

Install the GPU Operator using the Helm chart, with values set to the following:

```hcl
gpu-operator = {
  description      = "A Helm chart for NVIDIA GPU operator"
  namespace        = "gpu-operator"
  create_namespace = true
  chart            = "gpu-operator"
  chart_version    = "v24.3.0"
  repository       = "https://nvidia.github.io/gpu-operator"
  values = [
    <<-EOT
      toolkit:
        version: v1.13.1-centos7
      operator:
        defaultRuntime: containerd
    EOT
  ]
}
```

Spin up a Kubernetes job that schedules on a p3.2xlarge machine with the following AMI: amazon-eks-node-al2023-x86_64-standard-1.29-v20240605 (a sketch of such a job is shown below).
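A minimal sketch of such a job, assuming the standard EKS instance-type node label; the job name, image tag, and command are illustrative rather than taken from this report:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test                  # illustrative name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.kubernetes.io/instance-type: p3.2xlarge
      containers:
        - name: cuda
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1          # requesting a GPU invokes the NVIDIA runtime hook that fails above
```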

4. Information to attach (optional if deemed irrelevant)

   - [x] Kubernetes pods status (`kubectl get pods -n OPERATOR_NAMESPACE`): all running, with one validator (nvidia-cuda-validator) in RunContainerError.

   - [x] Kubernetes daemonset status (`kubectl get ds -n OPERATOR_NAMESPACE`): all ready; one validator goes into RunContainerError once the GPU node is up and running.

```
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulled   27s (x5 over 2m49s)  kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0" already present on machine
  Normal   Created  27s (x5 over 2m49s)  kubelet  Created container cuda-validation
  Warning  Failed   9s (x5 over 2m48s)   kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
  Warning  BackOff  8s (x8 over 2m46s)   kubelet  Back-off restarting failed container cuda-validation in pod nvidia-cuda-validator-wvf2f_gpu-operator(187e0108-fb5e-4096-a0da-a33eeb120693)
```

   - [x] If a pod/ds is in an error state or pending state (`kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`): some daemonsets were observed in an error state.


   - Output from running nvidia-smi from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`

   - containerd logs: `journalctl -u containerd > containerd.log`

vara-bonthu commented 1 week ago

Lowering the NVIDIA container toolkit version should resolve the GLIBC issue.
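In the Helm values above, that means pinning `toolkit.version` to an older tag whose binaries were built against a glibc no newer than the node's. A sketch, where the exact tag is only illustrative and not a version confirmed in this thread:

```yaml
toolkit:
  # Illustrative lower tag only; choose a build whose glibc requirement matches the node
  version: v1.13.0-centos7
operator:
  defaultRuntime: containerd
```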

Check out the related issues:

https://github.com/NVIDIA/gpu-operator/issues/72

https://github.com/awslabs/data-on-eks/pull/474#issuecomment-2030015800