bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

nvidia-container-cli timeout error when running ECS tasks #3960

Open monirul opened 3 months ago

monirul commented 3 months ago

Image I'm using: aws-ecs-2-nvidia (1.20.0)

What I expected to happen: A container that requires an NVIDIA GPU should run successfully on the ECS variant of Bottlerocket, and the ECS task should complete successfully.

What actually happened: When I tried to run a workload that requires an NVIDIA GPU in the ECS cluster, the ECS task failed with the following error:

```
CannotStartContainerError: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: driver rpc error: timed out: unknown
```

How to reproduce the problem:

  1. Create an ECS cluster.
  2. Provision a p5 instance running the ECS-nvidia variant and configure it to join the ECS cluster created in the first step.
  3. Create a task that runs a workload requiring an NVIDIA GPU (in my case, the NVIDIA smoke test); a task-definition sketch follows this list.
  4. Launch the task in the ECS cluster.
  5. Observe the error message indicating a failure to start the container.
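
For reference, a GPU task can be registered and launched with the AWS CLI roughly as follows. This is only a sketch: the family name, cluster name, container image, and memory value are placeholders, not the exact task used above.

```sh
# Hypothetical task definition that requests one GPU; names and image are placeholders.
cat > gpu-smoke-test.json <<'EOF'
{
  "family": "gpu-smoke-test",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "smoke-test",
      "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
      "command": ["nvidia-smi"],
      "memory": 512,
      "essential": true,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file://gpu-smoke-test.json
aws ecs run-task --cluster my-gpu-cluster --task-definition gpu-smoke-test --launch-type EC2
```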

Root Cause: The issue is caused by a timeout while loading the NVIDIA driver right before running the container. In general, the NVIDIA driver is unloaded when no client is connected to it. If the kernel-mode driver is not already running or attached to a target GPU, invoking any program that attempts to interact with that GPU transparently causes the driver to load and/or initialize the GPU, and that initialization can take long enough to trip the nvidia-container-cli timeout seen above.
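
One way to observe this on an affected instance (assuming shell access, for example through the admin container) is to check persistence mode and time a bare nvidia-smi call:

```sh
# Check whether persistence mode is currently enabled on each GPU.
nvidia-smi --query-gpu=index,persistence_mode --format=csv

# With persistence mode disabled and no other clients attached, each invocation has
# to initialize the driver/GPU state, which is the slow path the prestart hook hits.
time nvidia-smi
```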

Workaround: To avoid the timeout error, we can enable NVIDIA driver persistence mode by running nvidia-smi -pm 1. This keeps the GPUs initialized even when no clients are connected and prevents the kernel module from fully tearing down software and hardware state between clients. As a result, the driver does not have to be loaded again right before running a container, which avoids the timeout error.
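
As a rough sketch, enabling and verifying persistence mode from a root shell looks like the following; how that shell is obtained on Bottlerocket (e.g. via a bootstrap or admin container) is an assumption here.

```sh
# Enable persistence mode on all GPUs (requires root).
nvidia-smi -pm 1

# Confirm it took effect; each GPU should report "Enabled".
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```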

Solution: According to NVIDIA's documentation, to address this error and minimize the initial driver load time, NVIDIA provides a user-space daemon for Linux, nvidia-persistenced. The daemon keeps driver state persistent across CUDA job runs, which is a better and more reliable solution than the persistence-mode workaround above.

Proposal: I propose including the nvidia-persistenced binary, which ships with the NVIDIA driver, in Bottlerocket and running it as a systemd unit so that the NVIDIA driver remains loaded and available, preventing the timeout error from occurring.
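
To make the proposal concrete, here is a minimal sketch of such a unit. It loosely follows the sample nvidia-persistenced.service that NVIDIA distributes with the driver; the binary path, runtime directory, and ordering against docker.service are assumptions that would need to match how Bottlerocket packages the driver, and in Bottlerocket the unit would be built into the image rather than dropped into /etc by hand as shown.

```sh
# Sketch only: install a unit for the persistence daemon and start it at boot.
cat > /etc/systemd/system/nvidia-persistenced.service <<'EOF'
[Unit]
Description=NVIDIA Persistence Daemon
# Start before the container runtime so GPUs are initialized when tasks launch
# (ordering against docker.service is an assumption for the ECS variant).
Before=docker.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now nvidia-persistenced.service
```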

DamienMatias commented 1 month ago

I have a similar error on EKS nodes; not sure if I should create a separate issue 🤔 To give you more details, I'm using Bottlerocket OS 1.20.3 (aws-k8s-1.28), and from what I tested the issue appears with a g5.48xlarge (8 GPUs) but not with a g5.12xlarge (4 GPUs) or smaller.