Open monirul opened 3 months ago
I have a similar error on EKS nodes, not sure if I should create a separate issue 🤔
To give you more details, I'm using Bottlerocket OS 1.20.3 (aws-k8s-1.28), and from what I tested, the issue appears with a g5.48xlarge (8 GPUs) and not with a g5.12xlarge (4 GPUs) or smaller.
Image I'm using: aws-ecs-2-nvidia (1.20.0)
What I expected to happen: A container that requires an NVIDIA GPU should run successfully on the ECS variant of Bottlerocket, and the ECS task should complete successfully.
What actually happened: When I tried to run a workload that requires an NVIDIA GPU in the ECS cluster, the ECS task failed with an error.
How to reproduce the problem:
Root Cause The issue is caused by a timeout while loading the driver right before the container runs. Generally, the NVIDIA driver gets unloaded when no client is connected to it. If the kernel mode driver is not already running or connected to a target GPU, the invocation of any program that attempts to interact with that GPU will transparently cause the driver to load and/or initialize the GPU.
Workaround: To avoid the timeout error, we can enable NVIDIA driver persistence mode by running the command nvidia-smi -pm 1. Persistence mode keeps the GPUs initialized even when no clients are connected and prevents the kernel module from fully unloading software and hardware state. This way, the driver does not need to be loaded right before each container runs, which avoids the timeout error.
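For anyone who wants to try the workaround, here is a minimal sketch that enables persistence mode and then prints the resulting state for each GPU. It assumes root privileges and that nvidia-smi is on PATH. The NVIDIA_SMI variable and the function name are hypothetical, added only so the binary location can be overridden. As far as I know, this setting does not survive a reboot, so it would have to be reapplied on each boot.

```shell
#!/bin/sh
# Sketch only: enable persistence mode, then print the per-GPU state.
# NVIDIA_SMI is a hypothetical override (not part of the issue) so the
# binary location can be changed; it defaults to nvidia-smi on PATH.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

enable_persistence_mode() {
    # -pm 1 enables persistence mode on all GPUs; requires root.
    "$NVIDIA_SMI" -pm 1 || return 1
    # Query the resulting state: one line per GPU with index and mode.
    "$NVIDIA_SMI" --query-gpu=index,persistence_mode --format=csv,noheader
}
```

Calling enable_persistence_mode as root should print one line per GPU with its index and persistence state. A non-zero return means the nvidia-smi call itself failed.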
Solution According to the NVIDIA documentation, to address this error and minimize the initial driver load time, NVIDIA offers a user-space daemon for Linux. This daemon keeps driver state persistent across CUDA job runs, which makes it a better and more reliable solution than the persistence mode workaround.
Proposal I propose to include the nvidia-persistenced binary, provided by the NVIDIA driver, in Bottlerocket, and to run it as a systemd unit so that the NVIDIA driver remains loaded and available, preventing the timeout error from occurring.
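To make the proposal a bit more concrete, here is a rough sketch of what such a unit could look like, written as a shell function that prints the unit text. The function name, the default binary path /usr/bin/nvidia-persistenced, and the unit settings are all assumptions for illustration. The actual path and packaging in Bottlerocket may well differ, and the unit would presumably ship in the image rather than be generated at runtime.

```shell
#!/bin/sh
# Sketch only: print a systemd unit that runs nvidia-persistenced.
# The default binary path and every unit setting below are assumptions,
# not something taken from Bottlerocket or from this issue.
print_persistenced_unit() {
    bin="${1:-/usr/bin/nvidia-persistenced}"
    cat <<EOF
[Unit]
Description=NVIDIA Persistence Daemon

[Service]
# The daemon forks into the background by default, hence Type=forking.
Type=forking
ExecStart=${bin} --verbose
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, print_persistenced_unit /path/to/nvidia-persistenced prints the unit text with that path in ExecStart, which makes it easy to review before packaging it.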