Closed: ryanhockstad closed this issue 1 week ago
This should be fixed with 1.17.1 of nvidia-container-toolkit; more here: https://github.com/awslabs/amazon-eks-ami/issues/2041

Can you verify you're on 20241109 (not 20241106)?
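For anyone checking this on their own nodes, a rough sketch of how to verify both versions (the Kubernetes version in the SSM path is only a placeholder):

# On the node: confirm which nvidia-container-toolkit the AMI shipped
nvidia-ctk --version
rpm -q nvidia-container-toolkit

# From a workstation: look up the currently recommended AL2 GPU AMI for your
# Kubernetes version (1.30 here is just an example)
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.30/amazon-linux-2-gpu/recommended/image_id \
  --query 'Parameter.Value' --output text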
Yep, I had the wrong version, I'm currently on 20241106. Will try updating to 20241109.
Let me know if that doesn't fix things 👍
I won't be able to test this on EKS with the updated AMI for a while, but I just tried the nvidia-ctk runtime configure command with nvidia-container-toolkit v1.17.0 and v1.17.1, and I can confirm that 1.17.1 resolves the issue seen with 1.17.0, so I think this fix will work.
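For reference, a minimal sketch of that comparison on a containerd host (it assumes you can install each toolkit version and uses the default config path):

# Snapshot the existing config, re-run the configure step, and compare
sudo cp /etc/containerd/config.toml /tmp/config.toml.before
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
diff /tmp/config.toml.before /etc/containerd/config.toml
# With v1.17.0 the existing settings are dropped; with v1.17.1 they should be
# preserved, with only the nvidia runtime entries added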
Sounds good, feel free to update and we can re-open if necessary.
What happened: As of the v20241109 update, nvidia-ctk has been updated and no longer preserves the existing containerd config.toml file.
At the end of the bootstrap.sh file, bootstrap-gpu.sh is invoked, which itself invokes bootstrap-gpu-nvidia.sh. This script executes the following line:
nvidia-ctk runtime configure --runtime=$CONTAINER_RUNTIME --set-as-default
For nvidia-ctk versions < 1.17, everything goes well. But the newest version emits the following logs when this command is executed:
This means the etc/containerd/config.toml (which is defined here and copied to /etc/containerd/config.toml by the bootstrap.sh script) is completely ignored.
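A rough way to check this on an affected node (assuming the values bootstrap.sh templates in, such as sandbox_image, are what you expect to survive):

# If the templated config survived, sandbox_image should still point at the
# pause image passed to bootstrap.sh rather than the upstream default
grep -E 'sandbox_image|default_runtime_name' /etc/containerd/config.toml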
The result is that our EKS clusters break, because we cannot point PAUSE_IMAGE to a repository that is accessible from our clusters. Likewise, all other containerd configuration is stuck with the default settings until this gets fixed.
How to reproduce it (as minimally and precisely as possible): Spin up a GPU EC2 instance using the newest EKS GPU AMI with the PAUSE_CONTAINER_IMAGE env variable set, and you will see that the setting gets overwritten. This ONLY occurs on GPU-enabled EC2 instances, because otherwise the bootstrap-gpu-nvidia.sh script is not run and the containerd config.toml file is not overwritten.
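A rough user-data sketch of that reproduction (the registry, tag, and cluster name are placeholders, and it assumes PAUSE_CONTAINER_IMAGE is exported before bootstrap.sh runs, as described above):

#!/bin/bash
# Placeholder registry, tag, and cluster name
export PAUSE_CONTAINER_IMAGE=111122223333.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5
/etc/eks/bootstrap.sh my-cluster
# Once bootstrap (and bootstrap-gpu-nvidia.sh) has run, the custom value is gone:
grep sandbox_image /etc/containerd/config.toml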
Environment: