awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

bug(AL2): Updated nvidia-ctk to 1.17.0 from 1.16.2 breaks bootstrapping GPU instances #2054

Closed ryanhockstad closed 1 week ago

ryanhockstad commented 1 week ago

What happened: As of the v20241109 update, nvidia-ctk has been updated and no longer works with the containerd config.toml file.

At the end of the bootstrap.sh file, bootstrap-gpu.sh is invoked, which itself invokes bootstrap-gpu-nvidia.sh. This script executes the following line: nvidia-ctk runtime configure --runtime=$CONTAINER_RUNTIME --set-as-default

For nvidia-ctk versions < 1.17, everything goes well. But the newest version emits the following logs when this command is executed:

could not infer options from runtimes [runc crun]; using defaults
Wrote updated config to /etc/containerd/config.toml

This means the etc/containerd/config.toml template (which is defined in this repository and copied to /etc/containerd/config.toml by the bootstrap.sh script) is completely ignored.

This breaks our EKS clusters because we cannot point PAUSE_IMAGE to a repository that is accessible by our clusters. Likewise, all containerd configuration is stuck with the default settings until this gets fixed.

How to reproduce it (as minimally and precisely as possible): Spin up a GPU EC2 instance with the newest EKS GPU AMI and the PAUSE_CONTAINER_IMAGE environment variable set, and you will see that the value gets overwritten. This ONLY occurs on GPU-enabled EC2 instances, because otherwise the bootstrap-gpu-nvidia.sh script is not run and the containerd/config.toml file is not overwritten.
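
One way to confirm the overwrite described above is to check whether the containerd config still carries the expected pause image after bootstrapping. The following is a minimal sketch, not part of the AMI scripts: the function name config_has_pause_image is hypothetical, and it assumes the pause image is stored under containerd's sandbox_image key in config.toml.

```shell
# Hypothetical helper (not part of amazon-eks-ami).
# Succeeds if the containerd config file ($1) has a sandbox_image line
# whose value contains the expected pause image ($2) as a quoted string.
# Assumption: the pause image is written to the sandbox_image key.
config_has_pause_image() {
  # First grep keeps only sandbox_image assignments; second grep matches
  # the quoted image literally (-F) and only sets the exit status (-q).
  grep -E '^[[:space:]]*sandbox_image[[:space:]]*=' "$1" | grep -qF "\"$2\""
}
```

On an affected node, something like `config_has_pause_image /etc/containerd/config.toml "$PAUSE_CONTAINER_IMAGE" || echo "pause image was overwritten"` would be expected to print the warning.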


cartermckinnon commented 1 week ago

This should be fixed with 1.17.1 of nvidia-container-toolkit, more here: https://github.com/awslabs/amazon-eks-ami/issues/2041

Can you verify you're on 20241109 (not 20241106)?

ryanhockstad commented 1 week ago

Yep, I had the wrong version, I'm currently on 20241106. Will try updating to 20241109.

cartermckinnon commented 1 week ago

Let me know if that doesn't fix things 👍

ryanhockstad commented 1 week ago

I won't be able to run EKS with the updated AMI for a while, but I just tested the nvidia-ctk runtime configure command with nvidia-container-toolkit v1.17.0 and v1.17.1, and I can confirm that 1.17.1 resolves the issue from 1.17.0, so I think this fix will work.
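
For anyone who wants to guard against the affected release, here is a rough sketch of a version check. Both function names are hypothetical. It relies only on what this thread reports (versions before 1.17 work, 1.17.0 is broken, 1.17.1 is fixed) and takes the version text as an argument instead of assuming any particular nvidia-ctk output format.

```shell
# Hypothetical helpers (not part of amazon-eks-ami or nvidia-ctk).

# Print the first x.y.z version number found in the given text ($1).
extract_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeeds only when the text ($1) names version 1.17.0, the release
# this thread reports as ignoring the existing containerd config.
is_affected_toolkit_version() {
  [ "$(extract_version "$1")" = "1.17.0" ]
}
```

A caller could pass in whatever version string it has for the installed toolkit, for example `is_affected_toolkit_version "$toolkit_version" && echo "upgrade to 1.17.1 or later"`, where toolkit_version is a placeholder variable.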

cartermckinnon commented 1 week ago

Sounds good, feel free to update and we can re-open if necessary.