awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Announcement: NVIDIA 535 series drivers will be backported to EKS optimized Accelerated AMIs with older Kubernetes versions #1448

Closed (ptailor1193 closed this 10 months ago)

ptailor1193 commented 1 year ago

With Kubernetes version 1.28 or later, the EKS optimized Accelerated AMIs support NVIDIA 535 series or later drivers out of the box. We plan to backport these drivers to older Kubernetes versions, starting with 1.27 on October 10th, 2023.

cartermckinnon commented 1 year ago

⚠️ Note that this is a breaking change!

As noted in the 1.28 launch notes, the 535 series drivers are not compatible with the older chipsets used in the p2 instance family. This change is necessary to support the latest-and-greatest hardware in the p5 instance family. Instances in the p3 and p4 families will not be impacted by this change.
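
For anyone auditing node groups ahead of this change, here is a minimal sketch of the compatibility rule described above. The helper names and the `node_types` list are hypothetical, and the set of incompatible families is taken only from this comment (p2), not from any AWS API.

```python
# Hypothetical helper: the family list below reflects only what this thread
# states (p2 is incompatible with the 535 series drivers).
INCOMPATIBLE_FAMILIES = {"p2"}


def instance_family(instance_type: str) -> str:
    """Return the family prefix of an EC2 instance type, e.g. 'p2' for 'p2.xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def is_incompatible_with_535(instance_type: str) -> bool:
    """True if the instance type belongs to a family listed as incompatible."""
    return instance_family(instance_type) in INCOMPATIBLE_FAMILIES


# Example inventory (made up for illustration).
node_types = ["p2.xlarge", "p3.2xlarge", "p4d.24xlarge", "p5.48xlarge"]
affected = [t for t in node_types if is_incompatible_with_535(t)]
print(affected)
```

Any instance type reported in `affected` would need to stay on an AMI with the older driver or move to a newer instance family.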

tom-dixon-fiveai commented 1 year ago

Hiya!

Will you also be backporting the 5.10 Linux Kernel with this?

Can I ask how many EKS versions you're going to go back as well please?

cartermckinnon commented 1 year ago

> Will you also be backporting the 5.10 Linux Kernel with this?

Yep! The older NVIDIA drivers are the only thing keeping us on 5.4.

> Can I ask how many EKS versions you're going to go back as well please?

We intend to make this change in 1.25+.

tom-dixon-fiveai commented 1 year ago

Awesome, thanks very much! :D

tom-dixon-fiveai commented 1 year ago

Ah sorry, one more question: is there an ETA/schedule at all for the 1.25 version?

sidewinder12s commented 1 year ago

Is the GPU AMI build process planned to be exposed more in this repo with this change or is that not changing?

tom-dixon-fiveai commented 11 months ago

Hello again! :)

@cartermckinnon do you know when this might be happening, or otherwise have an update on this please?

willgleich commented 11 months ago

@cartermckinnon I didn't see an eks-ami release on October 10th. Has the 1.27 backport been released?

Any timeline for 1.26?

cartermckinnon commented 11 months ago

> I didn't see an eks-ami release on October 10th

A recent change in the kernel (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9011e49d54dcc7653ebb8a1e05b5badb5ecfa9f9) makes our current combination of NVIDIA and EFA drivers incompatible. We expect to have a path forward shortly, but we have to pause our backports in the meantime.

> Is the GPU AMI build process planned to be exposed more in this repo with this change or is that not changing?

Yes, we plan to upstream the NVIDIA-related scripts.

cartermckinnon commented 10 months ago

The next AMI release will extend the 535-series NVIDIA driver and CUDA 12 to Kubernetes versions 1.25 and above.

ptailor1193 commented 10 months ago

NVIDIA 535 series drivers have now been backported to EKS optimized Accelerated AMIs for Kubernetes versions 1.25+.
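
As a rough illustration of the "1.25+" cutoff stated in this thread, the sketch below checks whether a given Kubernetes version falls in the range that receives the 535 series drivers. The function names are hypothetical, and the threshold is taken only from the comments above. It compares versions numerically, since a plain string comparison would wrongly rank "1.9" above "1.25".

```python
# Threshold from this thread: the backport covers Kubernetes 1.25 and above.
MIN_BACKPORT_VERSION = (1, 25)


def parse_k8s_version(version: str) -> tuple:
    """Parse '1.27', 'v1.27', or '1.27.3' into a (major, minor) tuple of ints."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def gets_535_driver(version: str) -> bool:
    """True if Accelerated AMIs for this Kubernetes version are in the backport range."""
    return parse_k8s_version(version) >= MIN_BACKPORT_VERSION


for v in ["1.24", "1.25", "1.27.3", "v1.28"]:
    print(v, gets_535_driver(v))
```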