bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Compatibility with Nvidia GPU Operator #4162

Open dgr237 opened 2 months ago

dgr237 commented 2 months ago

What I'd like: I am evaluating Bottlerocket for EKS and would like to know whether it is compatible with the Nvidia GPU Operator. The Nvidia GPU Operator documentation includes installation steps for various host OSs, but Bottlerocket is not among those listed.

Our requirement is to pin the Nvidia drivers to a specific version that has been tested and certified for use by the business, so I was looking at the Nvidia GPU Operator as a mechanism to do this. We currently build custom AMIs based on AL2. This process is cumbersome: we have to build the AMI and release it for the business to test, and if testing identifies any issues we have to start again and build with another version of the Nvidia drivers.

Using the Nvidia GPU Operator would simplify this process and let the business install different Nvidia drivers independently, without us having to engineer new custom AMIs. Given that AL2 is due to be deprecated in 2025, we are deciding whether to replace the base AMI with Bottlerocket or AL 2023. Whether the Nvidia GPU Operator is supported will be one factor in choosing the host OS.

It would be good to have documentation on whether Bottlerocket is compatible with the Nvidia GPU Operator and, if so, the steps needed to install it.

yeazelm commented 2 months ago

Hello @dgr237, thanks for cutting this issue! We do have some documentation on the GPU operator, but I admit it is a bit buried in the QUICKSTART doc. We don't recommend using the GPU operator with Bottlerocket because the way it operates can cause issues with Bottlerocket. Bottlerocket includes much of what you need, but we recommend adding additional NVIDIA tools such as DCGM and GPU Feature Discovery by installing them in your cluster, following the helm install instructions provided for each project.

Bottlerocket includes the NVIDIA drivers in the root image, and you can't easily change which one is used, so pinning your own driver version isn't going to work with the GPU operator on Bottlerocket. Let me know if that helps!
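
Since the driver ships with the image, one way to approach the "certified driver" requirement is to check which driver a given Bottlerocket image actually reports and compare it with the certified version. Below is a minimal sketch of such a check, not an official Bottlerocket or NVIDIA tool. It assumes it runs inside a GPU-enabled container on the node where `nvidia-smi` is available; `CERTIFIED_DRIVER` and the function names are hypothetical, and the version string is only a placeholder.

```python
import subprocess

# Placeholder value: substitute the driver version your business has certified.
CERTIFIED_DRIVER = "535.183.01"


def parse_driver_versions(smi_output: str) -> set[str]:
    """Return the distinct driver versions in nvidia-smi CSV output.

    With --format=csv,noheader, nvidia-smi prints one line per GPU.
    """
    return {line.strip() for line in smi_output.splitlines() if line.strip()}


def is_certified(smi_output: str, certified: str = CERTIFIED_DRIVER) -> bool:
    """True only if at least one GPU is reported and all report the certified driver."""
    return parse_driver_versions(smi_output) == {certified}


def query_driver_versions() -> str:
    """Ask nvidia-smi for the driver version of each GPU (requires a GPU node)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = query_driver_versions()
    versions = sorted(parse_driver_versions(output))
    status = "certified" if is_certified(output) else "NOT certified"
    print(f"driver versions on this node: {versions} ({status})")
```

Running this against a test node for each candidate Bottlerocket release would show which release carries the certified driver, so the version to pin becomes the Bottlerocket image rather than the driver itself.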