bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Disable GSP Firmware for select instance families. #3979

Closed larvacea closed 4 months ago

larvacea commented 5 months ago

Description of changes:

For some instance families in AWS, we wish to disable GSP firmware download in the NVIDIA kmod. We can do this by creating a configuration file in /etc/modprobe.d that sets the desired option, but only when the instance family we fetch from IMDS is one of the affected families. This must happen before we invoke driverdog to load the kmod. In Bottlerocket, /etc is ephemeral, so we have to do this on each boot.
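
The approach described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the family list, the file name `nvidia-gsp.conf`, and the function names are assumptions, and the instance type is passed in as a parameter rather than fetched from IMDS. The sketch also assumes the option in question is the NVIDIA module parameter `NVreg_EnableGpuFirmware`, which the description does not name.

```python
from pathlib import Path
from typing import Optional

# Assumption: illustrative family list; the PR's actual list is not shown here.
GSP_DISABLED_FAMILIES = {"g4dn", "g5"}

# Assumption: the option is the NVIDIA module parameter for GSP firmware use.
GSP_OPTION_LINE = "options nvidia NVreg_EnableGpuFirmware=0\n"


def instance_family(instance_type: str) -> str:
    """Return the family part of an EC2 instance type ('g5' for 'g5.xlarge')."""
    return instance_type.split(".", 1)[0]


def write_gsp_conf(instance_type: str, modprobe_dir: Path) -> Optional[Path]:
    """Write the modprobe drop-in if the family is listed.

    Returns the path of the written file, or None if nothing was written.
    """
    if instance_family(instance_type) not in GSP_DISABLED_FAMILIES:
        return None
    modprobe_dir.mkdir(parents=True, exist_ok=True)
    conf = modprobe_dir / "nvidia-gsp.conf"  # hypothetical file name
    conf.write_text(GSP_OPTION_LINE)
    return conf
```

On a real host this would be called with `Path("/etc/modprobe.d")` on every boot, before the kmod is loaded, because /etc does not persist.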

Testing done:

Manual testing demonstrates that the setting takes effect on the desired instance families, and does not take effect on other instance families. Tested on each kernel version, on aws-eks and aws-ecs variants.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

bcressey commented 5 months ago

> For some instance families in AWS, we wish to disable GSP firmware download in the NVIDIA kmod.

Are we able to do this by checking for the hardware in question, or querying for some shared attribute with nvidia-smi, rather than checking a list of instance types?

The reason I ask is that hardware should be the same across a set of instance types, while code that checks the instance type will need to be revisited if newer g4 or g5 types are launched.

larvacea commented 5 months ago

I can take a look. As I understand the technical note today, we must set the option before we load the nvidia kmod. I don't know if nvidia-smi would run without that kmod loaded, and I don't know if there's some other way to query the GPU and discover whether it should or should not download GSP firmware. So let me go looking. I agree that making the decision based on what hardware we see would be preferable to pattern-matching the EC2 instance family.
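
For illustration, one possible way to identify the GPU without the nvidia kmod loaded is to read PCI vendor and device IDs from sysfs, which the kernel populates on its own. The sketch below is an assumption about how that could look and is not necessarily what this PR ended up doing. The function names are hypothetical, and the set of device IDs that should skip GSP firmware is left to the caller.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def pci_ids(sysfs_pci_root: Path) -> Iterable[Tuple[str, str]]:
    """Yield (vendor, device) ID strings for each PCI device directory."""
    for dev in sorted(sysfs_pci_root.iterdir()):
        vendor_file, device_file = dev / "vendor", dev / "device"
        if vendor_file.is_file() and device_file.is_file():
            yield (
                vendor_file.read_text().strip().lower(),
                device_file.read_text().strip().lower(),
            )


def should_disable_gsp(sysfs_pci_root: Path, gsp_disabled_devices: Set[str]) -> bool:
    """True if any NVIDIA device present has an ID in the caller's set."""
    return any(
        vendor == NVIDIA_VENDOR_ID and device in gsp_disabled_devices
        for vendor, device in pci_ids(sysfs_pci_root)
    )
```

A caller would pass `Path("/sys/bus/pci/devices")` and a set of lowercase device ID strings such as `{"0x1eb8"}` (a placeholder value here). Keying on device IDs means a new instance type built on the same GPU needs no code change.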

larvacea commented 5 months ago

Updated the branch to incorporate @bcressey's suggestions. I will rebase and squash before merging, but for now I have kept a separate commit to make it slightly easier to see what changed and when.

I have not added to the list of instance families here since I have not been able to launch on the newer instance types.