Open andrescaroc opened 4 months ago
Thanks for reporting this! By any chance do you have the instance running? It seems odd that the device plugin isn't showing any output.
Yes sir, I have the instance running.
I agree; the previous time I reported the incident (slack thread), the output was different:
# journalctl -u nvidia-k8s-device-plugin
Apr 14 06:03:47 ip-192-168-114-245.eu-central-1.compute.internal systemd[1]: Dependency failed for Start NVIDIA kubernetes device plugin.
Apr 14 06:03:47 ip-192-168-114-245.eu-central-1.compute.internal systemd[1]: nvidia-k8s-device-plugin.service: Job nvidia-k8s-device-plugin.service/start failed with result 'dependency'.
But this time it is empty
@arnaldo2792 let me know if there are any steps you want me to perform to diagnose the issue.
I am investigating on this end. On the EC2 g4dn.* instance family, Bottlerocket may require manual intervention to disable GSP firmware download. This has to happen during boot, before Bottlerocket loads the nvidia kmod. I will find the relevant API to set this as a boot parameter and test the results. Here's the relevant line from nvidia-smi -q:
GSP Firmware Version : 535.183.01
This shows that the nvidia kmod downloaded firmware to the GSP during boot. The desired state is:
GSP Firmware Version : N/A
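If it helps, a quick way to check this on a running instance is to filter the nvidia-smi output (just a convenience sketch):

nvidia-smi -q | grep "GSP Firmware Version"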
The slightly better news is that we do have an issue open internally to select the "no GSP download" option on appropriate hardware, without requiring any configuration.
@larvacea I want to thank you for taking the time to investigate this strange issue. Also I am happy that you found some breadcrumbs on what the problem is. :clap:
Here's one way to set the relevant kernel parameter using apiclient:
apiclient apply <<EOF
[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware"=["0"]
[settings.boot]
reboot-to-reconcile = true
EOF
apiclient reboot
After the instance reboots, nvidia-smi -q should report N/A for GSP Firmware Version. One can use the same TOML fragment as part of instance user data. That's why the TOML includes reboot-to-reconcile: this should result in Bottlerocket rebooting automatically whenever the kernel-parameters setting changes the kernel command line.
I do not know if this is responsible for the 5% failure rate you see. I'd love to hear if this helps or not.
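One extra check after the reboot, if useful: confirm that the parameter actually landed on the kernel command line (a sketch; run from the admin container via sheltie):

grep NVreg_EnableGpuFirmware /proc/cmdline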
My understanding is that if I set the kernel parameter "nvidia.NVreg_EnableGpuFirmware"=["0"], I can be 100% sure that GSP firmware won't be downloaded, and that would be enough for my use case, where Karpenter is in charge of starting and shutting down nodes on demand (I don't have long-living nodes).
Also, my understanding is that the reboot-to-reconcile = true parameter will help someone fix a long-living node so that the firmware parameter takes effect, which is not required in my use case.
Based on that understanding, I would say that my fix would be to add the firmware parameter in the userData of the Karpenter EC2NodeClass as follows:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: random-name
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        volumeSize: 4Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        iops: 3000
        snapshotID: snap-d4758cc7f5f11
        throughput: 500
        volumeSize: 60Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*Private*'
        karpenter.sh/discovery: prod
  tags:
    nodepool: random-name
    purpose: prod
    vendor: random-name
  userData: |-
    [settings.boot.kernel-parameters]
    "nvidia.NVreg_EnableGpuFirmware"=["0"]
However, I don't know the internals of that process, and maybe my understanding is wrong and I need to use the reboot-to-reconcile setting too.
Please correct me if I am wrong.
The reboot-to-reconcile setting solves an ordering problem in Bottlerocket boot on AWS EC2 instances. We can't access user data until the network is available. If anything in user data changes the kernel command line, we need to persist the command line and reboot for the new kernel command line to have any effect. If reboot-to-reconcile is true and the desired kernel command line is different from the one that Bottlerocket booted with, we reboot. On this second boot, the kernel command line does not change, so we will not reboot (and thus will not enter a reboot loop that prevents the instance from starting).
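So if the parameter is set through user data, a fragment along these lines, the same TOML as in the apiclient example above, is a reasonable sketch of what the EC2NodeClass userData could contain:

[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware"=["0"]

[settings.boot]
reboot-to-reconcile = true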
We intend to add logic to automate this and set the desired kmod option before we load the driver. In general-purpose Linux operating systems, one could solve the problem by putting the desired configuration in /etc/modprobe.d. The driver is a loadable kmod, so modprobe will find this configuration file if it exists before the kmod is loaded. On a general-purpose Linux machine, the system administrator has access to /etc, and /etc persists across boots.
In Bottlerocket, /etc is not persisted. It is a memory-resident file system (tmpfs) built by systemd during boot. One can still place the driver configuration on the kernel command line even though no configuration file is resident in /etc; modprobe reads the command line and adds any configuration it finds to the variables it sourced from /etc/modprobe.d (or possibly a few other locations).
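To make the equivalence concrete (a sketch only; the file name is hypothetical, and as noted the file-based approach does not apply to Bottlerocket):

# On a general-purpose distro: /etc/modprobe.d/nvidia-gsp.conf (hypothetical name)
options nvidia NVreg_EnableGpuFirmware=0

# On Bottlerocket: the equivalent fragment on the kernel command line
nvidia.NVreg_EnableGpuFirmware=0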
Hope this helps.
@andrescaroc Is your karpenter solution working? We are facing similar issues with bottlerocket.
Image I'm using: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6
System Info:
What I expected to happen: 100% of the time that in EKS I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance-type, it should expose the GPU count for pods.
What actually happened: ~5% of the time that in EKS I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance-type, it didn't expose the GPU count for pods, causing pods requiring nvidia.com/gpu: 1 to not be scheduled, keeping them in Pending state waiting for a node.
How to reproduce the problem:
Note: This issue has existed for more than a year; you can see the slack thread here.
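One quick way to see whether a node exposed its GPU (a sketch; <node-name> is a placeholder):

kubectl describe node <node-name> | grep "nvidia.com/gpu"
# healthy nodes list nvidia.com/gpu under Capacity and Allocatable; affected nodes print nothing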
Current settings:

EC2NodeClass (fragment):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: random-name
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  ...

Deployment (fragment):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-gpu
spec:
  template:
    spec:
      containers:
      ...

with resources, node labels and tolerations that match the NodePool and EC2NodeClass (karpenter.sh/nodepool=random-name).
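For reference, the relevant parts of such a Deployment would look roughly like this (a sketch; the container image and toleration key are placeholders, only karpenter.sh/nodepool: random-name and nvidia.com/gpu: 1 come from the report):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deploy-gpu
  template:
    metadata:
      labels:
        app: deploy-gpu
    spec:
      nodeSelector:
        karpenter.sh/nodepool: random-name   # matches the NodePool label above
      tolerations:
        - key: nvidia.com/gpu                # placeholder taint key
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-workload                 # placeholder name and image
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          resources:
            limits:
              nvidia.com/gpu: 1              # the GPU request that stays Pending on affected nodes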
Node created: it has the node labels and tolerations but not the resources (gpu).
Inspecting the node:
Using the session manager -> admin-container -> sheltie
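For example, from the sheltie root shell one can re-run the checks discussed earlier in the thread (a sketch):

journalctl -u nvidia-k8s-device-plugin
nvidia-smi -q | grep "GSP Firmware Version"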
From the slack thread, someone suggested this: