Description of changes:
It is possible for nvidia-k8s-device-plugin and kubelet to race, causing graphics nodes to fail to expose GPUs via kubelet.
Specifically, nvidia-k8s-device-plugin starts after kubelet; however, it depends on the kubelet device-plugin management socket to be available in order to register itself.
The kubelet service does not synchronize its start of the device-plugin management socket with its systemd "notify" signal, which means that kubelet may start before the socket is ready.
If the socket is created after nvidia-k8s-device-plugin has begun watching for inotify events on it, the creation event may trigger the device plugin's restart logic (the device plugin assumes that kubelet has restarted in this case). Unfortunately, device plugin restarts seem to be somewhat flaky due to issues discussed in https://github.com/bottlerocket-os/bottlerocket/issues/4250.
This change causes the nvidia-k8s-device-plugin unit to require that kubelet.sock exists as a socket. If it does not, the unit fails to start and retries every 2 seconds until the socket is available. We perform an initial sleep because kubelet.sock usually does not exist by the time systemd first tries to start nvidia-k8s-device-plugin.
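The behavior described above can be expressed with standard systemd directives along these lines (a sketch only; the actual unit file in this change and the socket path on Bottlerocket may differ):

```ini
# Sketch of the relevant [Service] settings; the socket path is an assumption.
[Service]
# Initial delay: kubelet.sock is usually not present yet when systemd
# first starts this unit.
ExecStartPre=/usr/bin/sleep 2
# Fail the start if the socket is still missing; with Restart=on-failure
# below, systemd will retry the whole start sequence.
ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock
Restart=on-failure
RestartSec=2
```

Because a failing ExecStartPre marks the unit start as failed, Restart=on-failure plus RestartSec=2 yields the "retry every 2 seconds" behavior without any bespoke polling loop.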
Testing done:
I created a patch that forces the inotify race to always occur, which massively increased the incidence of the failure case.
After hundreds of instance launches, I have not witnessed a single instance with missing GPU resources (whereas the failure incidence is ~40% on Bottlerocket 1.25.0 with my faulty patch added).
- [x] basic node readiness tests
- [x] cycle over 1000 instance launches without triggering the bug
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.
Issue number:
Closes https://github.com/bottlerocket-os/bottlerocket/issues/4250