bottlerocket-os / bottlerocket-core-kit

A kit with core software packaged for Bottlerocket

nvidia-k8s-device-plugin: wait for kubelet.sock before starting #228

Closed cbgbt closed 5 days ago

cbgbt commented 1 week ago

Issue number:

Closes https://github.com/bottlerocket-os/bottlerocket/issues/4250

Description of changes: It is possible for nvidia-k8s-device-plugin and kubelet to race, causing graphics nodes to fail to expose GPUs via kubelet.

Specifically, nvidia-k8s-device-plugin starts after kubelet; however, it needs the kubelet device-plugin management socket to be available in order to register itself.

The kubelet service does not synchronize its start of the device-plugin management socket with its systemd "notify" signal, which means that systemd may consider kubelet started before the socket is ready.

If the socket is created after nvidia-k8s-device-plugin starts watching the socket for inotify events, it may trigger the device plugin's restart logic (the device plugin assumes that kubelet has restarted in this case).
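The ordering problem described above can be illustrated with a toy model. This is only a sketch of the reasoning, not the device plugin's actual code: the function names, the `"CREATE"` event label, and the integer "ticks" are all hypothetical and chosen for illustration.

```python
def observed_events(watch_start, socket_created):
    """Return the events a watcher would see for kubelet.sock.

    Both arguments are arbitrary time ticks (hypothetical). A watcher only
    observes events that occur after it has started watching.
    """
    return ["CREATE"] if socket_created > watch_start else []


def plugin_action(events):
    """Model of the reaction described in this PR: a socket creation seen
    while watching is interpreted as a kubelet restart."""
    return "restart" if "CREATE" in events else "keep running"


# Socket appears before the plugin starts watching: no spurious restart.
print(plugin_action(observed_events(watch_start=2, socket_created=1)))
# Socket appears after the plugin starts watching: restart path is taken.
print(plugin_action(observed_events(watch_start=1, socket_created=2)))
```

In this model the only way to avoid the restart path is to make sure the socket already exists when the watch begins, which is the ordering this change tries to enforce.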

Unfortunately, device plugin restarts seem to be somewhat flaky due to issues discussed in https://github.com/bottlerocket-os/bottlerocket/issues/4250.

This change causes the nvidia-k8s-device-plugin to require kubelet.sock to exist as a socket. The unit will fail to start, and subsequently retry every 2 seconds until the socket is available. We perform an initial sleep, because it turns out that kubelet.sock usually does not exist by the time that systemd tries to start nvidia-k8s-device-plugin.
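As a rough illustration of the check-and-retry behavior, here is a minimal Python sketch. It is not the change made in this PR, which is applied to the systemd unit; the function names, the `max_attempts` bound, and the default `initial_delay` are assumptions made for the example. Only the 2-second retry interval and the "must exist as a socket" condition come from the description above.

```python
import os
import stat
import time


def is_socket(path):
    """Return True only if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path, permission error, etc. all count as "not ready".
        return False


def wait_for_socket(path, initial_delay=0.0, retry_interval=2.0,
                    max_attempts=30, sleep=time.sleep):
    """Sleep once up front, then poll until path is a socket.

    Returns the attempt number that succeeded. Raises TimeoutError after
    max_attempts failed checks (the bound is an assumption of this sketch).
    """
    sleep(initial_delay)
    for attempt in range(1, max_attempts + 1):
        if is_socket(path):
            return attempt
        if attempt < max_attempts:
            sleep(retry_interval)
    raise TimeoutError(f"{path} did not become a socket")
```

A caller would pass the kubelet device-plugin socket path (commonly `/var/lib/kubelet/device-plugins/kubelet.sock`, which is assumed here rather than stated in this PR) and only launch the plugin once `wait_for_socket` returns. Checking for a socket specifically, rather than for any file at that path, avoids treating a stale or wrong-typed file as readiness.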

Testing done: I created this patch, which causes the inotify race to always occur and massively increases the incidence of the failure case.

After hundreds of instance launches with this change, I have not witnessed a single instance with missing GPU resources (whereas the failure incidence is ~40% on Bottlerocket 1.25.0 with my faulty patch added).

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.