Open nikoul opened 2 days ago
Thank you for the report. I will try your reproduction recipe and let you know what I see.
My first attempt to reproduce this failed (all instances had a functioning NVIDIA driver), which doesn't say very much if (as seems very likely) this is a race condition between driver load and kubelet start. I will keep looking. One mildly troubling indication is that I do not see any systemd scheduling that ensures that driver load occurs before kubelet start.
If you would like to dig deeper yourself, you might look at this on a failing node:
bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 18 17:40:27 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 18 17:40:28 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:28 [INFO] Updated modules dependencies
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:29 [INFO] Loaded kernel modules
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal systemd[1]: Finished Load Tesla kernel modules.
This is the systemd service that loads the NVIDIA driver kmod on a G5 (or similar) instance. If this happens before kubelet start on successful boots and after kubelet start on failed boots, we will have found the race condition you suspected.
I used `systemd-analyze dump` to look at the ordering constraints systemd used during boot on a g5.xlarge with the 1.25.0 AMI, and I believe we can rule out a race between the kmod load and kubelet:
bash-5.1# systemd-analyze dump kubelet.service
...
After: configured.target (origin-file)
...
bash-5.1# systemd-analyze dump configured.target
...
After: preconfigured.target (origin-file origin-default)
...
bash-5.1# systemd-analyze dump preconfigured.target
...
After: load-tesla-kernel-modules.service (origin-default)
This should ensure that the driver is present before kubelet starts, so at least for now I would rule out a race between these two systemd services.
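If we did want a belt-and-braces mitigation while debugging, an explicit ordering drop-in for kubelet would look something like the sketch below. This is a generic systemd drop-in, assuming the unit names shown in the logs above; Bottlerocket's immutable root filesystem may make this impractical outside a custom variant, so treat it as illustration only:

```ini
# /etc/systemd/system/kubelet.service.d/10-after-tesla-kmods.conf
# Hypothetical drop-in: force kubelet to start only after the
# Tesla kmod loader has finished.
[Unit]
After=load-tesla-kernel-modules.service
Wants=load-tesla-kernel-modules.service
```

Given the `configured.target` chain shown above, this ordering should already be implied, which is why I'm ruling out this particular race.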
Any configuration details you could share would be helpful in reproducing this problem. It would be interesting to see the device plugin logging, since that's the source of the kubelet error message you provide. This is what I see on my instances:
bash-5.1# journalctl | grep nvidia-device
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372837 1558 main.go:199] Starting FS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372929 1558 main.go:206] Starting OS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373179 1558 main.go:221] Starting Plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373193 1558 main.go:278] Loading configuration.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386371 1558 main.go:303] Updating config with default resource matching patterns.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386452 1558 main.go:314]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: Running with config:
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "version": "v1",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "flags": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "migStrategy": "none",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "failOnInitError": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "mpsRoot": "",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaDriverRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaDevRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "gdsEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "mofedEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "useNodeFeatureAPI": null,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceDiscoveryStrategy": "auto",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "plugin": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "passDeviceSpecs": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceListStrategy": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "volume-mounts"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: ],
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceIDStrategy": "index",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "containerDriverRoot": "/driver-root"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "resources": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "gpus": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "pattern": "*",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "name": "nvidia.com/gpu"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: ]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "sharing": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "timeSlicing": {}
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386468 1558 main.go:317] Retrieving plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.541469 1558 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.546032 1558 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.553200 1558 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
There is one possibly-relevant change in 1.25.0: we added support for GPU time-slicing (see https://github.com/bottlerocket-os/bottlerocket/pull/4230). Normally I would expect the `nvidia-device-plugin` process to run until node shutdown. If the plugin crashes or shuts down for any reason, kubelet will see an unexpected EOF on the plugin socket and print the "ListAndWatch ended unexpectedly" error message you quoted. If you can, please share the `nvidia-device-plugin` logging from one of your failed instances.
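Something like the following should capture both the plugin's own logging and kubelet's side of the conversation in one pass (the grep pattern is just a suggestion):

```shell
# Pull the device plugin's log lines plus kubelet's ListAndWatch
# errors from the journal, with precise timestamps for ordering.
journalctl --no-pager -o short-precise \
  | grep -E 'nvidia-device-plugin|ListAndWatch'
```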
Hello again, and thanks for investigating this! I've retrieved some logs from a faulty node, here they are:
bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 17 12:05:19 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Updated modules dependencies
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Loaded kernel modules
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal systemd[1]: Finished Load Tesla kernel modules.
bash-5.1# journalctl -u kubelet
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal systemd[1]: Starting Kubelet...
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1639]: unpacking localhost/kubernetes/pause:0.1.0 (sha256:e287062022dab2bcd00be3aa5b65a10318c93c3426cc7b72126a2108d28c47aa)...done
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1648]: io.cri-containerd.image=managed,io.cri-containerd.pinned=pinned
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.004911 1655 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012604 1655 server.go:484] "Kubelet version" kubeletVersion="v1.30.1-eks-e564799"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012638 1655 server.go:486] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.030358 1655 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062024 1655 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062066 1655 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"NodeName":"ip-xx-xx-x-xx.ec2.internal","RuntimeCgroupsName":"/runtime.slice/containerd.service","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":true,"KubeReservedCgroupName":"/runtime","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":{"cpu":"80m","ephemeral-storage":"1Gi","memory":"893Mi"},"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.1},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.15},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":1048576,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063928 1655 topology_manager.go:138] "Creating topology manager with none policy"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063947 1655 container_manager_linux.go:301] "Creating device plugin manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.066091 1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069168 1655 kubelet.go:400] "Attempting to sync node with API server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069195 1655 kubelet.go:301] "Adding static pod path" path="/etc/kubernetes/static-pods/"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069248 1655 kubelet.go:312] "Adding apiserver pod source"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.070036 1655 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.073362 1655 kuberuntime_manager.go:261] "Container runtime initialized" containerRuntime="containerd" version="1.7.22+bottlerocket" apiVersion="v1"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.079193 1655 kubelet.go:815] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: W1017 12:05:33.081030 1655 probe.go:272] Flexvolume plugin directory at /var/lib/kubelet/plugins/volume/exec does not exist. Recreating.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.082417 1655 server.go:1264] "Started kubelet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal systemd[1]: Started Kubelet.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087003 1655 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087388 1655 server.go:227] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087484 1655 server.go:163] "Starting to listen" address="0.0.0.0" port=10250
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.088330 1655 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.089641 1655 server.go:455] "Adding debug handlers to kubelet server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.090462 1655 kubelet.go:1467] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090469 1655 volume_manager.go:291] "Starting Kubelet Volume Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090500 1655 desired_state_of_world_populator.go:149] "Desired state populator starts to run"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090824 1655 reconciler.go:26] "Reconciler: start to sync state"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093738 1655 factory.go:221] Registration of the containerd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093765 1655 factory.go:221] Registration of the systemd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093862 1655 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110556 1655 cpu_manager.go:214] "Starting CPU manager" policy="none"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110585 1655 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110606 1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.136374 1655 policy_none.go:49] "None policy: Start"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137607 1655 memory_manager.go:170] "Starting memorymanager" policy="None"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137642 1655 state_mem.go:35] "Initializing new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.138216 1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144592 1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144778 1655 status_manager.go:217] "Starting to sync pod status with apiserver"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144850 1655 kubelet.go:2337] "Starting kubelet main sync loop"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.144903 1655 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.191233 1655 kubelet_node_status.go:73] "Attempting to register node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.246414 1655 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250055 1655 manager.go:479] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250238 1655 container_log_manager.go:186] "Initializing container log rotate workers" workers=1 monitorPeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347 1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.254400 1655 eviction_manager.go:282] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-xx-xx-x-xx.ec2.internal\" not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.734727 1655 kubelet_node_status.go:76] "Successfully registered node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:34 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:34.070697 1655 apiserver.go:52] "Watching apiserver"
bash-5.1# journalctl | grep nvidia-device
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706 1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.231641 1667 main.go:206] Starting OS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232173 1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232186 1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236537 1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236586 1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236597 1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.299855 1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491 1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501 1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507 1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313088 1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313097 1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313542 1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313587 1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313597 1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.323855 1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I'm keeping that node around in case you have other ideas of logs that could help. Thanks again!
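One more thing that might be worth grabbing while the node is still around: the timestamps on kubelet's device-plugin sockets, since the plugin log above shows an inotify-triggered restart when kubelet.sock was recreated. A quick sketch, using the paths from the logs:

```shell
# Show modification times for kubelet's device-plugin sockets.
# A kubelet.sock newer than nvidia-gpu.sock would line up with the
# "inotify: ... kubelet.sock created, restarting" line in the plugin log.
stat -c '%y %n' /var/lib/kubelet/device-plugins/*.sock
```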
Description
We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (`g5.xlarge`) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The affected image is `bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.25.0-388e1050`.
Issue
When the driver fails to initialize, the node does not advertise its `nvidia.com/gpu` capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting. The kubelet log on the admin container of an affected node contains the following error:
Additional Information
The node remains in an `Unknown` state indefinitely and shows the following condition message:
Expected Behavior
The NVIDIA driver should initialize correctly on all GPU nodes, and the node should advertise its GPU capacity to ensure proper scheduling.
Steps to Reproduce
Launch `g5.xlarge` instances in a Kubernetes 1.30 cluster.
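For anyone else triaging this, a quick way to spot affected nodes is to list advertised GPU capacity per node. This sketch assumes the standard `node.kubernetes.io/instance-type` well-known label; adjust the selector if your nodes are labeled differently:

```shell
# List g5.xlarge nodes with their advertised GPU capacity; an
# affected node shows <none> instead of the expected GPU count.
kubectl get nodes -l node.kubernetes.io/instance-type=g5.xlarge \
  -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
```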