bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Bottlerocket 1.25 is preventing the nvidia driver from starting #4250

Open nikoul opened 2 days ago

nikoul commented 2 days ago

Description

We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (g5.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The affected image is bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.25.0-388e1050.

Issue

When the driver fails to initialize, the node does not advertise its nvidia.com/gpu capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting. The kubelet log on the admin container of an affected node contains the following error:

  Oct 17 12:05:33 ip-xx-xx-xx-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"

Additional Information

Expected Behavior

The NVIDIA driver should initialize correctly on all GPU nodes, and the node should advertise its GPU capacity to ensure proper scheduling.

Steps to Reproduce

  1. Deploy Bottlerocket 1.25 on 6 (or more) g5.xlarge instances in a Kubernetes 1.30 cluster.
  2. Observe the node initialization process and monitor whether the NVIDIA driver is initialized correctly.
  3. Check the node’s GPU capacity and NodeClaim status (example commands below).
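
For step 3, the capacity can be checked with something like the following (the node name is a placeholder, and the NodeClaim check assumes the nodes are managed by Karpenter):

  kubectl get node ip-xx-xx-x-xx.ec2.internal -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
  kubectl describe node ip-xx-xx-x-xx.ec2.internal | grep -A 8 'Capacity:'
  kubectl get nodeclaims

On the affected nodes described above, the nvidia.com/gpu capacity is absent, while a healthy g5.xlarge should report 1.
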
larvacea commented 2 days ago

Thank you for the report. I will try your reproduction recipe and let you know what I see.

larvacea commented 2 days ago

My first attempt to reproduce this failed (all instances had a functioning NVIDIA driver), which doesn't say very much if (as seems very likely) this is a race condition between driver load and kubelet start. I will keep looking. One mildly troubling indication is that I do not see any systemd scheduling that ensures that driver load occurs before kubelet start.

If you would like to dig deeper yourself, you might look at this on a failing node:

bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 18 17:40:27 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 18 17:40:28 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:28 [INFO] Updated modules dependencies
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:29 [INFO] Loaded kernel modules
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal systemd[1]: Finished Load Tesla kernel modules.

This is the systemd service that loads the NVIDIA driver kmod on a G5 (or similar) instance. If this happens before kubelet start on successful boots and after kubelet start on failed boots, we will have found the race condition you suspected.
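
Comparing the two units' timestamps on a single boot should make the ordering obvious; something along these lines (unit names taken from the logs above):

bash-5.1# systemctl show load-tesla-kernel-modules.service -p ActiveEnterTimestamp
bash-5.1# systemctl show kubelet.service -p ExecMainStartTimestamp

If kubelet's timestamp ever precedes the module load on a failing boot, that would confirm the race.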

larvacea commented 2 days ago

I used systemd-analyze dump to look at the ordering constraints systemd applied during boot on a g5.xlarge with the 1.25.0 AMI, and I believe we can rule out a race between the kmod load and kubelet:

bash-5.1# systemd-analyze dump kubelet.service
...
        After: configured.target (origin-file)
...
bash-5.1# systemd-analyze dump configured.target
...
        After: preconfigured.target (origin-file origin-default)
...
bash-5.1# systemd-analyze dump preconfigured.target
...
        After: load-tesla-kernel-modules.service (origin-default)

This should ensure that the driver is present before kubelet starts, so at least for now I would rule out a race between these two systemd services.
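
For what it's worth, systemd-analyze critical-chain gives a more compact view of the same ordering if you want to double-check this on one of your nodes:

bash-5.1# systemd-analyze critical-chain kubelet.service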

larvacea commented 2 days ago

Any configuration details you could share would be helpful in reproducing this problem. It would be interesting to see the device plugin logging, since that's the source of the kubelet error message you provide. This is what I see on my instances:

bash-5.1# journalctl | grep nvidia-device
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372837    1558 main.go:199] Starting FS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372929    1558 main.go:206] Starting OS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373179    1558 main.go:221] Starting Plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373193    1558 main.go:278] Loading configuration.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386371    1558 main.go:303] Updating config with default resource matching patterns.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386452    1558 main.go:314]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: Running with config:
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "version": "v1",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "flags": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "migStrategy": "none",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "failOnInitError": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "mpsRoot": "",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "nvidiaDriverRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "nvidiaDevRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "gdsEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "mofedEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "useNodeFeatureAPI": null,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "deviceDiscoveryStrategy": "auto",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "plugin": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "passDeviceSpecs": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "deviceListStrategy": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "volume-mounts"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       ],
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "deviceIDStrategy": "index",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "containerDriverRoot": "/driver-root"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "resources": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "gpus": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "pattern": "*",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "name": "nvidia.com/gpu"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     ]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "sharing": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "timeSlicing": {}
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386468    1558 main.go:317] Retrieving plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.541469    1558 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.546032    1558 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.553200    1558 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

There is one possibly relevant change in 1.25.0: we added support for GPU time-slicing (see https://github.com/bottlerocket-os/bottlerocket/pull/4230). Normally I would expect the nvidia-device-plugin process to run until node shutdown. If the plugin crashes or shuts down for any reason, kubelet will see an unexpected EOF on the plugin socket and print that "ListAndWatch ended unexpectedly" error message. If you can, please share the nvidia-device-plugin logging from one of your failed instances.
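
In case it helps, this is roughly what I would capture from an affected node: the full plugin log (the same grep as above), plus whether the plugin process and its socket are still around afterwards (assuming pgrep is available in your shell; the journal lines also show the plugin's PID):

bash-5.1# journalctl | grep nvidia-device
bash-5.1# pgrep -af nvidia-device-plugin
bash-5.1# ls -l /var/lib/kubelet/device-plugins/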

nikoul commented 1 day ago

Hello again, and thanks for investigating this! I've retrieved some logs from a faulty node; here they are:

bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 17 12:05:19 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Updated modules dependencies
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Loaded kernel modules
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal systemd[1]: Finished Load Tesla kernel modules.
bash-5.1# journalctl -u kubelet
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal systemd[1]: Starting Kubelet...
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1639]: unpacking localhost/kubernetes/pause:0.1.0 (sha256:e287062022dab2bcd00be3aa5b65a10318c93c3426cc7b72126a2108d28c47aa)...done
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1648]: io.cri-containerd.image=managed,io.cri-containerd.pinned=pinned
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.004911    1655 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012604    1655 server.go:484] "Kubelet version" kubeletVersion="v1.30.1-eks-e564799"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012638    1655 server.go:486] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.030358    1655 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062024    1655 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062066    1655 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"NodeName":"ip-xx-xx-x-xx.ec2.internal","RuntimeCgroupsName":"/runtime.slice/containerd.service","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":true,"KubeReservedCgroupName":"/runtime","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":{"cpu":"80m","ephemeral-storage":"1Gi","memory":"893Mi"},"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.1},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.15},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":1048576,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063928    1655 topology_manager.go:138] "Creating topology manager with none policy"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063947    1655 container_manager_linux.go:301] "Creating device plugin manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.066091    1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069168    1655 kubelet.go:400] "Attempting to sync node with API server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069195    1655 kubelet.go:301] "Adding static pod path" path="/etc/kubernetes/static-pods/"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069248    1655 kubelet.go:312] "Adding apiserver pod source"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.070036    1655 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.073362    1655 kuberuntime_manager.go:261] "Container runtime initialized" containerRuntime="containerd" version="1.7.22+bottlerocket" apiVersion="v1"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.079193    1655 kubelet.go:815] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: W1017 12:05:33.081030    1655 probe.go:272] Flexvolume plugin directory at /var/lib/kubelet/plugins/volume/exec does not exist. Recreating.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.082417    1655 server.go:1264] "Started kubelet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal systemd[1]: Started Kubelet.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087003    1655 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087388    1655 server.go:227] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087484    1655 server.go:163] "Starting to listen" address="0.0.0.0" port=10250
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.088330    1655 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.089641    1655 server.go:455] "Adding debug handlers to kubelet server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.090462    1655 kubelet.go:1467] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090469    1655 volume_manager.go:291] "Starting Kubelet Volume Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090500    1655 desired_state_of_world_populator.go:149] "Desired state populator starts to run"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090824    1655 reconciler.go:26] "Reconciler: start to sync state"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093738    1655 factory.go:221] Registration of the containerd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093765    1655 factory.go:221] Registration of the systemd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093862    1655 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110556    1655 cpu_manager.go:214] "Starting CPU manager" policy="none"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110585    1655 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110606    1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.136374    1655 policy_none.go:49] "None policy: Start"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137607    1655 memory_manager.go:170] "Starting memorymanager" policy="None"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137642    1655 state_mem.go:35] "Initializing new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.138216    1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144592    1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144778    1655 status_manager.go:217] "Starting to sync pod status with apiserver"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144850    1655 kubelet.go:2337] "Starting kubelet main sync loop"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.144903    1655 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.191233    1655 kubelet_node_status.go:73] "Attempting to register node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.246414    1655 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250055    1655 manager.go:479] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250238    1655 container_log_manager.go:186] "Initializing container log rotate workers" workers=1 monitorPeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347    1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.254400    1655 eviction_manager.go:282] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-xx-xx-x-xx.ec2.internal\" not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.734727    1655 kubelet_node_status.go:76] "Successfully registered node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:34 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:34.070697    1655 apiserver.go:52] "Watching apiserver"
bash-5.1# journalctl | grep nvidia-device
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706    1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.231641    1667 main.go:206] Starting OS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232173    1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232186    1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236537    1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236586    1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236597    1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.299855    1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491    1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501    1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507    1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313088    1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313097    1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313542    1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313587    1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313597    1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.323855    1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

I'm keeping that node around in case you have ideas for other logs that could help. Thanks again!