Also experiencing this issue. Lots of pods get stuck in Pending or Terminating states on nodes upgraded to this version.
We're investigating. If you can identify a process that's leaking file descriptors post-update, that would be a big help. It would also help to know the variant, which is part of the AMI's name: for example, bottlerocket-aws-k8s-1.29-x86_64-v1.22.0-b6684b72 is the aws-k8s-1.29 variant, and there is a similarly named bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.22.0-b6684b72 AMI and variant for Kubernetes 1.29 with NVIDIA drivers on the x86_64 platform. Finally, please share anything non-default in ulimits, cgroups, or the like that you think might be relevant.
@larvacea,
Using a script and method shared with me, I opened a shell on the node with:
k debug node/<node name> -it --image=busybox
and ran:
for pid in /proc/[0-9]*; do
echo "$(ls -l $pid/fd 2>/dev/null | wc -l) $pid"
done | sort -nr | head -n 10
I get the output:
423476 /proc/1984
419 /proc/1915
206 /proc/62867
62 /proc/7590
52 /proc/1
50 /proc/5452
50 /proc/2164
50 /proc/15879
40 /proc/178565
39 /proc/2161
Checking /proc/1984, I see it is kubelet:
1984 root 2h46 /usr/bin/kubelet
comparing this to another node running the old AMI I see kubelet is running normally:
ps auwx | grep kubelet
1238 root 1d01 /usr/bin/kubelet
pid=1238
if [ -d /proc/$pid ]; then
echo "$(ls -l /proc/$pid/fd 2>/dev/null | wc -l) /proc/$pid"
else
echo "Process ID $pid does not exist."
fi
That gives the output: 33 /proc/1238
That's running much tamer than the kubelet process on the updated nodes.
AMI ID: ami-0bc7c40e3f8155e66
AMI NAME: bottlerocket-aws-k8s-1.28-x86_64-v1.22.0-b6684b72
The only options we include are there to work around a problem with shared PVCs. There's an issue where the option allowing multiple pods to mount a PVC volume on the same host isn't respected: the volume mounts fine for the first pod, but when a second pod mounts it, the first loses access until a restart, at which point the second then loses its access. We pass these in as bootstrap_extra_args in the Terraform module we use to deploy the cluster.
Those options are:
[settings.boot]
reboot-to-reconcile = true
[settings.boot.init-parameters]
"systemd.unified_cgroup_hierarchy" = ["0"]
I think that covers everything you asked; if I missed anything, please let me know!
That's great! Very helpful, and the fds on kubelet do look troubling.
I am also experiencing this issue in 6 clusters. I am using Karpenter with the following setting, and it is picking up the 1.22.0 release of Bottlerocket:
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
It is affecting nodes both with and without GPUs, so I don't think the issue is related to that.
This is what I get from the Events of the nodes that turn into the NotReady state:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Unconsolidatable 27m (x116 over 35h) karpenter SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
Warning ContainerGCFailed 17m kubelet failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
Normal NodeNotReady 16m (x2 over 2d9h) node-controller Node ip-192-168-145-111.eu-central-1.compute.internal status is now: NodeNotReady
Normal DisruptionBlocked 13m (x3 over 17m) karpenter Cannot disrupt Node: pdb "knative-eventing/eventing-webhook" prevents pod evictions
You can see more details of my issue here
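For anyone checking whether a NotReady node is actually exhausting file handles, a minimal sketch of a quick check from a node debug shell, assuming the same kubectl debug approach shown earlier in the thread:
k debug node/<node name> -it --image=busybox
# Allocated handles, unused handles, and the system maximum; the first field is
# what node_filefd_allocated reports.
cat /proc/sys/fs/file-nr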
Can confirm the same behavior, but using the image suited for Kubernetes 1.29 on EKS. Nodes running on the previous image don't show the issue.
I do not have any special configuration in my clusters.
FYI: if you are able to, bumping to Kubernetes 1.30 seems to fix it.
Thank you for the report that 1.30 does not exhibit this behavior. Not only is it a helpful suggestion for mitigating the problem, it helps focus our attention on the differences between the affected versions and 1.30.
@larvacea, to add my further findings:
Our cluster was using Calico, and we noted the following in the logs:
Sep 17 15:50:00 kubelet[1992]: E0917 15:50:00.037705 1992 plugins.go:723] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 17 15:50:00 kubelet[1992]: E0917 15:50:00.038780 1992 driver-call.go:262] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Sep 17 15:50:00 kubelet[1992]: W0917 15:50:00.038803 1992 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: executable file not found in $PATH, output: ""
This seems to be similar to previously reported issues.
This led us to test removing Calico from one of the clusters with surprising results. The # of allocated FDs went back to normal for the kubelet process.
(Screenshots: kubelet FD counts before and after removing Calico.)
This seems to indicate that it is possibly a bug in kubelet. I tried digging through the release notes for kubelet but nothing is jumping out at me as immediately obvious.
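For anyone who wants to make the same before/after comparison, a minimal sketch that samples kubelet's open-descriptor count over time, assuming a node shell with host PID visibility (for example the debug pod shown earlier) and using the kubelet pid found with the earlier /proc scan:
pid=1984   # kubelet pid from the /proc scan above; substitute your own
while sleep 60; do
  # count kubelet's open file descriptors once a minute
  echo "$(date +%T) $(ls /proc/$pid/fd 2>/dev/null | wc -l)"
done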
I can reproduce the problem. First impression is that there is an unfortunate interaction between FlexVolume and SELinux:
Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal kubelet[1248]: W0918 17:54:38.791964 1248 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: executable file not found in $PATH, output: ""
Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal kubelet[1248]: E0918 17:54:38.791992 1248 plugins.go:723] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal audit[9751]: AVC avc: denied { execute } for pid=9751 comm="kubelet" name="uds" dev="nvme1n1p1" ino=134460038 scontext=system_u:system_r:system_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0
This failure repeats in a tight loop, leaving an open fd behind each time, and eventually kubelet devours the node.
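To look for the same pattern on your own nodes, a minimal sketch, assuming journalctl access (for example from the Bottlerocket admin container's root shell):
# FlexVolume probe failures from kubelet
journalctl -u kubelet --no-pager | grep -iE 'flexvolume|driver-call' | tail -n 5
# SELinux denials for the plugin binary (may be slow on a large journal)
journalctl --no-pager --since "1 hour ago" | grep -E 'avc: +denied' | tail -n 5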
Thank you all for your helpful reports.
I am still working to find the root cause of the file descriptor leak.
In the meantime, I would recommend disabling Calico's FlexVolume if you are not using it. At present it will result in a tight retry loop in kubelet, as SELinux denies kubelet permission to run this plugin's uds binary. Bottlerocket's SELinux policies disallow the execution of unknown binaries.
On the other hand, if you are using or would like to use FlexVolume, do open an issue.
In the 1.22.0 release, Bottlerocket updated the version of Go used to build most Go components to Go 1.23, including all versions of the Kubernetes code prior to k8s 1.30. It appears that Go 1.23 has an issue where a failed call to os/exec will leak a file descriptor. The Calico FlexVolume feature results in a tight retry loop in kubelet that calls exec and fails very quickly. Any code that makes a failing exec call will leak a file descriptor; Calico just made the failure obvious.
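If you want to see what the accumulated descriptors on an affected node actually reference, a minimal sketch, with the kubelet pid again taken from the earlier /proc scan:
pid=1984   # substitute the kubelet pid on your node
# tally the targets of kubelet's open fds (regular files, sockets, anonymous inodes, ...)
ls -l /proc/$pid/fd 2>/dev/null | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -nr | head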
@larvacea Thank you for this info! What's the appropriate way to update this value? We've updated the tigera-operator's Helm value spec.flexVolumePath=None per the documentation here, but it doesn't seem to be updating the Installation resource or the Calico-related components.
So, once the Installation is deployed, updating the values in the Helm chart doesn't update the resources (this should be handled by the Calico controller). You need to do it manually:
kubectl patch installation <installation-name> --type=merge -p '{"spec": {"flexVolumePath": "None"}}'
(Optional) Instead of restarting all the nodes, you can restart one calico-node pod and check whether the node works properly. After that, roll out a restart of the DaemonSet:
kubectl -n <namespace> rollout restart daemonset/calico-node
Ensure the pod only has the calico-node and install-cni containers; flexvol-driver should no longer appear.
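A minimal sketch of that verification, assuming an operator-based install with the calico-node DaemonSet in the calico-system namespace (adjust the namespace and label for your setup):
kubectl -n calico-system get pods -l k8s-app=calico-node \
  -o jsonpath='{range .items[*]}{.metadata.name}{" init=["}{.spec.initContainers[*].name}{"] containers=["}{.spec.containers[*].name}{"]"}{"\n"}{end}'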
This looks to have been fixed upstream in Go.
I expect that the backported Go fix will be in the next Go 1.23.x release, as @perplexa notes. For now, I have pinned the build for all kubernetes versions from 1.23 through 1.29 to Go 1.22.x (1.30 was already pinned to 1.22.x) and verified that this mitigates the problem. I expect this will be part of the next bottlerocket-core-kit release, which should then be part of the next bottlerocket release.
It's worth mentioning again that the Calico FlexVolume plugin is spinning in an error-retry loop in kubelet when enabled, and so one might wish to disable it (if unused) to save a bit of CPU time and journald thrashing.
Thanks for all your helpful input! It made root-causing and fixing much easier than I had first feared.
Hi @larvacea, amazing findings! Thanks a lot! Just curious, would you mind sharing your troubleshooting steps and how you navigated to the root cause? I think this is a great opportunity to learn troubleshooting skills and methods. All I can see is that the kubelet process opens so many files on Bottlerocket OS 1.22 but not on the prior version, even with the same Calico configs and dmesg errors. It must be something related to OS 1.22, but I just can't figure out what that 'related thing' is.
Once I heard that Calico would help reproduce this behavior, the first step was to create a tiny cluster with Calico, add a node group with an affected AMI, and launch a do-nothing cron task that just spins up as many pods as possible. This immediately showed rapid growth in open fds, and examining /proc/*/fd showed that kubelet was the process responsible.
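A minimal sketch of such a do-nothing cron task, in case anyone wants to reproduce the same pod churn (the name, image, and schedule here are arbitrary choices):
# creates a short-lived busybox pod every minute; create several of these jobs to increase churn
kubectl create cronjob fd-churn --image=busybox --schedule='* * * * *' -- /bin/true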
There were two changes that seemed possible sources of this behavior. We updated the cni package, and we changed the version of Go used to build the kubernetes packages. Bisection showed that cni was blameless, and Go 1.23.1 was the culprit.
Looking at journalctl -u kubelet revealed the error-retry loop. There are two Bottlerocket features that can trip up software written for general-purpose Linux distros: our immutable root filesystem and SELinux permissions. SELinux is visible in journalctl, so I looked there first and saw the denied messages from each attempt to launch the plugin.
Given all this, it was easy to find the known issue in Go.
@larvacea, awesome! Thanks so much for persisting in finding the cause here!
Do you happen to know when a fixed version might be released? We're holding off on any further upgrades or rollouts for our clusters, as the majority of them use Bottlerocket OS, and if we upgraded more with this issue still present we'd have to swap back to Linux AMIs to avoid it.
I am sorry, but by policy I can't comment on future dates. The fix has been released in the bottlerocket-core-kit repository. You can get a preview of the next operating system release by looking there.
Updating to k8s 1.30 fixed the issue for us; node_filefd_allocated is rather stable.
Bottlerocket v1.23.0 has been released: https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.23.0
This OS release includes the fix for this issue from Bottlerocket Core-kit v2.7.0: https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/155
Image I'm using: 1.22.0-b6684b72
What I expected to happen:
Apps should function the same as when using AMI image 1.21.1-82691b51.
What actually happened:
After the AMI update, we saw that the metric node_filefd_allocated increased dramatically; this was the only change made within our clusters at the time. With the maximum being hit, the nodes go into a NotReady status, applications are left unable to run, and the cluster seems to have trouble self-healing when this problem occurs.
Initial issue metrics:
My test on another cluster to verify the change comes with the latest Bottlerocket OS AMI 1.22.0-b6684b72:
I tested this by updating the AMIs on another one of our nodes that had not been updated, and the problem started occurring. The workloads between the two clusters are vastly different, so it is clear this is not caused by one of the applications running on the clusters where we observed this behavior.
How to reproduce the problem:
Update to the latest Bottlerocket OS AMI image and check the values from the node exporter for the metric node_filefd_allocated.
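A minimal sketch of that check, assuming the Prometheus node exporter is listening on its default port 9100:
# node_filefd_allocated and node_filefd_maximum for the node
curl -s http://<node-ip>:9100/metrics | grep '^node_filefd'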
How did I work around it:
I had to swap to using Linux AMIs instead of Bottlerocket AMIs to mitigate the issue on our production cluster, as I saw no other way to solve this given it appears to be introduced in the latest Bottlerocket AMI.