bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Hitting limit on open File Descriptors after update to 1.22.0-b6684b72 #4201

Closed · sebas-w closed this issue 1 month ago

sebas-w commented 1 month ago

Image I'm using: 1.22.0-b6684b72

What I expected to happen:

Apps should function the same as when using the AMI image 1.21.1-82691b51.

What actually happened: After the AMI update, the metric node_filefd_allocated increased dramatically. This was the only change made within our clusters at the time.

Once the limit is hit, the nodes go into a NotReady status and applications are left unable to run; the nodes also seem to have trouble self-healing when this happens.

Initial issue metrics: (screenshot)

My test on another cluster, verifying that the change comes with the latest Bottlerocket OS AMI 1.22.0-b6684b72: (screenshot)

I tested this by updating the AMIs on the nodes of another one of our clusters that had not yet been updated, and the problem started occurring. The workloads on the two clusters are vastly different, so it is clear this is not caused by one of the applications running on the clusters where we observed this behavior.

How to reproduce the problem:

Update to the latest Bottlerocket OS AMI and check the value of the node-exporter metric node_filefd_allocated.
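
If you don't have node-exporter handy, you can check the same counter directly on the node; node_filefd_allocated is derived from the first field of /proc/sys/fs/file-nr. A minimal check from a privileged debug pod or the admin container:

# allocated handles, unused handles, and the system-wide maximum
cat /proc/sys/fs/file-nr
# kernel limit, for comparison
cat /proc/sys/fs/file-max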

How did I work around it:

I had to swap to Linux AMIs instead of Bottlerocket AMIs to mitigate the issue on our production cluster, seeing no other way to solve this since the problem appears to have been introduced in the latest Bottlerocket AMI.
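
Pinning node groups to the previous Bottlerocket release would be another possible mitigation. A sketch, assuming the SSM parameter path Bottlerocket publishes for versioned AMIs:

aws ssm get-parameter \
  --name "/aws/service/bottlerocket/aws-k8s-1.28/x86_64/1.21.1/image_id" \
  --query Parameter.Value --output text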

pauloeliasjr commented 1 month ago

Also experiencing this issue. Lots of pods get stuck in Pending or Terminating states on nodes upgraded to this version.

larvacea commented 1 month ago

We're investigating. If you can identify a process that's leaking file descriptors post-update, that would be a big help. It would also help to know the variant, which is part of the AMI's name: e.g. bottlerocket-aws-k8s-1.29-x86_64-v1.22.0-b6684b72 is the aws-k8s-1.29 variant, and there is a similarly-named bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.22.0-b6684b72 AMI and variant for Kubernetes 1.29 with NVIDIA drivers on the x86_64 platform. Finally, please mention anything non-default in ulimits, cgroups, or the like that you think might be relevant.
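
Two quick ways to identify the version and variant (a sketch; apiclient's /os endpoint is assumed to report them, and <ami-id> is a placeholder for your node's AMI ID):

# from a workstation, look up the AMI name for a node's AMI ID
aws ec2 describe-images --image-ids <ami-id> --query 'Images[0].Name' --output text
# or, from the node's control/admin container
apiclient get os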

sebas-w commented 1 month ago

@larvacea,

Using a script and method shared with me, I accessed the node with the command:

k debug node/<node name> -it --image=busybox

and ran the following in the debug shell:

for pid in /proc/[0-9]*; do
  echo "$(ls -l $pid/fd 2>/dev/null | wc -l) $pid"
done | sort -nr | head -n 10

I get the output:

423476 /proc/1984
419 /proc/1915
206 /proc/62867
62 /proc/7590
52 /proc/1
50 /proc/5452
50 /proc/2164
50 /proc/15879
40 /proc/178565
39 /proc/2161

Checking /proc/1984, I see it is: 1984 root 2h46 /usr/bin/kubelet
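
As an aside, a small variant of the script above (a sketch) prints the process name from /proc/<pid>/comm next to each count, so the PIDs don't have to be looked up one by one:

for pid in /proc/[0-9]*; do
  echo "$(ls "$pid"/fd 2>/dev/null | wc -l) $pid $(cat "$pid"/comm 2>/dev/null)"
done | sort -nr | head -n 10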

Comparing this to another node running the old AMI, I see kubelet is behaving normally:

ps auwx | grep kubelet

1238 root      1d01 /usr/bin/kubelet
pid=1238
if [ -d /proc/$pid ]; then
  echo "$(ls -l /proc/$pid/fd 2>/dev/null | wc -l) /proc/$pid"
else
  echo "Process ID $pid does not exist."
fi

That gives the output: 33 /proc/1238

Running much more tamely than the kubelet process on the updated nodes.

AMI ID: ami-0bc7c40e3f8155e66 AMI NAME: bottlerocket-aws-k8s-1.28-x86_64-v1.22.0-b6684b72

The only options we include are there to work around a problem with shared PVCs. There's an issue where it won't respect the option allowing multiple pods to mount a PVC volume on the same host: the first pod mounts it fine, but when a second pod mounts it, the first loses access until a restart, at which point the second then loses its access. We pass these in as bootstrap_extra_args in the Terraform module we use to deploy the cluster.

Those options are:

[settings.boot]
reboot-to-reconcile = true

[settings.boot.init-parameters]
"systemd.unified_cgroup_hierarchy" = ["0"]

I think that covers everything you asked; if I missed anything, please let me know!

larvacea commented 1 month ago

That's great! Very helpful and the fds on kubelet do look troubling.

andrescaroc commented 1 month ago

I am also experiencing this issue in 6 clusters. I am using Karpenter with the following setting, and it is picking up the 1.22 version of Bottlerocket:

spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest

It is affecting nodes both with and without GPUs, so I don't think the issue is related to that.

This is what I get from the Events of the nodes that go into the NotReady state:

Events:
  Type     Reason             Age                  From             Message
  ----     ------             ----                 ----             -------
  Normal   Unconsolidatable   27m (x116 over 35h)  karpenter        SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
  Warning  ContainerGCFailed  17m                  kubelet          failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
  Normal   NodeNotReady       16m (x2 over 2d9h)   node-controller  Node ip-192-168-145-111.eu-central-1.compute.internal status is now: NodeNotReady
  Normal   DisruptionBlocked  13m (x3 over 17m)    karpenter        Cannot disrupt Node: pdb "knative-eventing/eventing-webhook" prevents pod evictions

You can see more details of my issue here

pauloeliasjr commented 1 month ago

Can confirm the same behavior, but using the image suited for Kubernetes 1.29 on EKS. Nodes running on the previous image don't show the issue.

I do not have any special configuration in my clusters.

liamawhite commented 1 month ago

FYI: if you are able to, bumping to Kubernetes 1.30 seems to fix it.

(screenshot, 2024-09-18)

larvacea commented 1 month ago

Thank you for the report that 1.30 does not exhibit this behavior. Not only is it a helpful suggestion for mitigating the problem, it helps focus our attention on the differences between the affected versions and 1.30.

sebas-w commented 1 month ago

@larvacea, to add my further findings:

Our cluster was using Calico, and we noted the following in the kubelet logs:

Sep 17 15:50:00 kubelet[1992]: E0917 15:50:00.037705    1992 plugins.go:723] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 17 15:50:00 kubelet[1992]: E0917 15:50:00.038780    1992 driver-call.go:262] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Sep 17 15:50:00 kubelet[1992]: W0917 15:50:00.038803    1992 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: executable file not found in $PATH, output: ""

This seems to be similar to the issues:

This led us to test removing Calico from one of the clusters, with surprising results: the number of allocated FDs went back to normal for the kubelet process.

Before: (screenshot)

After: (screenshot)

This seems to indicate that it is possibly a bug in kubelet. I tried digging through the release notes for kubelet but nothing is jumping out at me as immediately obvious.

larvacea commented 1 month ago

I can reproduce the problem. First impression is that there is an unfortunate interaction between FlexVolume and SELinux:

Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal kubelet[1248]: W0918 17:54:38.791964    1248 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: executable file not found in $PATH, output: ""
Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal kubelet[1248]: E0918 17:54:38.791992    1248 plugins.go:723] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 18 17:54:38 ip-192-168-34-103.us-west-1.compute.internal audit[9751]: AVC avc:  denied  { execute } for  pid=9751 comm="kubelet" name="uds" dev="nvme1n1p1" ino=134460038 scontext=system_u:system_r:system_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0

This failure repeats in a tight loop, leaving an open fd behind each time, and eventually kubelet devours the node.
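
A quick way to see whether a node is affected (a sketch, run from a privileged debug pod or the admin container with access to the host's /proc; the first pgrep match is assumed to be the kubelet):

kubelet_pid=$(pgrep kubelet | head -n 1)
while true; do
  echo "$(date) $(ls /proc/"$kubelet_pid"/fd | wc -l) open fds"
  sleep 10
done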

Thank you all for your helpful reports.

larvacea commented 1 month ago

I am still working to find the root cause of the file descriptor leak.

In the meantime, I would recommend disabling Calico's FlexVolume if you are not using it. At present it results in a tight retry loop in kubelet, as SELinux denies kubelet permission to run this plugin's uds binary. Bottlerocket's SELinux policies disallow the execution of unknown binaries.

On the other hand, if you are using or would like to use FlexVolume, do open an issue.
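
To check whether an operator-managed install still has FlexVolume enabled, something like this should work (a sketch; it assumes the tigera-operator's Installation resource has the usual name, default):

kubectl get installation default -o jsonpath='{.spec.flexVolumePath}{"\n"}'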

larvacea commented 1 month ago

In the 1.22.0 release Bottlerocket updated the version of Go used to build most Go components to Go 1.23, including the kubernetes packages for all versions prior to k8s 1.30. It appears that Go 1.23 has an issue where a failed call to os/exec will leak a file descriptor. The Calico FlexVolume feature results in a tight retry loop in kubelet that calls exec and fails very quickly. Any code whose exec call fails will leak a file descriptor; Calico just made the failure obvious.

ianstanton commented 1 month ago

> I am still working to find the root cause of the file descriptor leak.
>
> In the meantime, I would recommend disabling Calico's FlexVolume if you are not using it. At present it results in a tight retry loop in kubelet, as SELinux denies kubelet permission to run this plugin's uds binary. Bottlerocket's SELinux policies disallow the execution of unknown binaries.
>
> On the other hand, if you are using or would like to use FlexVolume, do open an issue.

@larvacea Thank you for this info! What's the appropriate way to update this value? We've updated the tigera-operator's helm value spec.flexVolumePath=None per the documentation here, but it doesn't seem to be updating the Installation resource or the Calico related components.

ialejandro commented 1 month ago

> I am still working to find the root cause of the file descriptor leak. In the meantime, I would recommend disabling Calico's FlexVolume if you are not using it. At present it results in a tight retry loop in kubelet, as SELinux denies kubelet permission to run this plugin's uds binary. Bottlerocket's SELinux policies disallow the execution of unknown binaries. On the other hand, if you are using or would like to use FlexVolume, do open an issue.

> @larvacea Thank you for this info! What's the appropriate way to update this value? We've updated the tigera-operator's helm value spec.flexVolumePath=None per the documentation here, but it doesn't seem to be updating the Installation resource or the Calico-related components.

Once the Installation resource has been deployed, updating the Helm chart values doesn't update it (that should be handled by the Calico controller), so you need to do it manually:

  1. Patch the Installation resource:

kubectl patch installation <installation-name> --type=merge -p '{"spec": {"flexVolumePath": "None"}}'

  2. Restart all calico-node pods.

(Optional) Instead of restarting them all at once, you can restart a single calico-node pod and check that its node works properly. After that, roll out the DaemonSet:

kubectl -n <namespace> rollout restart daemonset/calico-node

Ensure the pods only have the calico-node and install-cni containers; the flexvol-driver container should no longer appear.
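
One way to verify is to list the DaemonSet's init containers and containers, e.g. (a sketch; the namespace is typically calico-system for operator-managed installs or kube-system for manifest installs):

kubectl -n calico-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.initContainers[*].name}{"\n"}{.spec.template.spec.containers[*].name}{"\n"}'

The flexvol-driver entry should be gone from the output.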

perplexa commented 1 month ago

> In the 1.22.0 release Bottlerocket updated the version of Go used to build most Go components to Go 1.23, including the kubernetes packages for all versions prior to k8s 1.30. It appears that Go 1.23 has an issue where a failed call to os/exec will leak a file descriptor. The Calico FlexVolume feature results in a tight retry loop in kubelet that calls exec and fails very quickly. Any code whose exec call fails will leak a file descriptor; Calico just made the failure obvious.

https://github.com/golang/go/commit/c8c6f9abfbbd2f949285a527036a3d0dbd991e74#diff-1587342f077ea1dbe9673e4847da2423919de3899715421374649ab8004cef43

This looks to have been fixed upstream in Go.

larvacea commented 1 month ago

I expect that the backported Go fix will be in the next Go 1.23.x release, as @perplexa notes. For now, I have pinned the build for all kubernetes versions from 1.23 through 1.29 to Go 1.22.x (1.30 was already pinned to 1.22.x) and verified that this mitigates the problem. I expect this will be part of the next bottlerocket-core-kit release, which should then be part of the next bottlerocket release.

It's worth mentioning again that the Calico FlexVolume plugin is spinning in an error-retry loop in kubelet when enabled, and so one might wish to disable it (if unused) to save a bit of CPU time and journald thrashing.

Thanks for all your helpful input! It made root-causing and fixing much easier than I had first feared.

joey100 commented 1 month ago

> In the 1.22.0 release Bottlerocket updated the version of Go used to build most Go components to Go 1.23, including the kubernetes packages for all versions prior to k8s 1.30. It appears that Go 1.23 has an issue where a failed call to os/exec will leak a file descriptor. The Calico FlexVolume feature results in a tight retry loop in kubelet that calls exec and fails very quickly. Any code whose exec call fails will leak a file descriptor; Calico just made the failure obvious.

Hi @larvacea, amazing findings! Thanks a lot! Just curious, would you mind sharing the troubleshooting steps and how you navigated to the root cause? I think this is a great opportunity to learn the troubleshooting skills and methods. I can only see that the kubelet process opens so many files on Bottlerocket OS 1.22 but not on the prior version, even with the same Calico configs and dmesg errors. It must be something related to OS 1.22, but I just can't figure out the 'related thing'.

larvacea commented 1 month ago

Once I heard that Calico would help reproduce this behavior, the first step was to create a tiny cluster with Calico, add a node group with an affected AMI, and launch a do-nothing cron task to just launch as many pods as possible.

This immediately showed rapid growth in open fds, and examining /proc/*/fd showed that kubelet was the process responsible.

There were two changes that seemed possible sources of this behavior. We updated the cni package, and we changed the version of Go used to build the kubernetes packages. Bisection showed that cni was blameless, and Go 1.23.1 was the culprit.

Looking at journalctl -u kubelet revealed the error-retry loop. There are two Bottlerocket features that can trip up software written for general-purpose Linux distros: our immutable root filesystem and SELinux permissions. SELinux denials are visible in journalctl, so I looked there first and saw the denied messages from each attempt to launch the plugin.
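
For anyone retracing those steps, something along these lines should surface the same evidence (a sketch; run wherever you have journal access, and expect the exact log text to vary):

# the FlexVolume error-retry loop in kubelet
journalctl -u kubelet | grep -i flexvolume | tail -n 20
# SELinux denials for the plugin binary
journalctl | grep -i 'avc.*denied' | tail -n 20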

Given all this, it was easy to find the known issue in Go.

sebas-w commented 1 month ago

@larvacea, awesome! Thanks so much for persisting to find the cause here!

Do you happen to know when a fixed version might be released? We're holding off on any further upgrades or rollouts for our clusters, as the majority of them use Bottlerocket OS, and if we upgraded more of them with this issue still present we'd have to swap back to Linux AMIs to avoid it.

larvacea commented 1 month ago

I am sorry, but by policy I can't comment on future dates. The fix has been released in the bottlerocket-core-kit repository. You can get a preview of the next operating system release by looking there.

chicofranchico commented 1 month ago

Updating to k8s 1.30 fixed the issue for us; node_filefd_allocated is stable again.

koooosh commented 1 month ago

Bottlerocket v1.23.0 has been released: https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.23.0

This OS release includes the fix for this issue from Bottlerocket Core-kit v2.7.0: https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/155