amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Timeouts in execution in forked commands #693

Open wonko opened 2 months ago

wonko commented 2 months ago

Describe the bug We had issues running haproxy with a check-command configured in a container on EKS, using AL2023-EKS as our image for the worker nodes.

The result was a timeout on the check command. We believe this is only a single manifestation of the issue; we suspect it is present in other processes as well. It is unclear what exactly goes wrong, but it feels like the process can't fork, or that once forked, the child can't run. We suspected ulimit, but those limits all seemed correct and sufficiently high. strace pointed to a high number of polling calls, but it was unclear why that would happen. dmesg didn't have any pointers.

To Reproduce Steps to reproduce the behavior:

  1. Deploy EKS 1.29 cluster
  2. Deploy worker nodes with AL2023-EKS-1.29-20240415
  3. Using helm, install percona operator and the database for mysql/galera: https://docs.percona.com/percona-operator-for-mysql/pxc/helm.html
  4. Observe the behaviour inside the haproxy containers:
    • the proxy layer never becomes ready; it keeps failing
    • high CPU load
    • reports that the external check command timed out

We tried changing the external check command to a simple /bin/true, but that didn't help, which ruled out a problem with the specific check script.
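For anyone trying to narrow this down, the behaviour can be approximated outside haproxy with a standalone probe. A minimal sketch (the 2 s timeout and 1000 iterations are illustrative assumptions, not haproxy's actual check parameters) that repeatedly fork+execs the same /bin/true and flags any invocation exceeding the timeout:

```python
#!/usr/bin/env python3
# Minimal sketch of a fork/exec latency probe: repeatedly spawn the same
# /bin/true we substituted for the external check, and flag any run that
# exceeds the (assumed, illustrative) timeout.
import subprocess
import time

TIMEOUT_S = 2.0     # illustrative timeout, not haproxy's actual check timeout
ITERATIONS = 1000

timeouts = 0
worst = 0.0
for i in range(ITERATIONS):
    start = time.monotonic()
    try:
        subprocess.run(["/bin/true"], timeout=TIMEOUT_S, check=True)
    except subprocess.TimeoutExpired:
        timeouts += 1
        print(f"iteration {i}: timed out after {TIMEOUT_S}s")
    elapsed = time.monotonic() - start
    worst = max(worst, elapsed)

print(f"worst fork+exec latency: {worst:.3f}s, timeouts: {timeouts}/{ITERATIONS}")
```

On an unaffected node this finishes almost instantly; on an affected node the expectation is that some iterations take abnormally long or trip the timeout.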

When switching to Bottlerocket or AL2 and redeploying exactly the same setup, everything works. We verified by switching back, and the problems immediately returned. Various machine types have been tried, same issue. Various versions of the operator/database helm chart (using different images), same result.

Expected behavior A stable environment, all systems online and working, with no restarts, as is the case with Bottlerocket or AL2.

code-eg commented 2 weeks ago

I am unsure if this is related, but I have experienced odd timeouts with EKS and AL2023 nodes. At high load, we see networking timeouts just trying to access the kubernetes service from pods, as well as when trying to get logs from pods. Reverting to AL2 resolved these issues completely for us.

I brought this to AWS Support for EKS and they haven't been able to find anything of note.

opeologist commented 1 week ago

Any updates here? Having to revert to AL2 isn't a viable long-term solution.

orsenthil commented 6 days ago

Can you specify the Node AMI version that displayed this behavior?

wonko commented 5 days ago

All I have is "AL2023-EKS-1.29-20240415".

orsenthil commented 3 days ago

@wonko - I launched a curl container inside a pod that repeatedly (more than 100 times) communicated with the API server. We didn't see any connectivity issues on AL2023 with AMI 1.29.3-20240625.
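For reference, the probe amounts to something like the following sketch run from inside a pod (it assumes the standard in-cluster service-account token and CA paths and the kubernetes.default.svc address; the 100-request count mirrors the curl test above):

```python
#!/usr/bin/env python3
# Minimal sketch: repeatedly hit the Kubernetes API server from inside
# a pod and report slow or failed requests. Paths and address below are
# the standard in-cluster defaults; adjust if your cluster differs.
import ssl
import time
import urllib.request

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
URL = "https://kubernetes.default.svc/version"

with open(TOKEN_PATH) as f:
    token = f.read().strip()

ctx = ssl.create_default_context(cafile=CA_PATH)
failures = 0
for i in range(100):
    req = urllib.request.Request(URL, headers={"Authorization": f"Bearer {token}"})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=5, context=ctx) as resp:
            resp.read()
    except Exception as exc:
        failures += 1
        print(f"request {i}: failed after {time.monotonic() - start:.2f}s: {exc}")
    else:
        elapsed = time.monotonic() - start
        if elapsed > 1.0:
            print(f"request {i}: slow ({elapsed:.2f}s)")

print(f"done: {failures} failures out of 100")
```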

Perhaps the kind of client is playing a part here?