mmerkes closed this issue 10 months ago.
We have hit this issue too; we have roughly 1,700 pods crashlooping in each cluster. I wonder if the CI doesn't test with a large enough workload?
We have already reverted the change that caused this issue (#1535), we're rolling back the v20231220 release, and we're preparing to release new AMIs without the change ASAP. More guidance to come.

EDIT: We're not rolling back v20231220. We're focusing on rolling forward the next release with the change reverted.
This helped us restore our pods on new nodes; we're using Karpenter:
```yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
...
spec:
  ...
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    rm -rf /etc/systemd/system/containerd.service.d/20-limitnofile.conf

    --BOUNDARY--
```
and then drain all of the new (affected) nodes from the cluster so they get replaced.
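If it helps, draining can be scripted roughly like this (a sketch: the `karpenter.sh/provisioner-name` node label is an assumption based on the v1alpha5-era API shown above; adjust the selector to however your nodes are labeled):

```bash
#!/bin/bash
# Cordon and drain the Karpenter-provisioned nodes so replacements come up with
# the userData fix above. The label selector is an assumption; adjust as needed.
for node in $(kubectl get nodes -l karpenter.sh/provisioner-name \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```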
@mmerkes Can you please update us when the AMI is ready for usage?
☝️ Adding to that, an ETA would be much appreciated as well. Is it on the order of hours or days?
I'm using this setup for now in the Karpenter userData, bumping the soft limit from 1024 to 102400.

Adding this to our bootstrap for now to raise the soft limit to 102400:

```
- /usr/bin/sed -i 's/^LimitNOFILE.*$/LimitNOFILE=102400:524288/' /etc/systemd/system/containerd.service.d/20-limitnofile.conf || true
```
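In case it's useful, here's a rough sketch of confirming that the override actually took effect on a node (nothing here is EKS-specific; `/proc/<pid>/limits` shows the limits the running daemon ended up with):

```bash
#!/bin/bash
# After editing /etc/systemd/system/containerd.service.d/20-limitnofile.conf,
# reload systemd, restart containerd, and confirm the limits it is running with.
systemctl daemon-reload
systemctl restart containerd
grep 'Max open files' "/proc/$(pidof containerd)/limits"
# New containers should pick up the new limits; containers that were already
# running generally keep the old values until they are recreated.
```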
If anyone needs it, we fixed it in Karpenter by hardcoding the older AMI in the AWSNodeTemplate CRD:

```yaml
spec:
  amiSelector:
    aws::ids: <OLD_AMI_ID>
```
A Go runtime change in 1.19 automatically maxes out the process's NOFILE soft limit, so I would expect to see this problem with Go binaries built with earlier versions: https://github.com/golang/go/issues/46279
Has anyone run into this problem with a workload that isn’t a go program?
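One way to check whether a given workload raises its own soft limit (rather than guessing by language or runtime) is to inspect the live process; a sketch, assuming you can get a shell on the node, and using `envoy` purely as an example process name:

```bash
#!/bin/bash
# Compare the NOFILE limits a running process actually has with what systemd
# configured for containerd. A Go 1.19+ binary will typically show its soft
# limit already raised to the hard limit; other runtimes may not.
pid=$(pgrep -o envoy)   # "envoy" is just an example process name
grep 'Max open files' "/proc/${pid}/limits"
```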
an ETA would be much appreciated as well. Is it on the order of hours or days?
We are working on releasing a new set of AMIs ASAP. I will post another update in 3-5 hours on the status. We should have a better idea then.
Has anyone run into this problem with a workload that isn’t a go program?
People have mentioned running into this problem on envoy proxy, which is a C++ program.
People have mentioned running into this problem on envoy proxy
Yes, I've been looking into that. Envoy doesn't seem to bump its own soft limit, and it also seems to crash hard when the limit is hit (on purpose): https://github.com/aws/aws-app-mesh-roadmap/issues/181
Other things I've noticed: there's wide variety in how the `nofile` limit is handled across software; for example, Java's HotSpot VM adjusts it itself: https://github.com/openjdk/jdk/blob/93fedc12db95d1e61c17537652cac3d4e27ddf2c/src/hotspot/os/linux/os_linux.cpp#L4575-L4589

The EKS-provided SSM Parameter that references the current EKS AMI has been reverted to reference the last good AMI in all regions globally. This will automatically resolve the issue for Karpenter and Managed node group users, and any other systems that determine the latest EKS AMI from the SSM Parameter.
We will provide another update by December 29 at 5:00 PM with a deployment timeline for new AMIs.
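For anyone resolving the AMI themselves, this is roughly how to read it back from the public SSM parameter (a sketch; the path shown is the documented Amazon Linux 2 pattern, and 1.25 is just an example Kubernetes version):

```bash
# Look up the currently recommended EKS-optimized AL2 AMI for a given
# Kubernetes version from the public SSM parameter (the one that was reverted).
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.25/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' --output text
```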
We have already reverted the change that caused this issue
It'd be ideal to identify what software is not compatible and actually get that addressed, but I understand the need to revert for the time being.
So long as you avoid `infinity`, most software will have minimal regression:

- Going from 2^10 to 2^20 slows some affected tasks by roughly 1,000x, as opposed to 2^30 where the delta is substantial.
- Software relying on the `select(2)` syscall expects the soft limit to be 1024 to correctly function (additional `select()` concerns are documented here in a dedicated section).
- Envoy can need more than even a 2^20 hard limit; this has been reported on their GH issue tracker already. `infinity` would avoid that, but it would have been wiser for only Envoy to raise its own limit that high than to expect the environment to work around Envoy's needs, given the prior regression concerns.

If you need to set an explicit limit (presumably because the defaults are not sufficient) and the advised 1024:524288 isn't enough because software doesn't request to raise its own limits, you could try matching the suggested hard limit (LimitNOFILE=524288), or double that for the traditional hard limit (2^20).

That still won't be sufficient for some software as mentioned, but that is software that should know better and handle its resource needs properly. Exhausting the FD limit is per-process, so it's not necessarily an OOM event; the system-wide FD limit is much higher (based on memory, IIRC).
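As a concrete sketch of that kind of explicit override on the node (the drop-in path is the one the AMI already uses, and 1024:524288 is the advised pair discussed above; raise the soft value only if you know a workload needs it):

```bash
#!/bin/bash
# Replace the containerd NOFILE drop-in with an explicit soft:hard pair and
# reload so the new limits apply to containers started afterwards.
mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' > /etc/systemd/system/containerd.service.d/20-limitnofile.conf
[Service]
LimitNOFILE=1024:524288
EOF
systemctl daemon-reload
systemctl restart containerd
```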
People have mentioned running into this problem on envoy proxy, which is a C++ program.
Envoy requires a large number of FDs; they have expressed that they're not interested in raising the soft limit internally and that admins should instead set a high enough soft limit.
I've since opened a feature request to justify why Envoy should raise the soft limit rather than defer that to be externally set high where it can negatively impact other software.
References:
2. Java's hotspot VM has bumped the limit by default for ~20 years;
https://github.com/systemd/systemd/blob/1742aae2aa8cd33897250d6fcfbe10928e43eb2f/NEWS#L60..L94
Note that there are also reports that using very high hard limits (e.g. 1G) is problematic: some software allocates large arrays with one element for each potential file descriptor (Java, …) — a high hard limit thus triggers excessively large memory allocations in these applications.
For `infinity`, this could require 1,000 - 1,000,000 times as much memory (MySQL, not Java, but another example of excessive memory allocation impact, coupled with the usual increased CPU load), even though you may not need that many FDs; hence a poor default.

For Java, related to the systemd v240 release notes, there was this github comment at the time about Java's memory allocation. With the 524288 hard limit that was 4MB, but `infinity`, when it resolves to 2^30 (many modern distros), would equate to roughly 2,000x that (8GB), since 2^30 / 524288 = 2,048.
While you cite 20 years, note that the hard limit has incremented over time; the 2^30 hard limit is a comparatively recent default (IIRC the actual motivation for the 2^30 hard-limit increase was actually a patched PAM issue that wasn't being resolved properly).

point being there's wide variety in how the `nofile` limit is handled
This was all (excluding Envoy) part of my original research into moving the `LimitNOFILE=1024:524288` change forward. If you want a deep-dive resource on the topic for AWS, I have you covered! 😂

Systemd has it right AFAIK: sane soft and hard limits. For AWS deployments some may need a higher hard limit, but it's a worry when software like Envoy doesn't document anything about that requirement and instead advises raising the soft limit externally.
We were using Karpenter, which again is an AWS-backed tool, and it started picking up the new AMI dynamically as we started facing issues. As a hot fix we have hardcoded the previous AMI:

```yaml
amiSelector:
  aws::name: amazon-eks*node-1.25-v20231201
```

However, we're looking forward to the AMI fix so we can make it dynamic again.
The root cause: https://github.com/awslabs/amazon-eks-ami/pull/1535
As an update to the previous announcement, we are tracking for a new release by January 4th.
As an update to the previous announcement, we are tracking for a new release by January 4th.
@ndbaker1 is this file descriptor limit change expected to be reintroduced in that release, or will it still be excluded? Just wondering whether we need to pin our AMI version until we implement our own fix for istio/envoy workloads, or until something is implemented in envoy itself to handle that change better.
@Collin3 that change has been reverted and will not be in the next AMI release 👍
This is resolved in the latest release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20231230
What happened: Customers are reporting hitting ulimits as a result of this PR #1535

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:
- EKS Platform version (`aws eks describe-cluster --name <name> --query cluster.platformVersion`):
- Kubernetes version (`aws eks describe-cluster --name <name> --query cluster.version`):
- Kernel (e.g. `uname -a`):
- Release information (run `cat /etc/eks/release` on a node):