awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Containers fail to create and probe exec errors related to seccomp on recent kernel-5.10 versions #1219

Closed: essh closed this issue 1 year ago

essh commented 1 year ago

What happened:

After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters, within a few days a number of the nodes had containers stuck in the ContainerCreating state or liveness/readiness probes reporting the following error:

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7e64d5e682113437c8c07b8301771e53c710a6ca6ee": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

This issue is very similar to https://github.com/awslabs/amazon-eks-ami/issues/1179. However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads.

We tried the suggestion from that issue (sysctl net.core.bpf_jit_limit=452534528), which immediately allowed containers to be created and probes to execute, but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.
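For reference, the measurement and the temporary mitigation we have been using look roughly like this (a sketch; the limit value is just the one suggested in #1179, not a tuned number, and the sysctl does not persist across reboots):

# Total memory currently allocated to BPF JIT images (bytes)
sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'

# Current limit the allocations are charged against (bytes)
cat /proc/sys/net/core/bpf_jit_limit

# Temporary workaround: raise the limit (value from #1179)
sudo sysctl -w net.core.bpf_jit_limit=452534528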

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I don't currently have a reproduction I can share, as my current one uses some internal code (I can hopefully produce a more generic one if required when I get a chance).

As a starting point, we only noticed this happening on nodes with pods scheduled on them that have an exec liveness & readiness probe running every 10 seconds, which performs a health check against a gRPC service using grpcurl. In addition, we also have a default Pod Security Policy (yes, we know they are deprecated 😄) with the following annotation: seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default.

These two conditions seem to be enough to trigger this issue, and the value reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' steadily increases over time until containers can no longer be created on the node.
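For context, the health check that the exec probe runs is roughly of this shape (illustrative only; the address, port and use of the standard gRPC health service are stand-ins, not our exact command):

# Run by the kubelet as an exec probe every 10 seconds; each invocation starts a new
# process in the container, which inherits the container's seccomp filters
# (docker/default via the PSP annotation above)
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check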

Anything else we need to know?:

Environment:

Official Guidance

Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux kernel version 5.10.x may get stuck in the ContainerCreating state, or their liveness/readiness probes may fail with the following error:

unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524

When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters is over the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message is logged.

This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.
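To check whether a node is running one of the kernels called out above, something like the following can be run on the node (a sketch; it only matches the two versions with the higher leak rate):

# Print the running kernel and flag the builds with the higher leak rate
case "$(uname -r)" in
  5.10.176-157.645.amzn2.*|5.10.177-158.645.amzn2.*)
    echo "$(uname -r): affected kernel with higher leak rate" ;;
  *)
    echo "$(uname -r): not one of the two versions listed above" ;;
esac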

Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel by May 3rd, 2023 at the latest.

mmerkes commented 1 year ago

v20230501 is available in all regions now! Update to the latest EKS Optimized AMIs and this issue should be resolved.

michelesr commented 1 year ago

Does the kernel fix actually fix the bug for good, or does it just bump the default BPF JIT memory limit? Can you provide a link to the patch?

dims commented 1 year ago

@michelesr https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/

dims commented 1 year ago

thanks @ljosyula @mmerkes @cartermckinnon @q2ven

https://github.com/awslabs/amazon-eks-ami/releases/tag/v20230501

skupjoe commented 3 weeks ago

We are still seeing this issue across our prod cluster after upgrading to 1.24.

Increasing the bpf_jit_limit does indeed unblock pods stuck in Pending due to this seccomp issue. But it is, at best, only a temporary fix: after a few days the new limit gets saturated again by the underlying memory leak and we are faced with the same problem.

Looks like we are going to have to downgrade to kernel 5.4 as the only option for now.

skupjoe commented 3 weeks ago

That said, can we re-open this issue?

I see from the commits referenced in this issue that the AWS team is still trying to wrangle/fix this bug, so obviously this is still actively being worked on. But this issue being in the Closed state falsely gives the impression that it has been fixed.

(Also, I am surprised that more people are not reporting this. Do we just have an unusually large number of liveness/health check probes or something?)

cartermckinnon commented 3 weeks ago

@skupjoe We haven't had any reports of this since https://github.com/awslabs/amazon-eks-ami/issues/1219#issuecomment-1534536682, can you confirm the kernel version you're on?
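For reference, a quick way to check this across the cluster (a sketch, assuming kubectl access; the kernel version comes from each node's nodeInfo):

# Kernel version reported by each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# Or directly on a node
uname -r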

skupjoe commented 1 week ago

I downgraded from 5.10 to 5.4.283-195.378.amzn2.x86_64 and unfortunately it is still happening after ~3.5 days of uptime. It typically happens to ~3 nodes at a time in a ~10-node cluster.

Increasing the bpf_jit_limit immediately fixes the Pending status and helps for another ~3 days, but then the issue comes back.

It seems to happen on instances of any size/type: I am currently looking at it happening on a c6a.large, an m5a.8xlarge, and an m5a.2xlarge node.

I am desperate to get this fixed; it has been happening ever since our EKS 1.24 upgrade. I am now on 1.27 and will be upgrading to 1.28 tonight.

We are not using PSP, and I don't see any seccomp config set anywhere at the pod level or in my k8s node config:

[root@ip-10-0-82-217 /]# sudo cat /etc/containerd/config.toml | grep -i seccomp
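# (no output: nothing seccomp-related appears in the containerd config)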

But the kernel supports it:

[root@ip-10-0-104-41 /]# grep SECCOMP /boot/config-$(uname -r)
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y

Any other suggestions? Or should I raise a new issue? Thank you.

skupjoe commented 6 days ago

I am considering moving away from amazon-eks-ami as I am desperate to get this issue fixed. Can anybody suggest a good replacement? Maybe Bottlerocket?

cartermckinnon commented 6 days ago

@skupjoe If you're seeing this issue on the 5.4 kernel branch, it is definitely not the same issue described above. Please open a new issue or an AWS support case and we can take a look.

> I see from the commits referenced in this issue that the AWS team is still trying to wrangle/fix this bug, so obviously this is still actively being worked on. But this issue being in the Closed state falsely gives the impression that it has been fixed.

The referenced commits are a kernel change made by a community member to increase the default JIT space for BPF programs. It is not a fix for the issue described here, which was a memory leak. The memory leak was fixed by @q2ven (IIRC). We have no reason to think there's been a regression.
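If it does come back, one way to tell a leak apart from an undersized default limit is to watch the JIT allocation over time on an affected node (a sketch using the same counter referenced earlier in this issue; the sampling interval is arbitrary):

# A value that climbs steadily toward net.core.bpf_jit_limit suggests a leak;
# a value that stays roughly flat suggests the default limit is simply too small
while true; do
  date
  sudo awk '/bpf_jit/ {s+=$2} END {print s}' /proc/vmallocinfo
  cat /proc/sys/net/core/bpf_jit_limit
  sleep 300
done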