awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Containers fail to create and probe exec errors related to seccomp on recent kernel-5.10 versions #1219

Closed: essh closed this issue 1 year ago

essh commented 1 year ago

What happened:

After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters, within a few days a number of the nodes had containers stuck in the ContainerCreating state or liveness/readiness probes reporting the following error:

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7e64d5e682113437c8c07b8301771e53c710a6ca6ee": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

This issue is very similar to https://github.com/awslabs/amazon-eks-ami/issues/1179. However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads.

We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528), which immediately allowed containers to be created and probes to execute, but after approximately a day the issue returned, and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.
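
For anyone else watching this, the two numbers we've been comparing on a node are roughly the following (the vmallocinfo sum is only an approximation of what the kernel has charged against the JIT limit):

# rough total of BPF JIT allocations currently reported by the kernel (bytes)
sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'

# the configured limit; once the kernel's internal charge reaches this value,
# new seccomp filters are rejected with errno 524
cat /proc/sys/net/core/bpf_jit_limit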

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I don't currently have a reproduction that I can share, as my current one uses some internal code (I can hopefully produce a more generic one if required when I get a chance).

As a starting point, we only noticed this happening on nodes that had pods scheduled on them with an exec liveness & readiness probe running every 10 seconds that performs a health check against a gRPC service using grpcurl. In addition, we also have a default Pod Security Policy (yes, we know they are deprecated 😄) with the following annotation: seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default.

These two conditions seem to be enough to trigger this issue and the values reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' will steadily increase over time until containers can no longer be created on the node.

Anything else we need to know?:

Environment:

Official Guidance

Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux kernel version 5.10.x may get stuck in the ContainerCreating state, or their liveness/readiness probes may fail with the following error:

unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524

When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters exceeds the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message is logged.

This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.
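
To check which kernel a node is running, a quick sanity check is:

uname -r
# 5.10.176-157.645.amzn2.x86_64 and 5.10.177-158.645.amzn2.x86_64 leak fastest;
# 5.4.x kernels are not affected.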

Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel by May 3rd, 2023 at the latest.

essh commented 1 year ago

I've managed to build a reliable reproduction for this issue that I can now share. A quick summary is that the impact seems to depend on instance type. I have been able to consistently reproduce this issue on c5d.xlarge & c5a.xlarge instance types (x86_64). I have seen some bpf_jit memory growth on c6g.xlarge instances (arm64) but it seems a bit slower and I haven't seen containers fail to create on these nodes yet as a result. I can't reproduce this issue on a t3a.large instance as bpf_jit memory levels remain pretty consistent.

The easiest way to reproduce this is to spin up a fresh EKS 1.24 cluster and add a single node of the required instance type (this makes it easier to observe) running EKS AMI v20230217 (or v20230304). Then run the following commands:

kubectl delete clusterrolebinding eks:podsecuritypolicy:authenticated
kubectl delete clusterrole eks:podsecuritypolicy:privileged
kubectl delete podsecuritypolicy eks.privileged
kubectl apply -f https://gist.githubusercontent.com/essh/f7dd219a48df25e7294847484da112b7/raw/503ff9a8f32f19430040cd65c213479979bfcc3c/bpf-jit-leak.yaml

This removes the eks.privileged PSP, installs PSPs that use seccomp and starts up a simple app with some exec probes that trigger the issue. The container used for this app is built from the source available at https://github.com/essh/grpc-greeter-node.

Once this is running you can observe memory growth by executing sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' on the node. You can tweak the replica count up and down to speed up or slow down this process. If you leave it long enough the value will exceed net.core.bpf_jit_limit and you will end up with failures to create containers and run exec probes. We were seeing this after about 2-3 days with our node types/workloads in our environment.
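
For example, assuming the Deployment in the gist is named bpf-jit-leak:

# increase the replica count to speed up the leak (or lower it to slow it down)
kubectl scale deployment/bpf-jit-leak --replicas=16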

This same test against EKS AMI v20230203 nodes or lower (kernel 5.4) does not exhibit this issue.

cartermckinnon commented 1 year ago

@essh really appreciate the details; I'm following up internally with our kernel folks and will update here as I try to reproduce.

essh commented 1 year ago

If it helps, I see the same behaviour with the following, much simpler manifest that doesn't require any of the (deprecated/removed) PSP fiddling. You can apply it directly to a newly created cluster that meets the reproduction requirements; nothing else is required.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bpf-jit-leak
  labels:
    app: bpf-jit-leak
spec:
  replicas: 8
  selector:
    matchLabels:
      app: bpf-jit-leak
  template:
    metadata:
      labels:
        app: bpf-jit-leak
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: bpf-jit-leak
        image: essh/grpc-greeter-node:latest
        ports:
        - containerPort: 50051
          name: grpc
          protocol: TCP
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        livenessProbe:
          exec:
            command:
            - /opt/app/scripts/health.sh
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - /opt/app/scripts/health.sh
          failureThreshold: 1
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 3
          timeoutSeconds: 5

Without the following in the spec I don't see the issue, i.e. the value reported by sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' does not periodically increase.

      securityContext:
        seccompProfile:
          type: RuntimeDefault
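
For anyone trying to work out which of their workloads set this, a rough way to list pods with a pod-level RuntimeDefault seccomp profile (assumes jq is installed; it won't catch container-level settings or profiles applied via PSP annotations):

kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.spec.securityContext.seccompProfile.type == "RuntimeDefault")
  | .metadata.namespace + "/" + .metadata.name'
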
borkmann commented 1 year ago

@essh @cartermckinnon I happened to take a look at this recently, and tried to reproduce this on latest bpf tree kernel. I dumped the values around bpf_jit_charge_modmem and bpf_jit_uncharge_modmem, in particular the size passed in and the value of bpf_jit_current after the operation. They all look sane to me. For example, when running tcpdump with a specific filter (e.g. tcpdump -i lo tcp) but also a test application loading a seccomp BPF policy, I can see the bpf_jit_current counter going up and then discharging again with the same value. Also I tested on native eBPF programs, same here. This all looks good to me.

@cartermckinnon if you follow up with kernel folks, I'd suggest checking the same thing: is bpf_jit_current steadily increasing (and never decreasing), or does it look sane when loading/unloading programs and the default limit is just too low?

Either way, the default limit for any BPF user for the JIT is currently set to 1/4 of the module memory space, and I'll send an upstream patch (and also recommend for stable) to bump this default limit to 1/2.

From @essh's description though, it looks like the counter is never decreasing which looks like an AWS kernel bug if indeed true, perhaps some backport going wrong, etc. Would be good to double check.

borkmann commented 1 year ago

Looks like a potentially missing kernel commit in seccomp is causing this issue: a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") (via https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/)

stevo-f3 commented 1 year ago

Is the memleak (mentioned in https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/) fixed in 5.4? If so, would it make sense for the kernel in the published amazon-eks-ami AMI to be downgraded from 5.10 to 5.4 until the memleak fix is "backported" to 5.10 and newer?

borkmann commented 1 year ago

5.4 kernel would not be affected as it does not seem to have the offending commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") which a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") fixes.

stevo-f3 commented 1 year ago

Thanks @borkmann for the heads up!

It's non-trivial to downgrade the kernel downstream when building an AMI based on this upstream EKS node AMI, which is on kernel 5.10. It would be great if this upstream AMI were downgraded to kernel 5.4 (at least until the memory leak fix is backported to the affected 5.10+ kernels); anyone who really needs 5.10 or newer and can live with the known memory leak can more easily upgrade the kernel on their own in a custom AMI based on the upstream one. WDYT?

borkmann commented 1 year ago

I'll defer to AWS folks with regards to your question, Cc @cartermckinnon. Hopefully this can be fixed quickly by cherry-picking the two commits below for EKS 5.10 kernel.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1140cb215fa13dcec06d12ba0c3ee105633b7c4
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=10ec8ca8ec1a2f04c4ed90897225231c58c124a7
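
In a checkout of the EKS 5.10 kernel tree that should be roughly the following (sketch only, modulo the usual stable-backport conflict resolution):

# fetch the commits from the upstream trees, then cherry-pick them
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
git cherry-pick a1140cb215fa13dcec06d12ba0c3ee105633b7c4   # seccomp: Move copy_seccomp() to no failure path.
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
git cherry-pick 10ec8ca8ec1a2f04c4ed90897225231c58c124a7   # the bpf_jit_limit default bump mentioned above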

dims commented 1 year ago

@borkmann ACK on behalf of @cartermckinnon. please give us some time to do things...

cartermckinnon commented 1 year ago

Unfortunately the series of patches we've cherry-picked internally does not seem to resolve the issue. We're still looking into it.

I was not able to reproduce this with 5.15, so we're diff-ing the changelog as well.

cartermckinnon commented 1 year ago

It's non-trivial to downgrade the kernel downstream when building an AMI based on this upstream EKS node AMI, which is on kernel 5.10

@stevo-f3 This should do it:

yum versionlock delete kernel
amazon-linux-extras disable kernel-5.10
amazon-linux-extras enable kernel-5.4
yum install -y kernel
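
You'd then presumably reboot into the new kernel and verify:

reboot
# after the node comes back up:
uname -r   # should now report a 5.4.x amzn2 kernel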

At present, we have more users needing 5.10 who are not experiencing this leak than those who are; downgrading the official build to 5.4 would be a last resort if we can't put a fix together.

stevo-f3 commented 1 year ago

We can't use 5.15; we recently downgraded to 5.10 because with 5.15 we were experiencing kernel panics on instance startup, and only in production.

Thanks for the downgrade-to-5.4 instructions; it's trivial after all. We'll use it at least until a fix is available.

q2ven commented 1 year ago

After backporting a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") to our 5.10 kernel, I didn't see any memleak with @essh's repro. I will release a new kernel with the commit and post the backport patch to the upstream 5.10 tree as well.

cartermckinnon commented 1 year ago

Thank you @q2ven!

dims commented 1 year ago

https://github.com/amazonlinux/linux/commits/kernel-5.10.176-157.645.amzn2 looks promising as it has this patch https://github.com/amazonlinux/linux/commit/fbe210f3421dfc2c8d4b4fc5c34c002099cf0a14

Also, as you can see in the spam above ^^^: https://lore.kernel.org/all/20230320143725.8394-1-daniel@iogearbox.net/

dims commented 1 year ago

@essh Can you please try this kernel, kernel-5.10.177-158.645.amzn2, and let us know?


rarecrumb commented 1 year ago

I'm running Karpenter and running into this issue; it seems to be selecting the latest AMI available for EKS.

Is there any fix I can apply? Very worrying.

dims commented 1 year ago

@rarecrumb looking at the kernel version of the newest release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20230411#:~:text=amazon%2Deks/1.22.17//-,AMI%20details,-%3A

it says: Kubernetes 1.24 and above: 5.10.176-157.645.amzn2

Which is older than the kernel I pointed to above: kernel-5.10.177-158.645.amzn2

So please wait for the next drop or build an AMI with the kernel version above.

cc @mmerkes @cartermckinnon

rarecrumb commented 1 year ago

How long until a new AMI is published with this fix in place?

essh commented 1 year ago

@essh Can you please try this kernel, kernel-5.10.177-158.645.amzn2, and let us know?

I've tried updating to kernel-5.10.177-158.645.amzn2 as suggested, but my reproduction is still showing continuing growth when testing with the sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' command. If it's felt this kernel should fix the issue, is there anything else I could do to confirm this, outside of waiting some number of days for things to start failing?

SimonKO9 commented 1 year ago

I can also confirm that issues still persist with 5.10.177-158.645.amzn2.

Wyifei commented 1 year ago

After upgrading EKS to 1.25, we also hit this issue. We're using an EKS managed node group; will the kernel fix apply to the EC2 nodes automatically?

billo-zymantas commented 1 year ago

Having the same issue with an EKS 1.26 managed node group. Node group AMI version: 1.26.2-20230411.

kirinnee commented 1 year ago

Having the same issue with EKS 1.25. Kernel version: 5.10.176-157.645.amzn2.x86_64

tomislater commented 1 year ago

The same problem for this setup:

Kernel version: 5.10.176-157.645.amzn2.x86_64
Kubelet version: v1.24.11-eks-a59e1f0

On:

Kernel version: 5.4.226-129.415.amzn2.x86_64
Kubelet version: v1.24.7-eks-fb459a0

works great.

dims commented 1 year ago

is still showing continuing growth

@essh is the growth at the same pace as before, or is the rate slower than it was originally?

mmerkes commented 1 year ago

After upgrading EKS to 1.25, we also hit this issue. We're using an EKS managed node group; will the kernel fix apply to the EC2 nodes automatically?

@Wyifei Once a new AMI is released, you'll need to perform an upgrade on your managed nodegroup to get the fixes.

As for the next AMI release, we're preparing it now and I can confirm that it includes kernel 5.10.177-158.645.amzn2 for all variants that use the 5.10 kernel. We'll provide updates on GitHub.
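
Once the release is out, updating a managed nodegroup looks roughly like this (cluster name, nodegroup name and release version below are placeholders):

aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --release-version <new-AMI-release-version>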

dims commented 1 year ago

@SimonKO9 what's your scenario? (seccomp? cilium?)

SimonKO9 commented 1 year ago

@dims it's seccomp; containers running on a node eventually crash and no more can be started (unless seccomp=unconfined is passed explicitly). We downgraded to 5.4 and are running without issues now.

Maybe it will be useful for someone: as a temporary remedy we simply rebooted the nodes, which allowed them to run for a day or two until the issue returned. It let us survive the weekend, and we could implement the downgrade just today.
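
If anyone else takes the reboot route, draining the node first keeps it reasonably graceful (the node name below is a placeholder):

kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-emptydir-data
# reboot the instance (SSM, SSH or the EC2 console), then bring it back into service:
kubectl uncordon ip-10-0-0-1.ec2.internal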

rarecrumb commented 1 year ago

I've switched to Bottlerocket

alexku7 commented 1 year ago

Have the same issue after upgrading to the latest v20230411.

The nodes hit the error within less than 1 day!

Reverting to the previous v20230406 seems to have solved the problem.

essh commented 1 year ago

@essh is the growth at the same pace as before, or is the rate slower than it was originally?

The pace felt roughly the same but I did not perform a side by side comparison at this stage.

ChrisV78 commented 1 year ago

We have the same issue and are hoping for a quick fix; otherwise we'll have to go to Bottlerocket, I guess.

cartermckinnon commented 1 year ago

We're following up on this with our kernel folks; we believe we've identified the necessary patches. I'll update here once we've verified and have a kernel build in the pipeline.

dougbaber commented 1 year ago

Just wanted to make sure you are aware that we see the issue with ECS as well, not just EKS. We have replicated it with kernel 5.10.177-158.645.amzn2.

cartermckinnon commented 1 year ago

@dougbaber thanks! I'll make sure the ECS team is aware of this issue; any users of recent 5.10 kernel builds would be impacted.

mynkkmr commented 1 year ago

We created a new EKS cluster on version 1.24. After that, the below error started to appear while containers were starting up.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

Are there any plans to revert this to the last stable version until AWS finds a fix?

reedjosh commented 1 year ago

This ☝️

Why is a broken AMI still the default for Amazon's managed node groups?

Can't that be backed out or the release pulled?

islishude commented 1 year ago

Can the team roll back the commits and make a new release?

radimk commented 1 year ago

FWIW, with eksctl it is possible to pin a previous version with:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8s
managedNodeGroups:
  - name: nodegroup
    releaseVersion: 1.24.11-20230406  # or any other from https://github.com/awslabs/amazon-eks-ami/releases
    ...

Of course, this is a very unfortunate bug that renders our nodes unusable within a day or two even with an increased bpf_jit_limit, and we're hoping for a quick fix.
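
For a brand new nodegroup, the same pinned config can be used with something like the following (the file name is just an example):

eksctl create nodegroup --config-file=cluster.yaml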

mmerkes commented 1 year ago

Official guidance:

Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux kernel version 5.10.x may get stuck in the ContainerCreating state, or their liveness/readiness probes may fail with the following error:

unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524

When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters exceeds the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message is logged.

This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.

Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel by May 3rd, 2023 at the latest.

KrisJohnstone commented 1 year ago

Same issue is impacting us, subbed for updates.

mchlbataller commented 1 year ago

Subbing for updates as well, this issue has been impacting us for days now.

marcincuber commented 1 year ago

Same issue happening on my clusters. Subbing for latest updates.

bla-ckbox commented 1 year ago

Hi, the same for me, with EKS 1.26, kernel version 5.10.176-157.645.amzn2.x86_64.

macmiranda commented 1 year ago

Hey guys, appreciate you're all subbing but think of the people that are already subbed getting all these pointless messages.

If you're not gonna add any information that's relevant to the resolution of the issue please refrain from sending another message and just click the subscribe button.

nille commented 1 year ago

The kernel fix seems to have been released now, as 5.10.178-162.673.amzn2.x86_64.

mmerkes commented 1 year ago

Yes, it's available. Folks that manage custom AMIs can start using the kernel and we're preparing AMIs for release on Wednesday that will include the latest kernel.
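
For folks building custom AMIs on top of ours, picking up the fixed kernel should look roughly like this (the kernel package is versionlocked in the AMI, so release the lock first):

yum versionlock delete kernel
yum install -y kernel-5.10.178-162.673.amzn2
reboot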

mmerkes commented 1 year ago

The v20230501 release has started now, and it includes 5.10.178-162.673.amzn2.x86_64 for all AMIs that use 5.10 kernels. We have tested the kernel and expect it to resolve this issue for customers. New AMIs should be available in all regions late tonight (PDT).