I've managed to build a reliable reproduction for this issue that I can now share. A quick summary is that the impact seems to depend on instance type. I have been able to consistently reproduce this issue on c5d.xlarge & c5a.xlarge instance types (x86_64). I have seen some bpf_jit memory growth on c6g.xlarge instances (arm64) but it seems a bit slower and I haven't seen containers fail to create on these nodes yet as a result. I can't reproduce this issue on a t3a.large instance as bpf_jit memory levels remain pretty consistent.
The easiest way to reproduce this is to spin up a fresh EKS 1.24 cluster and add a single node of the required instance type (this makes it easier to observe) running EKS AMI v20230217 (or v20230304). Then run the following commands:
kubectl delete clusterrolebinding eks:podsecuritypolicy:authenticated
kubectl delete clusterrole eks:podsecuritypolicy:privileged
kubectl delete podsecuritypolicy eks.privileged
kubectl apply -f https://gist.githubusercontent.com/essh/f7dd219a48df25e7294847484da112b7/raw/503ff9a8f32f19430040cd65c213479979bfcc3c/bpf-jit-leak.yaml
This removes the eks.privileged PSP, installs PSPs that use seccomp and starts up a simple app with some exec probes that trigger the issue. The container used for this app is built from the source available at https://github.com/essh/grpc-greeter-node.
Once this is running you can observe memory growth by executing sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' on the node. You can tweak the replica count up and down to speed up/slow down this process. If you leave it long enough the value will exceed net.core.bpf_jit_limit and you will end up with failures to create containers/exec probes. We were seeing this after about 2-3 days with our node types/workloads in our environment.
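If you want to watch the growth continuously rather than sampling by hand, a small loop like the following works (just a sketch; it only wraps the awk command above and the net.core.bpf_jit_limit sysctl):

# Sample bpf_jit usage (bytes) and the configured limit once a minute
while true; do
  used=$(sudo awk '/bpf_jit/ {s += $2} END {print s}' /proc/vmallocinfo)
  limit=$(sysctl -n net.core.bpf_jit_limit)
  echo "$(date -Is) bpf_jit_used=${used} bpf_jit_limit=${limit}"
  sleep 60
done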
This same test against EKS AMI v20230203 nodes or lower (kernel 5.4) does not exhibit this issue.
@essh really appreciate the details; I'm following up internally with our kernel folks and will update here as I try to reproduce.
If it helps I see the same behaviour with the following much simpler manifest that doesn't require any of the (deprecated/removed) PSP fiddling. You can apply this directly to a newly created cluster that meets the reproduction requirements, nothing else required.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bpf-jit-leak
  labels:
    app: bpf-jit-leak
spec:
  replicas: 8
  selector:
    matchLabels:
      app: bpf-jit-leak
  template:
    metadata:
      labels:
        app: bpf-jit-leak
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: bpf-jit-leak
          image: essh/grpc-greeter-node:latest
          ports:
            - containerPort: 50051
              name: grpc
              protocol: TCP
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 256Mi
          livenessProbe:
            exec:
              command:
                - /opt/app/scripts/health.sh
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command:
                - /opt/app/scripts/health.sh
            failureThreshold: 1
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 3
            timeoutSeconds: 5
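Assuming you've saved the manifest above as bpf-jit-leak.yaml (the filename and default namespace are my own choices), applying it and scaling the replica count is enough to control how quickly the counter grows:

kubectl apply -f bpf-jit-leak.yaml
# more replicas = more seccomp-filtered exec probes = faster bpf_jit growth
kubectl scale deployment/bpf-jit-leak --replicas=16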
Without the following on the spec I don't see the issue, i.e. the value reported by sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' is not periodically increasing.
securityContext:
  seccompProfile:
    type: RuntimeDefault
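If you're trying to gauge exposure in an existing cluster, one way to find deployments whose pod template sets this profile is to filter on that field (a sketch; assumes jq is available and only checks the pod-level securityContext, not per-container ones):

kubectl get deployments -A -o json | jq -r '.items[]
  | select(.spec.template.spec.securityContext.seccompProfile.type == "RuntimeDefault")
  | "\(.metadata.namespace)/\(.metadata.name)"'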
@essh @cartermckinnon I happened to take a look at this recently, and tried to reproduce this on latest bpf tree kernel. I dumped the values around bpf_jit_charge_modmem and bpf_jit_uncharge_modmem, in particular the size passed in and the value of bpf_jit_current after the operation. They all look sane to me. For example, when running tcpdump with a specific filter (e.g. tcpdump -i lo tcp) but also a test application loading a seccomp BPF policy, I can see the bpf_jit_current counter going up and then discharging again with the same value. Also I tested on native eBPF programs, same here. This all looks good to me.
@cartermckinnon if you follow up with the kernel folks, I'd suggest checking the same thing: is bpf_jit_current steadily increasing (and never decreasing), or does it look sane when loading/unloading programs and the default limit is just too low?
Either way, the default limit for any BPF user for the JIT is currently set to 1/4 of the module memory space, and I'll send an upstream patch (and also recommend for stable) to bump this default limit to 1/2.
From @essh's description though, it looks like the counter is never decreasing which looks like an AWS kernel bug if indeed true, perhaps some backport going wrong, etc. Would be good to double check.
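As a stopgap while that's investigated, the per-node limit can be inspected and raised with sysctl (a sketch; the value is the workaround figure quoted in the issue description, and raising it only buys time because the leaked charge keeps growing):

sysctl net.core.bpf_jit_limit                    # current limit in bytes
sudo sysctl -w net.core.bpf_jit_limit=452534528  # temporary bump, not a fix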
Looks like a potentially missing kernel commit in seccomp is causing this issue: a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") (via https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/)
Is the memleak (mentioned in https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/) fixed in 5.4? If so, would it make sense for the kernel in the published amazon-eks-ami AMIs to be downgraded from 5.10 to 5.4 until the memleak fix is backported to 5.10 and newer?
5.4 kernel would not be affected as it does not seem to have the offending commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") which a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") fixes.
Thanks @borkmann for the heads up!
It's non-trivial to downgrade the kernel downstream when building an AMI based on this upstream EKS node AMI, which is on kernel 5.10. It would be great if this upstream AMI were downgraded to kernel 5.4 (at least until the memory leak fix is backported to the affected 5.10+ kernels); anyone who really needs 5.10 or newer and can live with the known memory leak could more easily upgrade the kernel on their own in a custom AMI based on the upstream one. WDYT?
I'll defer to AWS folks with regards to your question, Cc @cartermckinnon. Hopefully this can be fixed quickly by cherry-picking the two commits below for EKS 5.10 kernel.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1140cb215fa13dcec06d12ba0c3ee105633b7c4 https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=10ec8ca8ec1a2f04c4ed90897225231c58c124a7
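A quick way to check whether the kernel package on a node already claims a related fix is to grep the RPM changelog (a sketch; the exact wording the Amazon Linux maintainers use in their changelog entries is an assumption):

rpm -q --changelog kernel | grep -iE 'seccomp|bpf_jit' | head -n 20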
@borkmann ACK on behalf of @cartermckinnon. Please give us some time to do things...
Unfortunately the series of patches we've cherry-picked internally does not seem to resolve the issue. We're still looking into it.
I was not able to reproduce this with 5.15, so we're diff-ing the changelog as well.
It's non-trivial to downgrade the kernel downstream when building an AMI based on this upstream EKS node AMI, which is on kernel 5.10
@stevo-f3 This should do it:
yum versionlock delete kernel
amazon-linux-extras disable kernel-5.10
amazon-linux-extras enable kernel-5.4
yum install -y kernel
At present, we have more users needing 5.10 who are not experiencing this leak than those who are; downgrading the official build to 5.4 would be a last resort if we can't put a fix together.
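If you do apply the downgrade above, once the 5.4 kernel is installed and the node rebooted it's worth confirming the running kernel and optionally re-locking the package so it doesn't drift back to 5.10 (a sketch; assumes the versionlock plugin the AMI already uses):

uname -r                     # expect a 5.4.x-*.amzn2 kernel after the reboot
yum versionlock add kernel   # optional: pin the installed kernel version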
We can't use 5.15 - we recently downgraded to 5.10; with 5.15 we were experiencing kernel panics on instance startup, in production only.
Thanks for the downgrade-to-5.4 instructions; it's trivial after all. We'll use it at least until a fix is available.
After backporting a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") to our 5.10 kernel, I didn't see any memleak with @essh 's repro. I will release a new kernel with the commit and post the backport patch to the upstream 5.10 tree as well.
Thank you @q2ven!
https://github.com/amazonlinux/linux/commits/kernel-5.10.176-157.645.amzn2 looks promising as it has this patch https://github.com/amazonlinux/linux/commit/fbe210f3421dfc2c8d4b4fc5c34c002099cf0a14
Also as you can see the spam above ^^^: https://lore.kernel.org/all/20230320143725.8394-1-daniel@iogearbox.net/
@essh Can you please try this kernel, kernel-5.10.177-158.645.amzn2, and let us know?
I'm running Karpenter and running into this issue; it seems to be selecting the latest AMI available for EKS.
Is there any fix I can apply? Very worrying.
@rarecrumb looking at the kernel version of the newest release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20230411#:~:text=amazon%2Deks/1.22.17//-,AMI%20details,-%3A it says:
Kubernetes 1.24 and above: 5.10.176-157.645.amzn2
which is older than the kernel I pointed to above: kernel-5.10.177-158.645.amzn2
So please wait for the next drop or build an AMI with the kernel version above.
cc @mmerkes @cartermckinnon
How long until a new AMI is published with this fix in place?
@essh Can you please try this kernel, kernel-5.10.177-158.645.amzn2, and let us know?
I've tried updating to kernel-5.10.177-158.645.amzn2 as suggested but my reproduction is still showing continuing growth when testing with the sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' command. If it's felt this kernel should fix the issue is there anything else I could do to confirm this outside of waiting some number of days for things to start failing?
I can also confirm that issues still persist with 5.10.177-158.645.amzn2.
After upgrading EKS to 1.25, we also hit this issue. We're using an EKS managed node group; will the kernel fix apply to the EC2 nodes automatically?
Having the same issue with an EKS 1.26 managed node group. Node group AMI version 1.26.2-20230411
Having the same issue with EKS 1.25. AMI version: 5.10.176-157.645.amzn2.x86_64
The same problem with this setup:
Kernel version: 5.10.176-157.645.amzn2.x86_64
Kubelet version: v1.24.11-eks-a59e1f0
Whereas this setup works great:
Kernel version: 5.4.226-129.415.amzn2.x86_64
Kubelet version: v1.24.7-eks-fb459a0
is still showing continuing growth
@essh growth at the same pace as before, or is the growth rate slower than it was originally?
After upgrading EKS to 1.25, we also hit this issue. We're using an EKS managed node group; will the kernel fix apply to the EC2 nodes automatically?
@Wyifei Once a new AMI is released, you'll need to perform an upgrade on your managed nodegroup to get the fixes.
As for the next AMI release, we're working on preparing it and I can confirm that it has kernel 5.10.177-158.645.amzn2 for all variants that use the 5.10 kernel. We'll provide updates on GitHub.
@SimonKO9 what's your scenario? (seccomp? cilium?)
@dims it's seccomp, containers running on a node crash eventually and no more can be started (unless seccomp=unconfined is passed explicitly). We downgraded to 5.4 and we're running without issues now.
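For reference, opting a pod out of seccomp entirely (which is what passing seccomp=unconfined amounts to) looks roughly like this in the pod spec; it avoids the failure but removes the filtering, so treat it as a stopgap only:

securityContext:
  seccompProfile:
    type: Unconfined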
Maybe it will be useful for anyone - as a temporary remedy we simply rebooted the nodes which allowed them to run for a day or two until the issue returned. It let us survive the weekend and we could implement the downgrade just today.
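For anyone using the same reboot workaround, a minimal cordon/drain sketch (the node name is a placeholder, and how you reboot the instance depends on your setup):

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# reboot the EC2 instance (console, SSM, etc.), then bring it back:
kubectl uncordon <node-name>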
I've switched to Bottlerocket
Have the same issue after upgrading to the latest v20230411. The nodes hit the error within less than a day!
Reverting to the previous v20230406 seems to have solved the problem.
@essh growth at the same pace as before, or is the growth rate slower than it was originally?
The pace felt roughly the same but I did not perform a side by side comparison at this stage.
We have the same issue and are hoping for a quick fix; otherwise we'll have to go to Bottlerocket, I guess.
We're following up on this with our kernel folks; we believe we've identified the necessary patches. I'll update here once we've verified and have a kernel build in the pipeline.
Just wanted to make sure you are aware that we see the issue with ECS as well, not just EKS. We have replicated it with kernel 5.10.177-158.645.amzn2
@dougbaber thanks! I'll make sure the ECS team is aware of this issue; any users of recent 5.10 kernel builds would be impacted.
We created a new EKS cluster on version 1.24. After that, the below error started to show while containers were starting up:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
Any plans of reverting this to the last stable version until AWS finds a fix?
This ☝️
Why is a broken AMI still the default for Amazon's managed node groups?
Can't that be backed out or the release pulled?
Can the team roll back the commits and make a new release?
FWIW, with eksctl it is possible to pin a previous version with:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8s
managedNodeGroups:
  - name: nodegroup
    releaseVersion: 1.24.11-20230406 # or any other from https://github.com/awslabs/amazon-eks-ami/releases
    ...
Of course, this is a very unfortunate bug that renders our nodes unusable within a day or two even with an increased bpf_jit_limit, and we're hoping for a quick fix.
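If you're not using eksctl, the same pin can be set when creating a managed nodegroup with the AWS CLI; as far as I know existing nodegroups can't be rolled back to an older release, so this is for new groups (the names, subnets and role ARN below are placeholders):

aws eks create-nodegroup \
  --cluster-name k8s \
  --nodegroup-name nodegroup-pinned \
  --release-version 1.24.11-20230406 \
  --subnets subnet-aaaa subnet-bbbb \
  --node-role arn:aws:iam::111122223333:role/eks-node-role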
Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux Kernel version 5.10.x may get stuck in ContainerCreating state or their liveness/readiness probes fail with the following error:
unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524
When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. The Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters is over the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message will be logged.
This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.
Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel latest by May 3rd, 2023.
Same issue is impacting us, subbed for updates.
Subbing for updates as well, this issue has been impacting us for days now.
Same issue happening on my clusters. Subbing for latest updates.
Hi, the same for me, with EKS 1.26 AMI version: 5.10.176-157.645.amzn2.x86_64
Hey guys, appreciate you're all subbing but think of the people that are already subbed getting all these pointless messages.
If you're not gonna add any information that's relevant to the resolution of the issue please refrain from sending another message and just click the subscribe button.
The kernel fix seems to have been released now, as 5.10.178-162.673.amzn2.x86_64.
Yes, it's available. Folks that manage custom AMIs can start using the kernel and we're preparing AMIs for release on Wednesday that will include the latest kernel.
The v20230501 release has started now, and it includes 5.10.178-162.673.amzn2.x86_64 for all AMIs that use 5.10 kernels. We have tested the kernel and expect it to resolve this issue for customers. New AMIs should be available in all regions late tonight (PDT).
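Once the new AMIs land, a quick way to verify which nodes are already running the fixed kernel (nothing AMI-specific assumed, just the node info the kubelet reports):

kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion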
What happened:
After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters, after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error:
unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524
This issue is very similar to https://github.com/awslabs/amazon-eks-ami/issues/1179. However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads.
We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute, but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.
What you expected to happen:
Containers to be created successfully and pods to remain Ready.
How to reproduce it (as minimally and precisely as possible):
I don't currently have a reproduction that I can share due to my current one using some internal code (I can hopefully produce a more generic one if required when I get a chance).
As a starting point we only noticed this happening on nodes that had pods scheduled on them which had an exec liveness & readiness probe running every 10 seconds that performs a health check against a gRPC service using grpcurl. In addition to this we also have a default Pod Security Policy (yes we know they are deprecated 😄) that has the following annotation: seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default.
These two conditions seem to be enough to trigger this issue and the values reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' will steadily increase over time until containers can no longer be created on the node.
Anything else we need to know?:
Environment:
- EKS platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.4"
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): "1.24"
- AMI version: v20230217
- Kernel (uname -a): 5.10.165-143.735.amzn2.x86_64 #1 SMP Wed Jan 25 03:13:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Release information (cat /etc/eks/release on a node):

Official Guidance
Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux Kernel version 5.10.x may get stuck in ContainerCreating state or their liveness/readiness probes fail with the following error:
unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524
When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. The Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters is over the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message will be logged.
This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.
Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel latest by May 3rd, 2023.