awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

EKS nodes lose readiness when containers exhaust memory #1145

Open dasrirez opened 1 year ago

dasrirez commented 1 year ago

What happened:

When our applications consume too much memory, K8s nodes on EKS clusters lose readiness and become completely inoperable for extended periods of time. This means that instead of being rescheduled immediately, pods remain stuck in a pending state, resulting in noticeable downtime. This does not happen on GKE clusters.

What you expected to happen:

Nodes should never lose readiness; instead, the containers should be restarted and/or the pods should be OOMKilled.

How to reproduce it (as minimally and precisely as possible):

  1. Provision a single-node EKS cluster running an m5.large EC2 instance.
  2. Apply the following deployment resource.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
      labels:
        app: stress-ng
      name: stress-ng
    spec:
      selector:
        matchLabels:
          app: stress-ng
      template:
        metadata:
          labels:
            app: stress-ng
        spec:
          containers:
          - args:
            - -c
            - stress-ng --bigheap 0
            command:
            - /bin/bash
            image: alexeiled/stress-ng:latest-ubuntu
            name: stress-ng
  3. Observe the node lose readiness.
    $ k get nodes
    NAME                        STATUS   ROLES    AGE     VERSION
    ip-10-11-0-6.ec2.internal   NotReady   <none>   2m14s   v1.24.7-eks-fb459a0

Anything else we need to know?:

We've traced the cause of this problem to kubelet's memory reservation, which is set by the bootstrap script here: https://github.com/awslabs/amazon-eks-ami/blob/eab112a19877122e46a706d3a91d42b85218f268/files/bootstrap.sh#L452

This is what kubeReserved is set to by default on this cluster.

  "kubeReserved": {
    "cpu": "70m",
    "ephemeral-storage": "1Gi",
    "memory": "574Mi"
  }
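
For context, the bootstrap script derives this value from the instance's maximum pod count rather than from its total memory. A minimal sketch of that calculation, assuming the 11 MiB-per-pod plus 255 MiB formula used by the script and the ENI-based limit of 29 pods on an m5.large:

```bash
# Sketch of the default kubeReserved memory computed at bootstrap,
# assuming memory = 11 MiB per pod + 255 MiB, with max_pods taken from
# the ENI-based limit of an m5.large (29).
max_pods=29
memory_to_reserve_mib=$((11 * max_pods + 255))
echo "kubeReserved.memory=${memory_to_reserve_mib}Mi"   # prints 574Mi, matching the value above
```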

Note the memory reservation; on a GKE cluster this value would be 1.8Gi (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#eviction_threshold).
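
For comparison, a rough sketch of the tiered reservation described in the GKE documentation linked above (25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% of anything above 128 GiB); on an 8 GiB m5.large it works out to roughly 1.8 GiB:

```bash
# Rough sketch of GKE's tiered memory reservation, in MiB.
# Machines with less than 1 GiB reserve a flat 255 MiB (not handled here).
gke_reserved_mib() {
  local reserved=0 remaining=$1
  local tiers=(4096:25 4096:20 8192:10 114688:6)   # size_mib:percent
  for tier in "${tiers[@]}"; do
    local size=${tier%:*} pct=${tier#*:}
    local chunk=$(( remaining < size ? remaining : size ))
    reserved=$(( reserved + chunk * pct / 100 ))
    remaining=$(( remaining - chunk ))
    (( remaining <= 0 )) && break
  done
  (( remaining > 0 )) && reserved=$(( reserved + remaining * 2 / 100 ))
  echo "$reserved"
}

echo "$(gke_reserved_mib 8192)Mi"   # 8 GiB node -> 1843Mi, i.e. ~1.8Gi
```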

When running with kubeReserved.memory="574Mi", kubelet logs indicate PLEG errors during memory exhaustion and the node loses readiness.

[root@ip-10-11-0-6 ~]# journalctl -u kubelet | grep -i pleg | grep -v SyncLoop | grep -v Generic
Jan 09 20:01:32 ip-10-11-0-6.ec2.internal kubelet[27632]: E0109 20:01:32.726177   27632 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:06:25 ip-10-11-0-6.ec2.internal kubelet[4077]: E0109 20:06:25.149411    4077 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: I0109 20:08:01.432398   12256 setters.go:546] "Node became not ready" node="ip-10-11-0-6.ec2.internal" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 LastTransitionTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: E0109 20:08:01.635341   12256 kubelet.go:2013] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"

The problem does not occur when the node is bootstrapped with kubeReserved.memory="1.8Gi". It also seems to be fine running kubeReserved.memory="1Gi", but that value is arbitrary and has not been tested thoroughly.
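
In case it's useful to anyone reproducing this, a sketch of how the default can be overridden at bootstrap on a self-managed node, using the bootstrap script's --kubelet-extra-args pass-through (the cluster name is a placeholder, and kubelet command-line flags take precedence over the generated kubelet-config.json):

```bash
#!/bin/bash
# User-data sketch: override the AMI's default kubeReserved memory at bootstrap.
# "my-cluster" is a placeholder; adjust the reservation to whatever you want to test.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=70m,memory=1800Mi,ephemeral-storage=1Gi'
```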

Environment:

dasrirez commented 1 year ago

Noting that the problem occurs in both containerd and docker based AMIs.

cartermckinnon commented 1 year ago

Thanks for the detailed issue! We definitely need to revise our kubeReserved values; computing memory as a simple function of $MAX_PODS should probably be revisited.

The GKE values seem fairly conservative to me; reserving ~23% (up from ~7% currently) of the available memory on smaller instance types isn't a change we should make without our own testing.

dasrirez commented 1 year ago

Ack, thanks for taking a look at this issue! I think the GKE values can serve as a safe upper bound on the reserve limits, but they can probably be optimized further, as mentioned. The fact that the node was able to operate with just 1Gi of memory reserved instead of 1.8Gi is a quick hint of that.

It looks like the GKE values were used until this patch https://github.com/awslabs/amazon-eks-ami/pull/419 was merged, but that was a long time ago and it was tested on 1.14 clusters. It might be worthwhile to revert to the GKE values as a quick fix until a better model is tested/developed, though I say that without a good understanding of the timeframe for the latter fix.

maximethebault commented 1 year ago

Also seeing the issue: #1098

Will close my issue in favor of this more detailed one. It's good to see some attention on this; it's worrisome to see this happen on production workloads when you use all the AWS-recommended / default settings, and when you always thought the EKS + EKS AMI combo would protect you against this kind of situation thanks to the automated configuration of memory reservation. Even more so after reading the K8s docs about system-reserved memory, kube-reserved memory, soft eviction, and hard eviction: you feel confident there would be multiple barriers to breach before getting into that situation. Yet here we are.

Thanks for looking into this @cartermckinnon!

For what it's worth, here are the settings we ended up using to protect against this (Karpenter allows us to easily add these values when launching nodes):

    systemReserved: 300Mi
    evictionSoft: memory.available: 3%
    evictionHard: memory.available: 2%

These values are completely empirical and far from scientifically determined, nor were they tested on many instance types / architectures / container runtimes / etc., but they did the job for us :)
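
For illustration, those settings map onto a Karpenter Provisioner (v1alpha5 API at the time) roughly as below; this is a sketch, the requirements/providerRef fields are omitted, and the soft-eviction grace period is a placeholder since evictionSoft requires one:

```bash
# Sketch of the equivalent Karpenter (v1alpha5) kubelet settings.
# Requirements/providerRef omitted; the 1m grace period is a placeholder.
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    systemReserved:
      memory: 300Mi
    evictionSoft:
      memory.available: 3%
    evictionSoftGracePeriod:
      memory.available: 1m
    evictionHard:
      memory.available: 2%
EOF
```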

bryanasdev000 commented 1 year ago

Same here, although in my scenario there are many small pods; every time there's an OOM outside a cgroup, the node goes down and generally takes half an hour to come back.

A good default value will be hard to set without sacrificing node capacity, but I think a good step may be to do what @maximethebault did with Karpenter: allow us to configure it via a launch template, similar to using containerd instead of Docker or setting max pods (which, by the way, we set to 110 with the Cilium CNI).

Related: https://github.com/aws/karpenter/issues/1803

stevehipwell commented 1 year ago

CC @bwagner5

stevehipwell commented 1 year ago

There are some discussions about this on the following issues.

davidroth commented 1 year ago

Experienced the same problem with t3.medium instances. Can somebody explain why this happens even though there is reserved memory for kubelet? It feels strange that although there is reserved memory for the system processes, a single pod can easily kill the whole node.

stevehipwell commented 1 year ago

@davidroth the node logic is tied to the old assumption that you can only run as many pods as there are ENI IPs, which was always slightly incorrect (host-port pods) but, with the introduction of IP prefixes, is now very much incorrect (in comparison to the accepted values everywhere else). I think this issue is most likely to impact smaller instances due to the under-provisioning of kube reserved. We've used custom node logic limiting all nodes to 110 pods and using the GKE memory calculation, and we haven't seen an issue since. You will also want to make sure your pods are setting valid requests/limits.
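
As an illustration of that kind of custom node logic, a user-data sketch that caps the node at 110 pods and supplies a GKE-style reservation instead of the ENI-derived default ("my-cluster" is a placeholder, and 1843Mi is the 8 GiB worked example from earlier in the thread):

```bash
#!/bin/bash
# Sketch: cap pods at 110 and plug in a GKE-style memory reservation,
# bypassing the AMI's ENI/max-pods derived default.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110 --kube-reserved=cpu=70m,memory=1843Mi,ephemeral-storage=1Gi'
```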

schmee-hg commented 1 year ago

Just experienced this as well today in our test environment, also with t3.medium instances. If this is not going to be solved in the near term, I think there should be a big warning somewhere in the EKS docs (or even in the console) that t3.medium and other instances affected by this are not suitable for production workloads, because this could have been a disaster had it happened in production.

seyal84 commented 1 year ago

Already experiencing this on EKS v1.23 with the m5.8xlarge instance type. When is it actually going to be resolved? I still feel like we are not hitting the root cause of this issue.

jortkoopmans commented 6 months ago

Also seeing this issue on t3.medium with stock settings using AWS managed node groups (1.24), without using prefix delegation. The discussion on how to deal with a user-defined --max-pods is very important, because this is gaining popularity (also through adoption of Karpenter). There are several tickets on this, but generally we need an alternative to using the fixed eni-max-pods.txt values.

But this issue is more critical because it also happens with stock settings: a t3.medium hosting 17 pods, using the (latest) standard AL2 AMI. Under some memory pressure, it will flap readiness or even crash.

davidroth commented 6 months ago

@jortkoopmans I discovered that it helps to check that all pods have their memory limits configured correctly. It is important that the configured memory request is the same as the configured memory limit.

Example:

    resources:
      requests:
        memory: "400Mi"
      limits:
        memory: "400Mi"

In the future, with cgroup v2 and the completion of the Quality of Service for Memory Resources feature, it will probably be possible to configure memory limits higher than memory requests, as cgroup v2 memory throttling will kick in.

Until now, I have only been able to achieve stable nodes by setting requests equal to limits.
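
As a side note, when every container's requests equal its limits the pod is placed in the Guaranteed QoS class, which can be verified with something like the following ("my-pod" is a placeholder name):

```bash
# Prints "Guaranteed" when requests equal limits for every container in the pod.
kubectl get pod my-pod -o jsonpath='{.status.qosClass}{"\n"}'
```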

stevehipwell commented 6 months ago

@davidroth in a correctly configured Kubernetes cluster, pod resources shouldn't be able to make a node become unstable; that's what system and kube reserved should be handling.

At the same time, as memory is (currently) incompressible, it's good practice to use the same value for limits as for requests. A likely side effect of this is that all pods have additional overhead reserved, which the node can then use on top of its default reserved values. This is exactly what we saw while we were still using the default kube reserved values with prefix delegation supporting 110 pods per node: clusters with pods that had resources exactly specified were much less likely to have a node problem than those with unconstrained or highly burstable pods.

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration which completely stopped this issue (FYI AKS also uses this and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources and I've not seen a node become unstable from this cause since we made this change.

However, with Karpenter only supporting a fixed value for kube reserved, we've been struggling to find a solution so that we can get Karpenter into production. At the same time, AKS announced a (still pending for v1.29) move towards a per-pod calculation of kube reserved, but with a much higher per-pod cost than EKS's (20MiB vs 11MiB), which made me re-evaluate the problem.

So, going back to first principles, we get the following statements.

My summary of the above is that for a "correct" solution the resources need to be configured per pod, but due to other considerations this is very hard, and the general solution has been to use second-order effects to build a good-enough solution. These second-order effects require certain constraints to be in place, and once they aren't, there is the potential for nodes to become unstable; the examples here for the EKS calculation are small nodes supporting 110 pods with a large number of those pods deployed as burstable, and large nodes with more than about 25 pods using all of the available node resources.

As a number of things have changed since the above algorithms were created, I think we can solve this problem without needing to rely on second-order effects. If we combine a static system & kube reserved configuration for the nodes and then introduce a runtime class with pod overheads, we can directly model the system as it is rather than how it could be. This results in nodes which can better make use of their available resources, as resources are only reserved when they're needed. The issue here is that there isn't a concept of a default runtime class, so that would need to be added with a webhook (but mutating policy via CEL is WIP).
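
To make that concrete, a sketch of what a runtime class with a fixed per-pod overhead could look like; the overhead numbers are placeholders rather than measured values, and pods would need to opt in via runtimeClassName until some form of defaulting exists:

```bash
# Sketch of a RuntimeClass that charges each pod a fixed resource overhead.
# The podFixed values are placeholders, not measured figures.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: default-overhead
handler: runc
overhead:
  podFixed:
    cpu: 10m
    memory: 20Mi
EOF
```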

@cartermckinnon could we get some numbers published about the node resource utilisation; ideally with no pods running and then per pod.

zip-chanko commented 2 months ago

Experiencing the same. I also notice that there isn't anything set for systemReserved in the config. I am thinking of a temporary workaround to increase evictionHard. Do we know if this is achievable without baking a new AMI?

kubelet-config.json:

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "address": "0.0.0.0",
  "authentication": {
    "anonymous": { "enabled": false },
    "webhook": { "cacheTTL": "2m0s", "enabled": true },
    "x509": { "clientCAFile": "/etc/kubernetes/pki/ca.crt" }
  },
  "authorization": {
    "mode": "Webhook",
    "webhook": { "cacheAuthorizedTTL": "5m0s", "cacheUnauthorizedTTL": "30s" }
  },
  "clusterDomain": "cluster.local",
  "hairpinMode": "hairpin-veth",
  "readOnlyPort": 0,
  "cgroupDriver": "systemd",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true,
    "KubeletCredentialProviders": true
  },
  "protectKernelDefaults": true,
  "serializeImagePulls": false,
  "serverTLSBootstrap": true,
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_GCM_SHA256"
  ],
  "registryPullQPS": 20,
  "registryBurst": 40,
  "clusterDNS": ["172.20.0.10"],
  "kubeAPIQPS": 10,
  "kubeAPIBurst": 20,
  "evictionHard": {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "kubeReserved": {
    "cpu": "110m",
    "ephemeral-storage": "1Gi",
    "memory": "2829Mi"
  },
  "maxPods": 234,
  "providerID": "aws:///ap-southeast-2c/i-123456789",
  "systemReservedCgroup": "/system",
  "kubeReservedCgroup": "/runtime"
}
```
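
In case it helps, raising the hard eviction threshold without baking a new AMI can be sketched via the same --kubelet-extra-args pass-through at bootstrap; the values below are illustrative only, and kubelet flags take precedence over the generated config file:

```bash
#!/bin/bash
# Sketch: raise evictionHard at bootstrap instead of rebuilding the AMI.
# "my-cluster" and the thresholds are placeholders.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%'
```
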
stevehipwell commented 2 months ago

@zip-chanko, as the limits aren't being enforced, the combined values for kube and system reserved can be treated as a single unit.

tooptoop4 commented 1 day ago

dis a doozie