
Azure Kubernetes Service
https://azure.github.io/AKS/

AKS maintained Pods should not overcommit memory #2125

Open mblaschke opened 3 years ago

mblaschke commented 3 years ago

What happened: Many AKS-maintained Pods are running with memory overcommitment, e.g.:

- omsagent (daemonset; up to 375 MB)
- coredns (deployment; up to 100 MB)
- kube-proxy (daemonset; unlimited?!)
- azure-policy (deployment; up to 150 MB)
- azure-policy-webhook (deployment; up to 150 MB)
- and others

For example, OMS agent pods (from the omsagent daemonset) are running with:

       resources:
          limits:
            cpu: 500m
            memory: 600Mi
          requests:
            cpu: 75m
            memory: 225Mi

Under high load, if node memory usage reaches e.g. 99% (which is possible with Kubernetes), this might trigger the OOM killer on the host (not in the pod!) and so might affect other Pods!

What you expected to happen:

Be fair and set the memory limit and request to the same value; don't overcommit memory.

       resources:
          limits:
            cpu: 500m
            memory: 600Mi
          requests:
            cpu: 75m
            memory: 600Mi

How to reproduce it (as minimally and precisely as possible):

Spin up AKS 1.18 or 1.19 with and without add-on services (e.g. omsagent, policy agent, ...).

Anything else we need to know?:

Environment:

- Kubernetes version (use `kubectl version`): 1.18.14 (also with 1.19.7)
- Size of cluster (how many worker nodes are in the cluster?): 3

ghost commented 3 years ago

Hi mblaschke, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

@az-policy-kube would you be able to assist?

miwithro commented 3 years ago

Here is a document that describes the resource reservations for the core Kubernetes components:
https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations

I would recommend using a system node pool for the system pods (deployments) and an application node pool for customer workloads. This will reduce the impact of the system components on your applications. https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools

The challenge here is the daemonsets, since they run on all the nodes, and we need to support various kinds of clusters, small and large, with different VM SKUs on different agent pools. For example, the smallest VM SKU we support is F2, which has 4 GB of memory and ~2 GB allocatable. We usually set the memory request to the minimum footprint of the pod, and set the memory limit to a large number from the "large cluster" scaling test. If we increased the request to match the limit, DS pods might not fit on the F2 VM.

The omsagent memory usage depends on how much log data is written by all the pods running on the node, mostly contributed by the customer's application pods.

Adding the Policy team to comment on their pods.

mblaschke commented 3 years ago

The problem here is: there is no memory overcommit. You can overcommit CPU but not memory (if you don't have swap, which is strongly discouraged by Kubernetes).

So if the application Pods are using 99% of memory and e.g. omsagent or even kube-proxy (which has no reservation at all) uses 2% more memory, this will trigger the OOM killer on the host and kill a container or application more or less randomly.

IMHO the options are:

1) Make it configurable (e.g. implement an operator) how much the customer wants to reserve/limit (or maybe the customer doesn't care).
2) Set request = limit.
3) Or let the customer deal with the fact that this can trigger an OOM killer on the host, randomly.
4) The customer has to deploy a pause pod which reserves e.g. 500Mi to deal with the fact that there are apps which are not reserving their memory properly (a sketch follows below).
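
For option 4, a minimal sketch of such a "pause pod" reservation (the name, size and image tag here are illustrative, not anything AKS ships):

```yaml
# Hypothetical headroom reservation: the pause container does nothing,
# but its request/limit keeps 500Mi free for bursting system pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-headroom          # made-up name
  namespace: kube-system
spec:
  replicas: 1                    # use a DaemonSet instead to reserve headroom on every node
  selector:
    matchLabels:
      app: memory-headroom
  template:
    metadata:
      labels:
        app: memory-headroom
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # image/tag is an assumption
          resources:
            requests:
              memory: 500Mi
            limits:
              memory: 500Mi
```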

I'm not that happy about the statement regarding system and user node pools. The argument (from the Microsoft consultant) for AKS was that the customer doesn't have to care about the masters, and now it feels like there are "masters" (sort of) again. This would mean that system nodes cannot be used for anything else (because they are "unstable") and would reduce the total pool count by one (leaving only 9 pools for applications). With proper memory reservation there is no real reason not to use these nodes :)

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

miwithro commented 3 years ago

@juan-lee @robbiezhang

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

mblaschke commented 2 years ago

prometheus-operator always alerts on those Pods:

52.5% throttling of CPU in namespace kube-system for container openvpn-client in pod aks-link-7676547cc6-5lfgd.

This should be fixed or somehow we should be able to configure it.

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

miwithro commented 2 years ago

@Kaarthis FYI.

kaarthis commented 2 years ago

Applications might use more memory at startup time but less later, after GC. If we set the memory request/limit low, the pod may be OOMKilled (as highlighted) and fail to start. If AKS sets the request/limit high for startup, memory utilization may be low at runtime and VM compute power wasted. Therefore, AKS's strategy is to set the request close to the regular usage and the limit to the spike usage.

The memory consumption patterns of different applications are different, and they differ between clusters as well. AKS today provides a high limit based on some estimation analysis. Having said that, there is definitely an opportunity to provide configurable memory limits for customers. This is something we shall consider in our roadmap this year, though there is no immediate solution available.

mblaschke commented 2 years ago

In Kubernetes you can use nearly 100% of the node memory by reserving it for Pods. So applications can reserve the memory they actually use and thus improve performance.

Unlike CPU, there is no memory overcommitment, so if Pods are using 99% of the VM's memory (because it was reserved that way), the AKS-maintained Pods can run into the overcommitment (use more than they reserved) and trigger a kernel OOM killer (instead of the cgroup memory killer), which kills innocent Pods/Containers.

To solve it: don't overcommit memory, because there is no way to overcommit it (there is no swap).
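
For context, a pod whose memory (and CPU) requests equal its limits is classified as Guaranteed QoS, and the kubelet gives such containers the lowest oom_score_adj, so under node memory pressure the kernel OOM killer targets BestEffort and overcommitted Burstable containers first. A minimal sketch (pod name, image and values are illustrative only):

```yaml
# Illustrative Guaranteed-QoS pod: requests == limits for every resource,
# so the kernel OOM killer prefers other (Burstable/BestEffort) victims.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example       # hypothetical name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        requests:
          cpu: 500m
          memory: 600Mi
        limits:
          cpu: 500m
          memory: 600Mi
```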

Workaround: the customer has to reserve e.g. 5% to 10% of memory on every system node because AKS does not take care of memory reservations. We're not talking about a few megabytes, we're talking about more than 1 GiB of memory.

Either solve your memory reservation problem or make it configurable for the customer. But simply overcommitting a resource which is not overcommittable is not really a professional way to run a managed Kubernetes cluster.

randywatson1979 commented 2 years ago

I was wondering if this might result in issues after node upgrades via the Azure portal, when memory usage might spike above 100%, especially when automatic node scaling is disabled.

I've seen one node, for example, where a few pods were scheduled based on their requests, but their limits caused huge overcommitment.

It is ultimately a choice whether or not you want to set the memory request a bit closer to the limit in order to prevent too much overcommitment and run reliably during heavy memory workloads. Of course the trade-off is that you will need more or better-specced nodes. It depends on whether you are using the node for development and testing or for production. For production you might want a tighter request and limit spec.

What I also encounter is a node using memory above 100%, as reported by kubectl top nodes. This might be because the node count is fixed and the cluster cannot autoscale. However, if you add one more node, it does not evict the pod and reschedule it on the new node.

ghost commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

mblaschke commented 2 years ago

bump

mblaschke-daimlertruck commented 2 years ago

Example: gatekeeper-system (runs only on system nodes):

- audit (1 Pod): 256 Mi reserved, 3 Gi limit
- controller (2 Pods): 256 Mi reserved, 2 Gi limit

So gatekeeper can use up to 6.25 Gi of memory which is never reserved (7 Gi of limits against 0.75 Gi of requests).


Example: kube-system/omsagent-* (runs on all nodes), 1 Pod per node:

- container omsagent: 325 Mi reserved, 750 Mi limit
- container omsagent-prometheus: 225 Mi reserved, 1 Gi limit

Committed memory: 550 Mi; overcommitted memory: 1.2 Gi (worst case); average memory usage in our clusters: ~800 Mi.

holotrek commented 2 years ago

I'm running into this too. I'm using Standard_B2s with 4 GiB of memory. These are the kube-system pod requests/limits:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1158m (60%)   10550m (555%)
  memory             1982Mi (91%)  11724Mi (543%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

In reality they use only about 40%, but simply adding a new pod with any sort of request causes the node to overcommit and the pod to fail with OOMKilled status. Running multiple instances of the same pod without resource requests/limits works fine.

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

marcindulak commented 1 year ago

:eyes:

aldycool commented 1 year ago

I've tried patching the daemonsets/deployments with lower values, but they reset back to the original ones after re-deployment. This greatly reduces how much of the (paid) resources customers can actually use...
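
For reference, such a patch might look roughly like the sketch below (a strategic merge patch for the omsagent DaemonSet; the values are illustrative, and as noted AKS restores the original spec on the next reconciliation):

```yaml
# omsagent-resources.patch.yaml -- illustrative only; AKS reverts manual edits.
# Assumed usage: kubectl -n kube-system patch daemonset omsagent --patch-file omsagent-resources.patch.yaml
spec:
  template:
    spec:
      containers:
        - name: omsagent           # container name as mentioned earlier in this thread
          resources:
            requests:
              memory: 325Mi
            limits:
              memory: 325Mi
```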

mblaschke commented 1 year ago

@aldycool Lower values might not help, as you will run into memory overcommitment. Memory CANNOT be overcommitted, but CPU can (containers might just be slow or starved of processing power).

If you don't reserve the memory you're using and hit the 100% limit, the kernel will randomly kill your containers with a SIGKILL. This issue is about the fact that the AKS services don't reserve the memory they are using, so you have to be careful with your own memory usage.

Kubernetes can and will use all the memory you're reserving; if you overcommit memory you're risking the kernel OOM killer.

(You could enable swap, but Kubernetes does not recommend this as you would no longer get consistent performance.)

aldycool commented 1 year ago

@mblaschke Yes, actually my point is that I was surprised that we as customers cannot manually change the preset values of the Azure-specific resources, either memory or CPU. In my case, the offending part is the CPU limits reserved for the AKS resources. By default, Azure AKS deploys these containers with huge CPU limits (instance counts assume 3 nodes in a node pool):

If my nodes only have 4 CPUs each, that means the leftover CPU limit per node is only 1500m (4000m is consumed by cloud-node-manager + azure-ip-masq-agent). I realized this because my Prometheus is throwing CPU overcommit alerts:

Labels
alertname = KubeCPUOvercommit
prometheus = monitoring/kube-prometheus-stack-prometheus
severity = warning
Annotations
description = Cluster has overcommitted CPU resource requests for Pods by 3.1319999999999997 CPU shares and cannot tolerate node failure.
runbook_url = https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit
summary = Cluster has overcommitted CPU resource requests.
romlod commented 1 year ago

Is there already some idea of how to deal with this? In my opinion the ranges make no sense at all. We want to follow some best practices (e.g. at most 125% CPU overcommitted), but with these ranges we already have nearly 40% committed on an 8 vCPU machine just because of these Pods.

seguler commented 1 year ago

Hi @romlod, others

This feedback is valid. However, I'd like to understand a bit further about the overcommitment.

We strongly recommend customers use a dedicated system nodepool for their cluster to host these system pods. This is to ensure system stability and to reserve (I recognize the overcommitment) resources to reliably handle peaks in load. Why are there peaks in consumption? Partly because we don't own the code for many of these components, and spikes in usage aren't something we can always tune. And they happen.

If you run your user pods in the same nodepool, you risk saturating resources on the system and causing system instability. E.g. one runaway pod could result in downtime for the whole cluster if it evicts or fails the CoreDNS pods.

Given this, we preferred reliability over utilization and did overcommit resources. In @romlod's case, you could run a system nodepool with 4-CPU machines and improve utilization while you run your applications on a user pool. Does this help with your concern?

I still recognize that you should be able to customize some of these settings in case you are running a very busy cluster and the limits we set are not accommodating. But I don't know if I fully understand why overcommitment is an issue. E.g. if we were to set requests=limits, then you couldn't run a cost-optimized small cluster and would need at least a 4 vCPU x 3 nodepool just for running the system pods.

romlod commented 1 year ago

Hello. We use a dedicated system node pool. However, even a node in a user node pool minimally contains pods like omsagent, csi-*, kube-proxy, cloud-node-manager and azure-ip-masq-agent (maybe even more). If you look at the cloud-node-manager pod, it has a CPU request of 50m but can burst to 2000m.

With numbers like these, you have no choice but to take this into account on every node. So having a system node pool wouldn't solve this, because these pods are there by default.

On top of that, if this is the default, then I really think the documentation should be changed so that everyone can 1) do proper sizing and 2) know what they are dealing with when choosing AKS.

mblaschke-daimlertruck commented 1 year ago

@seguler AKS Pods run overcommitted even on non-system nodes. As a possible fix, AKS users have to run a pause container which just reserves the overcommitted memory, and run only AKS pods on system nodes and nothing else. This leads to the question of why system nodes are not tainted against other pods if they should not be used for other workloads.

But even on system nodes the AKS Pods are overcommitting memory, and this would need swap to be enabled as a possible fix (a very, very bad solution). If AKS Pods use too much memory it could trigger a kernel OOM killer, which could even mean we're losing logs or causing service outages.

see https://github.com/Azure/AKS/issues/2125#issuecomment-1176242678

Another example: calico-node, calico-typha and tigera-operator come completely WITHOUT resource reservations and so consume unreserved memory on ALL node pools. Calico could consume 100% of a node's memory and so lead to a total service outage.

I'm wondering why nobody from the AKS team is actually responding to the actual issue but only referencing "use system nodes" 🤔

tl;dr: The AKS product group is not following Kubernetes best practices regarding resource reservations and puts customer workloads at risk (kernel OOM killer). https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits?hl=en

aldycool commented 1 year ago

I know the system node pools recommendation is already in the docs: https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli#system-and-user-node-pools, but I think it needs more explanation of why those restrictions exist in the first place. Of course it already mentions information like the minimum CPU, memory, etc. But I think for more advanced customers, more detailed technical reasons should also be mentioned there, especially the current behavior of these AKS pods, so customers know in advance what they are getting into.

mblaschke-daimlertruck commented 1 year ago

The AKS open-source mesh addon (OSM) doesn't reserve any resources for the Envoy sidecars either. That means the OSM addon is not really production-ready, as it could trigger kernel OOM killers.

keithmattix commented 1 year ago

> The AKS open-source mesh addon (OSM) doesn't reserve any resources for the Envoy sidecars either. That means the OSM addon is not really production-ready, as it could trigger kernel OOM killers.

The OSM addon actually allows users to change the sidecar resource requests/limits via its MeshConfig CRD (see the Config v1alpha2 API Reference). We do this because the resource utilization of the sidecar varies wildly depending on the cluster.
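
For anyone trying this, a rough sketch of such an override (the CRD and field names follow my reading of the Config v1alpha2 API reference, and the resource values are placeholders; verify against your installed OSM/AKS addon version):

```yaml
# Sketch of an OSM MeshConfig override for the Envoy sidecar resources.
apiVersion: config.openservicemesh.io/v1alpha2
kind: MeshConfig
metadata:
  name: osm-mesh-config
  namespace: kube-system         # the OSM controller namespace may differ (e.g. osm-system)
spec:
  sidecar:
    resources:
      requests:
        cpu: 100m                # example values, not defaults
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```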

mblaschke-daimlertruck commented 1 year ago

@keithmattix If I deploy OSM via the Helm chart, I would expect that there are no sane defaults and that I have to take care of the service myself. My deployment, my responsibility, my "production readiness" problem.

But what is the scope of a "managed service"? As a managed-service user I would expect sane defaults and documentation on how to get the managed service into a production-ready state and how to tweak the defaults. But if the managed service does not provide production readiness by default, what is the benefit of using a managed service? So what does "GA/production ready" mean for all the AKS services?

That also applies for example to the calico addon.

keithmattix commented 1 year ago

I think I see your point; we could provide an arbitrary request as an upper bound to assist in scheduling. I am pretty hesitant to include a limit by default because of the potential for disruption (since Envoy runs as a sidecar, when it gets OOMKilled the application container gets restarted as well).
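
In concrete terms, that compromise would be a request-only stanza along these lines (values are placeholders, not proposed defaults):

```yaml
# A request without a limit: gives the scheduler a sizing hint for the
# Envoy sidecar while avoiding OOMKills from an artificial memory cap.
resources:
  requests:
    cpu: 100m          # placeholder values
    memory: 128Mi
```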

mblaschke-daimlertruck commented 1 year ago

We don't care so much about CPU overcommit, as CPU can be overcommitted but memory cannot. If we run out of memory it will trigger a kernel OOM killer (!= the cgroup/container OOM killer), which will randomly kill workloads with a SIGKILL and could also lead to data loss.

Currently all AKS clusters are deployed as if they would use swap, even though no swap is configured and swap is not recommended by Kubernetes for several reasons. So memory needs to be reserved, or we risk production workloads being killed by the kernel OOM killer with a SIGKILL.

Here is the current situation with an AKS 1.24 cluster with preview features (KEDA, VPA) also enabled:

| namespace | pod | type | memory overcommit/pod |
|---|---|---|---|
| kube-system | coredns-autoscaler-* | deployment | 944 MiB |
| kube-system | azure-policy-webhook-* | deployment | 150 MiB |
| kube-system | csi-azuredisk-node-* | daemonset | 340 MiB |
| kube-system | csi-azurefile-node-* | daemonset | 540 MiB |
| kube-system | cloud-node-manager-* | daemonset | 462 MiB |
| kube-system | coredns-* | deployment | 430 MiB |
| kube-system | keda-operator-* | deployment | 900 MiB |
| kube-system | keda-operator-metrics-apiserver-* | deployment | 900 MiB |
| kube-system | konnectivity-agent-* | deployment | 108 MiB |
| kube-system | metrics-server-* | deployment | 270 MiB |
| kube-system | ama-logs-* | daemonset | 1224 MiB |
| kube-system | ama-logs-rs-* | deployment | 774 MiB |
| kube-system | osm-controller-* | deployment | 832 MiB |
| kube-system | vpa-admission-controller-* | deployment | 300 MiB |
| kube-system | vpa-recommender-* | deployment | 500 MiB |
| kube-system | vpa-updater-* | deployment | 500 MiB |
| gatekeeper-system | gatekeeper-audit-* | deployment | 2816 MiB |
| gatekeeper-system | gatekeeper-controller-* | deployment | 1792 MiB |

So our testing cluster (with AKS 1.24) was spun up with 5 nodes, and AKS is overcommitting memory by 27626 MiB.

Every node is overcommitted by ~2500 MiB of memory via daemonsets (independent of the pool type).

Using system nodes without any additional payload does not actually solve it, as even AKS system pods (e.g. from deployments) are at risk of being killed by the kernel OOM killer.

E.g. coredns-autoscaler runs with a request of 10 MiB but is allowed to use up to 1000 MiB before getting killed.

For better understanding: if your hard drive is full, you cannot squeeze more bytes onto it and your writes will fail. The same applies to memory, but here the kernel just sends you a not-so-friendly OOM killer to resolve the situation.

As an AKS customer I would like Microsoft to take care that my production workload does not become a victim of a SIGKILL from the kernel OOM killer.

nickbrennan1 commented 1 year ago

We appear to have an issue where we're using defined requests=limits somewhat below the allocation threshold (112 G of limits on 115 G allocatable, with worker nodes sized at 125 G), causing significant service outages when the OOM killer kills our application after some AKS internal process starts operating under load.

If we're using Guaranteed pods for the application, it should be possible to place the same limits on the AKS pods so we can effectively tune overall memory for the required workload, or at the very least have clearly defined terms of reference for the AKS requirements, as Markus has detailed above.

apgapg commented 1 year ago

[screenshot: kube-system resource usage on the nodes]

My kube-system pods take up too many resources on the nodes. This leaves very few resources available for deploying my own pods, which means I need multiple nodes in the node pool for just a few pods.

kaarthis commented 1 year ago

@olsenme, @robbiezhang can one of you respond?

RooMaiku commented 1 year ago

Hello to everyone in this thread👋

This issue is recognized as a legitimate problem, and we are actively working on a design to enable more intelligent scaling for AKS-maintained pods, no matter the cluster size and composition. There is a fair amount of complexity inherent to a general solution, as there is no one-size-fits-all scaling strategy for each AKS managed addon.

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

marcindulak commented 1 year ago

:eyes:

olemarkus commented 10 months ago

Any updates on where you are with resolving this problem?

miztiik commented 7 months ago

Can we have an update?

rajivreddy commented 5 months ago

We still have the same issue:

aks-default-98530742-vmss000000   *                   *                                           1025m (53%)    6950m (365%)     1660Mi (36%)      12380Mi (272%)
aks-default-98530742-vmss000000   kube-system         aks-secrets-store-csi-driver-vx4df          70m (3%)       300m (15%)       140Mi (3%)        500Mi (11%)
aks-default-98530742-vmss000000   kube-system         aks-secrets-store-provider-azure-zcfbj      50m (2%)       100m (5%)        100Mi (2%)        100Mi (2%)
aks-default-98530742-vmss000000   kube-system         ama-logs-ttk26                              170m (8%)      1100m (57%)      600Mi (13%)       1874Mi (41%)
aks-default-98530742-vmss000000   kube-system         ama-metrics-ksm-84dbc9cbc8-4d9z6            5m (0%)        1000m (52%)      50Mi (1%)         5120Mi (112%)
aks-default-98530742-vmss000000   kube-system         ama-metrics-node-n7fjz                      70m (3%)       700m (36%)       180Mi (3%)        1524Mi (33%)
aks-default-98530742-vmss000000   kube-system         azure-ip-masq-agent-zpggf                   100m (5%)      500m (26%)       50Mi (1%)         250Mi (5%)
aks-default-98530742-vmss000000   kube-system         azure-npm-6m56g                             250m (13%)     250m (13%)       300Mi (6%)        1000Mi (22%)
aks-default-98530742-vmss000000   kube-system         cloud-node-manager-s4tzx                    50m (2%)       0m (0%)          50Mi (1%)         512Mi (11%)
aks-default-98530742-vmss000000   kube-system         coredns-fb6b9d95f-6kqwt                     100m (5%)      3000m (157%)     70Mi (1%)         500Mi (11%)
aks-default-98530742-vmss000000   kube-system         csi-azuredisk-node-4dmbq                    30m (1%)       0m (0%)          60Mi (1%)         400Mi (8%)
aks-default-98530742-vmss000000   kube-system         csi-azurefile-node-jb9kg                    30m (1%)       0m (0%)          60Mi (1%)         600Mi (13%)
aks-default-98530742-vmss000000   kube-system         kube-proxy-b5mdm                            100m (5%)      0m (0%)          0Mi (0%)          0Mi (0%)
microsoft-github-policy-service[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.