Open · mblaschke opened this issue 3 years ago
Hi mblaschke, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Triage required from @Azure/aks-pm
@az-policy-kube would you be able to assist?
Author | mblaschke
---|---
Assignees | -
Labels | `action-required`, `addon/policy`, `azure/policy`, `triage`
Milestone | -
Here is a document that describes the resource reservations for the specific K8s components.
https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations
I would recommend using a system node pool for the system pods (deployments) and an application node pool for customer workloads. This will reduce the impact of the system components on their applications. https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools
The challenge here is the daemonsets: they run on all nodes, and we need to support various kinds of clusters, small and large, with different VM SKUs on different agent pools. For example, the smallest VM SKU we support is F2, which has 4 GB of memory and ~2 GB allocatable. We usually set the memory request to the minimum footprint of the pod and set the memory limit to a large number taken from the “large cluster” scaling test. If we increased the request to match the limit, the daemonset pods might not fit on the F2 VM.
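That strategy can be sketched as a container `resources` stanza; the values below are illustrative placeholders, not the actual AKS settings:

```yaml
# Illustrative only: request = minimum footprint so the pod still fits on a
# small F2 node (~2 GB allocatable); limit = worst case observed in
# large-cluster scaling tests. Not the real AKS values.
resources:
  requests:
    memory: "225Mi"   # minimum footprint
  limits:
    memory: "1Gi"     # spike usage from scaling tests
```

The gap between the two numbers is exactly the per-pod overcommitment this issue is about.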
The omsagent memory usage depends on how much log output is written by all the pods running on the node, mostly contributed by the customer’s application pods.
Adding the Policy team to comment on their pods.
The problem here is: there is no memory overcommit. You can overcommit CPU but not memory (unless you have swap, which is strongly discouraged by Kubernetes).
So if the application Pods are using 99% of memory and eg. omsagent or even kube-proxy (which has no reservation at all) uses 2% more memory, this triggers the OOM killer on the host, which kills a container or application more or less at random.
IMHO the options are:
1) make it configurable (eg. implement an operator) how much the customer wants to reserve/limit (or maybe the customer doesn't care)
2) set request = limit
3) or let the customer deal with the fact that this can randomly trigger an OOM killer on the host
4) the customer has to deploy a pause pod which reserves eg. 500Mi, to deal with the fact that there are apps which do not reserve their memory properly
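Option 4 can be sketched as a minimal daemonset of pause containers; the name, image tag, and 500Mi reservation are placeholders:

```yaml
# Hypothetical "memory headroom" daemonset: a pause container whose only job
# is to keep memory reserved that unreserved AKS pods may burst into.
# requests == limits gives it Guaranteed QoS, so the scheduler keeps this
# amount free on every node. Name, image, and size are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: memory-headroom
spec:
  selector:
    matchLabels:
      app: memory-headroom
  template:
    metadata:
      labels:
        app: memory-headroom
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            memory: "500Mi"
          limits:
            memory: "500Mi"
```

This trades capacity for safety: the reserved memory is never usable by workloads, it just stops the node from reaching 100%.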
I'm not that happy about the statement regarding system and user node pools. The argument (from the Microsoft consultant) for AKS was that the customer doesn't have to care about the masters, and now it feels like there are "masters" (sort of) again. This would mean that system nodes cannot be used for anything else (because they are "unstable") and would reduce the total pool count by one (only 9 pools left for applications). With proper memory reservation there is no real reason not to use these nodes :)
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@juan-lee @robbiezhang
prometheus-operator always alerts on those Pods:
52.5% throttling of CPU in namespace kube-system for container openvpn-client in pod aks-link-7676547cc6-5lfgd.
This should be fixed, or we should at least be able to configure it.
@Kaarthis FYI.
Applications might use more memory at startup and less later, after GC. If we set the memory request/limit low, the pod may be OOMKilled (as highlighted) and fail to start. If AKS sets the request/limit high enough for startup, memory utilization may be low at runtime, wasting the VM's compute power. Therefore, the AKS strategy is to set the request close to the regular usage and the limit to the spike usage.
The memory consumption patterns of different applications are different, and they differ between clusters as well. AKS today provides a high limit based on some estimation analysis. Having said that, there is definitely an opportunity to provide configurable memory limits for customers. This is something we shall consider in our roadmap this year, though there is no immediate solution available.
In Kubernetes you can use nearly 100% of a node's memory by reserving it for Pods. Applications can therefore reserve memory up to the limit of what they actually use and so improve performance.
Unlike CPU, there is no memory overcommitment: if Pods are using 99% of the VM's memory (because it was reserved that way), the AKS-maintained Pods can overcommit (use more than they reserved) and trigger the kernel OOM killer (instead of the cgroup memory killer), which kills innocent Pods/containers.
To solve it: don't overcommit memory, because there is no way to overcommit it (there is no swap).
Workaround: the customer has to reserve eg. 5% to 10% of memory on every system node, because AKS does not take care of memory reservations. We're not talking about a few megabytes; we're talking about more than 1 GiB of memory.
Either solve your memory reservation problem or make it configurable for the customer. Simply overcommitting a resource that is not overcommittable is not really a professional way to run a managed Kubernetes cluster.
I was wondering if this might result in issues after node upgrades via the Azure portal, when memory usage might spike above 100%. Especially when automatic node scaling is disabled.
I've seen one node, for example, where a few pods were scheduled based on their requests, but their limits caused huge overcommitment.
It is likely a choice whether or not you want to set the memory request a bit closer to the limit in order to prevent too much overcommitment and run reliably under a large memory workload. Of course, the tradeoff is that you will need more or better-spec nodes. It depends on whether you are using the node for development and testing or for production; with production you might want a tighter request and limit spec.
What I also encounter is one node using memory above 100%, as reported by kubectl top nodes. This might be because the node count is fixed, so the pool cannot autoscale. However, if you add one more node, it does not evict the pod and schedule it on the new node.
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
bump
example gatekeeper-system (runs only on system nodes):
- audit (1 Pod): 256 Mi reserved, 3 Gi limit
- controller (2 Pods): 256 Mi reserved, 2 Gi limit
so gatekeeper can use up to 6.25 Gi of memory which is never reserved.
example kube-system/omsagent-* (runs on all nodes), 1 Pod per node:
- container omsagent: 325 Mi reserved, 750 Mi limit
- container omsagent-prometheus: 225 Mi reserved, 1 Gi limit
committed memory: 550 Mi; overcommitted memory: 1.2 Gi (worst case); average memory usage in our clusters: ~800 Mi
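The arithmetic behind these figures (overcommit = limit minus request, summed per container) can be checked in a few lines:

```python
# Verify the overcommit numbers quoted above, all in Mi.
GI = 1024  # Mi per Gi

def overcommit(containers):
    """Sum of (limit - request) over a pod's containers, in Mi."""
    return sum(limit - request for request, limit in containers)

# gatekeeper-system: audit (1 pod, 256 Mi / 3 Gi) + controller (2 pods, 256 Mi / 2 Gi)
gatekeeper_unreserved_mi = (
    1 * overcommit([(256, 3 * GI)]) +
    2 * overcommit([(256, 2 * GI)])
)
print(gatekeeper_unreserved_mi / GI)  # -> 6.25 (Gi that can be used but is never reserved)

# kube-system/omsagent-*: one pod per node, two containers
omsagent_overcommit_mi = overcommit([(325, 750),      # omsagent
                                     (225, 1 * GI)])  # omsagent-prometheus
print(omsagent_overcommit_mi)  # -> 1224 Mi (~1.2 Gi) worst-case overcommit per node
```
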
I'm running into this too. I'm using Standard_B2s with 4 GiB of memory. These are the kube-system pod requests/limits:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1158m (60%)   10550m (555%)
  memory             1982Mi (91%)  11724Mi (543%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
```
In reality they use only about 40%, but simply adding a new pod with any sort of memory request causes the node to overcommit, and the pod fails with OOMKilled status. Running multiple instances of the same pod without resource requests/limits works fine.
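A rough headroom estimate can be backed out of the node description above: kubectl reports the 1982 Mi of requests as 91% of allocatable, so (allowing for kubectl's percentage rounding) allocatable on this B2s node is roughly:

```python
# Back-of-envelope headroom from the "Allocated resources" output above.
# kubectl rounds the percentage, so these are approximations.
requested_mi = 1982
fraction_of_allocatable = 0.91

allocatable_mi = requested_mi / fraction_of_allocatable
headroom_mi = allocatable_mi - requested_mi

print(round(allocatable_mi))  # ~2178 Mi allocatable on the 4 GiB B2s node
print(round(headroom_mi))     # ~196 Mi left for user pod requests
```

With under 200 Mi of schedulable memory left, almost any pod with a memory request fails to fit, matching the behaviour described.
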
:eyes:
I've tried patching the daemonsets/deployments with lower values, but they reset back to the original ones after re-deployment. This greatly reduces how much of the (paid) resources customers can actually use...
@aldycool Lower values might not help, as you will run into memory overcommitment. Memory CANNOT be overcommitted, but CPU can (containers might be slowed or starved of processing power).
If you don't reserve the memory you're using and hit the 100% limit, the kernel will randomly kill your containers with a SIGKILL. This issue is about the fact that the AKS services don't reserve the memory they are using, so you have to be careful with your memory usage.
Kubernetes can and will use all the memory you're reserving; if you overcommit memory you're risking the kernel OOM killer.
(You could enable swap, but Kubernetes recommends against this as you will no longer get consistent performance.)
@mblaschke Yes, actually my point is that I was surprised that we as customers cannot manually change the preset values of the Azure-specific resources, either memory or CPU. In my case, the offending part is the CPU limits reserved for the AKS resources. By default, Azure AKS deploys these containers with huge CPU limits (instance count assumes 3 nodes in a node pool):
If my nodes only have 4 CPUs each, the leftover CPU limit per node is only 1500m (4000m is consumed by cloud-node-manager + azure-ip-masq-agent). I realized this because my Prometheus is throwing CPU overcommit alerts:
```
Labels
  alertname = KubeCPUOvercommit
  prometheus = monitoring/kube-prometheus-stack-prometheus
  severity = warning
Annotations
  description = Cluster has overcommitted CPU resource requests for Pods by 3.1319999999999997 CPU shares and cannot tolerate node failure.
  runbook_url = https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit
  summary = Cluster has overcommitted CPU resource requests.
```
Is there already some idea how to deal with this? In my opinion these ranges make no sense at all. We want to follow some best practices (e.g. at most 125% CPU overcommitted), but with these ranges we already have nearly 40% committed on an 8 vCPU machine just because of these Pods.
Hi @romlod, others
This feedback is valid. However, I'd like to understand a bit further about the overcommitment.
We strongly recommend customers use a dedicated system node pool for their cluster to host these system pods. This is to ensure system stability and to reserve (I recognize the overcommitment) resources to reliably handle peaks in load. Why are there peaks in consumption? Partly because we don't own the code for many of these components, so spikes in usage aren't something we can always tune. And they happen.
If you run your user pods in the same nodepool, you risk saturating resources on the system and causing system instability. E.g. one runaway pod could result in downtime for all of the cluster if you evict or fail the CoreDNS pods.
Given this, we preferred reliability over utilization and did overcommit resources. In @romlod's case, you could run a system node pool with 4 vCPU machines and improve utilization while you run your applications on a user pool. Does this help with your concern?
I still recognize that you should be able to customize some of these settings in case you are running a very busy cluster and the limits set are not accommodative. But I don't know if I fully understand why overcommitment is an issue. E.g. if we were to set requests=limits, then you couldn't run a cost-optimized small cluster and would at least need a 4 vCPU x 3 nodepool just for running system pods.
Hello. We use a dedicated system node pool. However, even a node in a user node pool minimally contains pods like omsagent, csi-*, kube-proxy, cloud-node-manager and azure-ip-masq-agent (maybe even more). If you look at the cloud-node-manager pod, it has a CPU request of 50m but can burst to 2000m.
With numbers like these, a system node pool wouldn't solve anything, because these pods are on every node by default and you still have to take them into account.
Next to that, if this is the default, then I really think the documentation should be changed so that everyone can 1) do a proper sizing and 2) know what they are dealing with when choosing AKS.
@seguler AKS pods run overcommitted even on non-system nodes. As a possible fix, AKS users have to run a pause container which just reserves the overcommitted memory, and run only AKS pods on system nodes and nothing else. This raises the question of why system nodes are not tainted against other pods if they should not be used for other workloads.
But even on system nodes the AKS pods overcommit memory, and fixing that would require swap to be enabled (a very, very bad solution). If AKS pods use too much memory they can trigger a kernel OOM killer, which could even mean losing logs or causing service outages.
see https://github.com/Azure/AKS/issues/2125#issuecomment-1176242678
another example: calico-node, calico-typha and tigera-operator come totally WITHOUT resource reservations and so consume unreserved memory on ALL node pools. Calico could consume 100% of a node's memory and so lead to a total service outage.
I'm wondering why nobody from the AKS team is actually responding to the actual issue, but only referencing "use system nodes" 🤔
tl;dr: the AKS product group is not following Kubernetes best practices regarding resource reservations and puts customer workloads at risk (kernel OOM killer). https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits?hl=en
I know the system node pools recommendation is already in the docs: https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli#system-and-user-node-pools, but I think it needs more explanation of why those restrictions exist in the first place. It already mentions information like the minimum CPU, memory, etc., but for more advanced customers the detailed technical reasons should also be given there, especially the current behavior of these AKS pods, so customers know in advance what they are getting into.
the AKS open-source mesh addon (OSM) doesn't reserve any resources for the Envoy sidecars either. That means the OSM addon is not really production-ready, as it could trigger kernel OOM killers.
The OSM addon actually allows users to change the sidecar resource requests/limits via a CRD (see the Config v1alpha2 API reference). We do this because the resource utilization of the sidecar varies wildly depending on the cluster.
@keithmattix if I deploy OSM via Helm chart, I would expect that there are no sane defaults and that I have to take care of the service myself. My deployment, my responsibility, and my "production readiness" problem.
But what is the scope of a "managed service"? As a managed-service user, I would expect sane defaults plus documentation on how to get the managed service into a production-ready state and how to tweak the defaults. If the managed service does not provide production readiness by default, what is the benefit of using a managed service at all? So what does "GA/production ready" mean for all the AKS services?
That also applies, for example, to the Calico addon.
I think I see your point; we could provide a reasonable default request as a baseline to assist in scheduling. I am pretty hesitant to include a limit by default because of the potential for disruption (since Envoy runs as a sidecar, when it gets OOMKilled the application container gets restarted as well).
we don't care so much about CPU overcommit, as CPU can be overcommitted but memory cannot. If we run out of memory it triggers a kernel OOM killer (!= cgroup/container OOM killer), which randomly kills workloads with a SIGKILL and could also lead to data loss.
Currently all AKS clusters are deployed as if they could use swap, even though no swap is configured and swap is not recommended by Kubernetes for several reasons. So memory needs to be reserved, or we risk production workloads being killed by the kernel OOM killer with a SIGKILL.
here the current situation with an AKS 1.24 cluster with previews (keda, vpa) also enabled:
namespace | pod | type | memory overcommit/pod
---|---|---|---
kube-system | coredns-autoscaler-* | deployment | 944 MiB |
kube-system | azure-policy-webhook-* | deployment | 150 MiB |
kube-system | csi-azuredisk-node-* | daemonset | 340 MiB |
kube-system | csi-azurefile-node-* | daemonset | 540 MiB |
kube-system | cloud-node-manager-* | daemonset | 462 MiB |
kube-system | coredns-* | deployment | 430 MiB |
kube-system | keda-operator-* | deployment | 900 MiB |
kube-system | keda-operator-metrics-apiserver-* | deployment | 900 MiB |
kube-system | konnectivity-agent-* | deployment | 108 MiB |
kube-system | metrics-server-* | deployment | 270 MiB |
kube-system | ama-logs-* | daemonset | 1224 MiB |
kube-system | ama-logs-rs-* | deployment | 774 MiB |
kube-system | osm-controller-* | deployment | 832 MiB |
kube-system | vpa-admission-controller-* | deployment | 300 MiB |
kube-system | vpa-recommender-* | deployment | 500 MiB |
kube-system | vpa-updater-* | deployment | 500 MiB |
gatekeeper-system | gatekeeper-audit-* | deployment | 2816 MiB |
gatekeeper-system | gatekeeper-controller-* | deployment | 1792 MiB |
so our testing cluster (with AKS 1.24) was ramped up to 5 nodes, and AKS is overcommitting memory by 27626 MiB.
Every node is overcommitted by ~2500 MiB of memory via daemonsets alone (independent of pool type).
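The per-node figure follows from the daemonset rows of the table above (deployments are excluded since their replica placement varies):

```python
# Per-node overcommit contributed by daemonsets, from the table above (Mi).
daemonset_overcommit_mi = {
    "csi-azuredisk-node": 340,
    "csi-azurefile-node": 540,
    "cloud-node-manager": 462,
    "ama-logs": 1224,
}

per_node_mi = sum(daemonset_overcommit_mi.values())
print(per_node_mi)  # -> 2566, consistent with the "~2500 MiB per node" figure
```
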
Using system nodes without any additional payload does not actually solve it, as even AKS system pods (eg. from deployments) are at risk of being killed by the kernel OOM killer.
eg: coredns-autoscaler runs with a request of 10 MiB but is allowed to use up to 1000 MiB before being killed.
For better understanding: if your hard drive is full, you cannot squeeze more bytes onto it and your writes will fail. The same applies to memory, except here the kernel just sends the not-so-friendly OOM killer to resolve the situation.
As an AKS customer, I would like Microsoft to make sure my production workload does not become the victim of a SIGKILL from the kernel OOM killer.
We appear to have an issue where we define requests=limits some way below the allocation threshold (a 112G limit against 115G allocatable on worker nodes sized at 125G) and still see significant service outages when the OOM killer kills our application after some AKS-internal process starts operating under load.
If we're using Guaranteed pods for the application, it should be possible to place the same limits on the AKS pods so we can effectively tune overall memory for the required workload, or at the very least have clearly defined terms of reference for the AKS requirements, as Markus has detailed above.
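For reference, the requests=limits pattern being asked for here is Kubernetes' Guaranteed QoS class; a minimal sketch with placeholder names and values:

```yaml
# Guaranteed QoS: requests == limits for every resource of every container.
# Such pods are the last eviction candidates and are only OOM-killed for
# exceeding their own limit, not because a neighbour bursts.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example   # placeholder name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"
        memory: "1Gi"
```

The tradeoff the AKS team describes elsewhere in this thread is real: applying this class to every addon pod would raise the minimum node size needed just to schedule the system components.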
My kube-system pods take too many resources on the nodes. This leaves very few resources available for deploying pods, which leads to multiple nodes in the node pool for just a few pods.
@olsenme , @robbiezhang can one of you respond ?
Hello to everyone in this thread👋
This issue is recognized as a legitimate problem, and we are actively working on a design to enable more intelligent scaling for AKS maintained pods no matter the cluster size and composition. There is a fair amount of complexity inherent to a general solution as there is no one size fits all scaling strategy for each AKS managed addon.
:eyes:
Any updates on where you are with resolving this problem?
Can we have an update?
We still have the same issue:
```
aks-default-98530742-vmss000000  *            *                                       1025m (53%)  6950m (365%)  1660Mi (36%)  12380Mi (272%)
aks-default-98530742-vmss000000  kube-system  aks-secrets-store-csi-driver-vx4df      70m (3%)     300m (15%)    140Mi (3%)    500Mi (11%)
aks-default-98530742-vmss000000  kube-system  aks-secrets-store-provider-azure-zcfbj  50m (2%)     100m (5%)     100Mi (2%)    100Mi (2%)
aks-default-98530742-vmss000000  kube-system  ama-logs-ttk26                          170m (8%)    1100m (57%)   600Mi (13%)   1874Mi (41%)
aks-default-98530742-vmss000000  kube-system  ama-metrics-ksm-84dbc9cbc8-4d9z6        5m (0%)      1000m (52%)   50Mi (1%)     5120Mi (112%)
aks-default-98530742-vmss000000  kube-system  ama-metrics-node-n7fjz                  70m (3%)     700m (36%)    180Mi (3%)    1524Mi (33%)
aks-default-98530742-vmss000000  kube-system  azure-ip-masq-agent-zpggf               100m (5%)    500m (26%)    50Mi (1%)     250Mi (5%)
aks-default-98530742-vmss000000  kube-system  azure-npm-6m56g                         250m (13%)   250m (13%)    300Mi (6%)    1000Mi (22%)
aks-default-98530742-vmss000000  kube-system  cloud-node-manager-s4tzx                50m (2%)     0m (0%)       50Mi (1%)     512Mi (11%)
aks-default-98530742-vmss000000  kube-system  coredns-fb6b9d95f-6kqwt                 100m (5%)    3000m (157%)  70Mi (1%)     500Mi (11%)
aks-default-98530742-vmss000000  kube-system  csi-azuredisk-node-4dmbq                30m (1%)     0m (0%)       60Mi (1%)     400Mi (8%)
aks-default-98530742-vmss000000  kube-system  csi-azurefile-node-jb9kg                30m (1%)     0m (0%)       60Mi (1%)     600Mi (13%)
aks-default-98530742-vmss000000  kube-system  kube-proxy-b5mdm                        100m (5%)    0m (0%)       0Mi (0%)      0Mi (0%)
```
What happened: Many AKS-maintained Pods are running with memory overcommitment. For example, OMS agent pods (from daemonset omsagent) run with a memory limit far above their request. Under high load, if node memory usage reaches eg. 99% (which is possible with Kubernetes), this might trigger an OOM killer on the host (not in the pod!) and so might affect other Pods!
What you expected to happen:
Be fair and set the memory limit and request to the same values. Don't overcommit memory.
How to reproduce it (as minimally and precisely as possible):
spin up AKS 1.18 or 1.19 with and without addons (eg. omsagent, policy agent, ...)
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.18.14 (also with 1.19.7)