Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] kube-system pods reserve 35 % of allocatable memory on a 4 GB node #3525

Closed: nemobis closed this issue 3 months ago

nemobis commented 1 year ago

Describe the bug
On AKS with Kubernetes 1.24, a node with 4 GB of RAM capacity only has 2157 MiB allocatable; yet kube-system alone reserves some 750 MB (of which 550 MB for azure-cns and azure-npm), leaving less than 1400 MiB available for requests by other pods.

To Reproduce
Steps to reproduce the behavior:

  1. Create a node pool with nodes having 4 GB memory
  2. Check kube-capacity or kubectl describe node on a recently created node (example commands just after this list)
  3. Optionally inspect actual resource usage over time with the node-exporter metrics on Prometheus and something like the Kubernetes Monitor Grafana dashboard
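
For step 2, a minimal way to compare capacity against allocatable (the node name is a placeholder; kube-capacity gives a similar per-node summary):

kubectl describe node <node-name> | grep -A 6 -E '^(Capacity|Allocatable)'
kubectl get node <node-name> -o jsonpath='{.status.allocatable.memory}'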

Example node:

Addresses:
  InternalIP:  10.<redacted>
  Hostname:    aks-userpool2-11<redacted>
Capacity:
  cpu:                2
  ephemeral-storage:  259966896Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025836Ki
  pods:               100
Allocatable:
  cpu:                1900m
  ephemeral-storage:  239585490957
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2209260Ki
  pods:               100
System Info:
  Machine ID:                 e619<redacted>
  System UUID:                95e<redacted>
  Boot ID:                    5627<redacted>
  Kernel Version:             5.4.0-1098-azure
  OS Image:                   Ubuntu 18.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.4+azure-4
  Kubelet Version:            v1.24.6
  Kube-Proxy Version:         v1.24.6
ProviderID:                   azure:///<redacted>
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                  ------------  ----------  ---------------  -------------  ---
  datadog-agent               datadog-agent-<redacted>                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
  kube-system                 azure-cns-<redacted>                                   40m (2%)      40m (2%)    250Mi (11%)      250Mi (11%)    38m
  kube-system                 azure-npm-<redacted>                                   250m (13%)    251m (13%)  300Mi (13%)      400Mi (18%)    38m
  kube-system                 cloud-node-manager-<redacted>                          50m (2%)      0 (0%)      50Mi (2%)        512Mi (23%)    38m
  kube-system                 csi-azuredisk-node-<redacted>                          30m (1%)      0 (0%)      60Mi (2%)        400Mi (18%)    38m
  kube-system                 csi-azurefile-node-<redacted>                         30m (1%)      0 (0%)      60Mi (2%)        600Mi (27%)    38m
  kube-system                 kube-proxy-<redacted>                                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         38m
  kube-system                 node-local-dns-<redacted>                              25m (1%)      0 (0%)      5Mi (0%)         0 (0%)         38m
...
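
Adding up the kube-system memory requests in the (truncated) listing above: 250 + 300 + 50 + 60 + 60 + 0 + 5 = 725 Mi, which is where the "some 750 MB" figure in the description comes from; azure-cns and azure-npm account for 550 Mi of that. (My arithmetic, based on the listing above.)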

Expected behavior
A node with 4 GB of RAM should be able to be assigned a pod that requests 1600 MB of RAM (e.g. for Prometheus). (I'm talking about requests, not limits.)

Screenshots
(attached image: Screenshot_20230309_120410)

Additional context

There's been a lot of discussion about what the requests and limits should be for the various components, but in this case the issue is only with the value of the allocatable memory, so I believe it's orthogonal. If everything in kube-system is already requesting way more memory than it needs most of the time, there's no need for such a huge buffer on top. At the very least it should be configurable, or the really available memory should be made clearer, so that people can size their workloads and node pools accordingly without tinkering with eviction thresholds.

https://github.com/Azure/AKS/issues/1339 https://github.com/Azure/AKS/issues/2125 https://github.com/Azure/AKS/issues/3348 https://github.com/Azure/AKS/issues/3496

I think it's unrelated to https://github.com/Azure/AKS/issues/3443

nemobis commented 1 year ago

Some of the relevant settings are supposed to be configurable in Kubernetes (see e.g. https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#enforcing-node-allocatable), but according to what I've heard from Azure Support so far, they don't seem to be configurable on AKS.
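
For reference, these knobs are standard upstream kubelet settings; a minimal sketch of what overriding them directly would look like if AKS exposed the flags (the values here are purely illustrative, not AKS defaults):

kubelet \
  --kube-reserved=cpu=100m,memory=512Mi \
  --system-reserved=cpu=100m,memory=256Mi \
  --eviction-hard=memory.available<100Mi \
  --enforce-node-allocatable=pods

The same settings also exist as the kubeReserved, systemReserved, evictionHard and enforceNodeAllocatable fields of a KubeletConfiguration file.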

FlorentATo commented 1 year ago

@nemobis you can check the kubelet configuration yourself by running a debug pod on the node and looking at the process snapshot:

➜  ~ kubectl debug node/aks-systempool-21850828-vmss000000 -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
Creating debugging pod node-debugger-aks-systempool-21850828-vmss000000-chs4s with container debugger on node aks-systempool-21850828-vmss000000.
If you don't see a command prompt, try pressing enter.
root@aks-systempool-21850828-vmss000000:/# chroot /host
# bash
root@aks-systempool-21850828-vmss000000:/# ps fauxww | grep '/usr/local/bin/kubelet'

I ran into the same "issue"; a VM with only 4 GiB of memory (Standard_F2s_v2) reports the following:

➜  ~ k describe node aks-systempool-21850828-vmss000000
(...)
Capacity:
  cpu:                2
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025836Ki
  pods:               110
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2209260Ki
  pods:               110

According to the documentation, kubelet will reserve 25% of memory (i.e. 1GiB).

Indeed, using the method described above, you can see kubelet runs with the following flags:
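
On my node they boil down to roughly the following (paraphrased rather than copied verbatim, so treat the exact values as approximate):

/usr/local/bin/kubelet ... \
  --kube-reserved=memory=1024Mi \
  --eviction-hard=memory.available<750Mi \
  ...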

So in total 1816576 KiB of memory is reserved, and thus 4025836 - 1816576 = 2209260 KiB, i.e. the allocatable amount reported by AKS.

ghost commented 1 year ago

Action required from @Azure/aks-pm

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

microsoft-github-policy-service[bot] commented 7 months ago

Issue needing attention of @Azure/aks-leads

stl327 commented 3 months ago

Hello, beginning with the AKS 1.29 preview and onwards, we shipped changes to the eviction threshold and the memory reservation for kube-reserved. The new memory reservation is set to the lesser of: (20 MB * the maximum pods supported on the node) + 50 MB, or 25% of the total system memory. The new eviction threshold is 100Mi. See more information here. These changes help reduce the resources consumed by AKS and can deliver up to 20% more allocatable space, depending on your pod configuration. Thanks!
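
As a rough back-of-the-envelope illustration (my arithmetic, assuming the 25% is taken over the node's total memory), for a 4 GB node with pods: 100 like the one in the original report:

old (pre-1.29):  1024 Mi kube-reserved + 750 Mi eviction threshold = ~1774 Mi reserved
new (1.29+):     min(20 MB * 100 + 50 MB, 25% of ~4122 MB) = min(2050 MB, ~1031 MB) = ~983 Mi kube-reserved
                 983 Mi + 100 Mi eviction threshold = ~1083 Mi reserved

That frees up several hundred MiB of additional allocatable memory on a node this small; the exact gain depends on the VM size and the max-pods setting.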