Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS maintained Pods should not overcommit memory #2125

Open mblaschke opened 3 years ago

mblaschke commented 3 years ago

What happened: Many AKS-maintained Pods are running with overcommitted memory, i.e. their memory limits are far higher than their requests.

For example, the OMS agent pods (from the omsagent DaemonSet) are running with:

    resources:
      limits:
        cpu: 500m
        memory: 600Mi
      requests:
        cpu: 75m
        memory: 225Mi

Under high load, when node memory usage reaches e.g. 99% (which Kubernetes allows when limits exceed requests), this can trigger the OOM killer on the host (not in the pod!) and thus affect other Pods.
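
A quick way to see which managed pods are affected is to compare memory requests against limits across kube-system. A minimal audit sketch, assuming kubectl and jq are installed and the current context points at the AKS cluster:

    # List kube-system containers whose memory request and limit differ.
    kubectl get pods -n kube-system -o json \
      | jq -r '.items[]
          | .metadata.name as $pod
          | .spec.containers[]
          | select(.resources.requests.memory != .resources.limits.memory)
          | "\($pod)\t\(.name)\trequest=\(.resources.requests.memory // "none")\tlimit=\(.resources.limits.memory // "none")"'

Every line it prints is a container whose memory limit the node may not actually be able to honour.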

What you expected to happen:

Be fair and set the memory limit and request to the same value; don't overcommit memory:

    resources:
      limits:
        cpu: 500m
        memory: 600Mi
      requests:
        cpu: 75m
        memory: 600Mi
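
With the memory request equal to the limit, the scheduler reserves the full 600Mi up front, so these pods can no longer push the node into memory overcommit (CPU can still burst, so the pod stays Burstable rather than Guaranteed). Whether a node is currently overcommitted on memory shows up in its allocated-resources summary; a quick check, with a placeholder node name:

    # Pick a real node name from "kubectl get nodes".
    kubectl describe node aks-nodepool1-12345678-vmss000000 \
      | grep -A 8 "Allocated resources"

If the memory Limits percentage is well above 100% while Requests is far lower, the node is overcommitted in exactly the way described here.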

How to reproduce it (as minimally and precisely as possible):

Spin up AKS on 1.18 or 1.19, with and without managed services (e.g. omsagent, policy agent, ...); a sketch of the steps follows below.
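
For concreteness, a reproduction sketch using the Azure CLI; the resource group, cluster name, and node count are placeholders, and the add-on names are the ones az aks uses for the monitoring (omsagent) and Azure Policy agents:

    # Create a small cluster with the monitoring and Azure Policy add-ons enabled.
    az aks create \
      --resource-group myResourceGroup \
      --name myAKSCluster \
      --node-count 1 \
      --enable-addons monitoring,azure-policy \
      --generate-ssh-keys
    az aks get-credentials --resource-group myResourceGroup --name myAKSCluster

    # Inspect the resource settings the managed DaemonSet ships with.
    kubectl get daemonset omsagent -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[*].resources}'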

Anything else we need to know?:

Environment:

timpeeters commented 5 months ago

Issue is still present. Please don't close this.

microsoft-github-policy-service[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

zeno420 commented 3 months ago

Issue is still present. Please don't close this.

leandro-scardua commented 3 months ago

Issue is still present. Please don't close this.

kevinkrp93 commented 3 months ago

We are working on a feature to help with this. Will preview it sometime later in the year.

marcindulak commented 3 months ago

I hope one of the maintainers of this repo can add this feature to the roadmap, and also do something so it isn't auto-closed, as happened to other roadmap features (https://github.com/Azure/AKS/issues/3708).

SebSa commented 2 months ago

This issue raises questions about the value and hidden costs of Microsoft's AKS-integrated services.

There are msft-managed CPU requests defined in every AKS service pod manifest, which can quickly add up to 30% of your node CPU, forcing the scheduler to preserve msft's pods as opposed to the customer's.

Every scale-up a customer must do to accommodate this bloat conveniently puts more money in the pocket of the company responsible for the bloat, so how much incentive is there really to fix this?
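
For anyone who wants to quantify that reservation on their own cluster, here is a rough sketch, assuming kubectl and jq, and assuming CPU requests are expressed in millicores or whole cores (it counts DaemonSet replicas across all nodes, so divide by node count for a rough per-node figure):

    # Sum the CPU requests of every kube-system container, in millicores.
    kubectl get pods -n kube-system -o json \
      | jq '[.items[].spec.containers[].resources.requests.cpu // "0"
             | if endswith("m") then rtrimstr("m") | tonumber
               else tonumber * 1000 end]
            | add'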

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

whiskeysierra commented 1 month ago

What's the point of these bots...?

kevinkrp93 commented 1 month ago

Will be adding this to the Roadmap for tracking soon.

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

ngbrown commented 1 month ago

Is this issue on the roadmap yet?

nickbrennan1 commented 2 weeks ago

I'd say not, Nathan. It's funny: I recently launched a D2 node pool for some KeyVault testing for a customer and had to launch additional worker nodes because the default AKS bloatware was consuming up to 50% of the resources on each worker node...