Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] ama-logs-windows DaemonSet may be over committing CPU #4187

Open AdelRefaat opened 7 months ago

AdelRefaat commented 7 months ago

Describe the bug

Enabling Azure Monitor on AKS with Windows node pools creates AMA containers with high CPU requests (900m in total).

To Reproduce

  1. Enable Azure Monitor on an AKS cluster with a Windows node pool.
  2. Check the CPU resource requests of the containers created by the ama-logs-windows DaemonSet: ama-logs-windows and addon-token-adapter-win (see the example commands after this list).
  3. ama-logs-windows requests 500m.
  4. addon-token-adapter-win requests 400m.
  5. Actual utilization of ama-logs-windows is about 40m (far from the 500m request).
  6. Actual utilization of addon-token-adapter-win is about 106m (far from the 400m request).
  7. Because of this overcommitting, a DS2_v2 node pool (2 CPUs per node) that had 8-9 nodes before enabling Azure Monitor easily grows to 18 nodes, almost double.
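
A minimal way to reproduce the comparison in steps 2-6, assuming the addon pods run in the kube-system namespace and the metrics server is available; the resource group and cluster names are placeholders:

```
# Enable the monitoring addon on an existing cluster
az aks enable-addons --resource-group <resource-group> --name <cluster-name> --addons monitoring

# List the CPU requests declared by each container in the ama-logs-windows DaemonSet
kubectl get daemonset ama-logs-windows -n kube-system \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}'

# Compare with actual per-container usage reported by the metrics server
kubectl top pod -n kube-system --containers | grep ama-logs-windows
```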

Expected behavior

Screenshots

[Screenshots: CPU requests in the default ama-logs-windows DaemonSet YAML]

[Screenshots: max CPU usage/utilization from Container Insights]

Environment (please complete the following information):

Additional context

This happens only on Windows node pools; the Linux AMA containers request around 170m CPU, which is reasonable.

Side note (may not be related): the Cluster Autoscaler was unable to scale down empty nodes running only kube-system pods while Azure Monitor was enabled and the ama-logs-windows containers were present. Once Azure Monitor was disabled and ama-logs-windows was removed, the autoscaler scaled down as expected. I don't know whether this is related.

ganga1980 commented 7 months ago

Thanks, @AdelRefaat for the feedback. We will triage and address this.

AdelRefaat commented 7 months ago

Just to put this in perspective: resolving this issue will mean significant cost and energy savings for everyone using Azure Monitor on AKS with Windows node pools.

If we assume the CPU requests are adjusted to ≈ 150m, the potential saving is about 900m - 150m = 0.75 CPU per VM.

Example: for a node pool of 20 DS2_v2 VMs (2 CPUs each), the potential saving is 20 VM * 0.75 CPU/VM = 15 CPUs ≈ 7 virtual machines (i.e. a 35% saving). The new node pool size would be ≈ 13 VMs instead of 20.
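
The same back-of-the-envelope arithmetic as a quick shell sketch (the 150m target request is an assumption, not a confirmed value):

```
# Hypothetical savings if the Windows AMA CPU requests dropped from 900m to ~150m
per_vm_saving_m=$((900 - 150))               # 750m freed per Windows node
total_saving_m=$((20 * per_vm_saving_m))     # 15000m = 15 CPUs across a 20-node pool
vms_saved=$((total_saving_m / 2000))         # DS2_v2 has 2 CPUs = 2000m per VM
echo "CPUs freed: $((total_saving_m / 1000)), VMs saved: ~${vms_saved}"   # 15 CPUs, ~7 VMs
```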

Hope this can be addressed soon.

sbhattach commented 6 months ago

Is there any ETA for this fix?

AdelRefaat commented 6 months ago

In the absence of an update on this issue, I will share what I know. I already opened a ticket with Microsoft Support about it, and after a lot of back and forth they finally told me a fix is supposed to be rolled out by the end of May. Of course I am not sure about this, but it is all I know.

Again, this is a very resource- and money-wasteful issue that I hope will be addressed with proper priority.

All AKS users with Windows nodes and Azure Monitor enabled are paying for unused resources! See my earlier comment: https://github.com/Azure/AKS/issues/4187#issuecomment-2036503312

JoeyC-Dev commented 6 months ago

@AdelRefaat Have you reached out to the billing team about this issue? I think if this is a verified bug, you can ask for some refund on this.

AdelRefaat commented 6 months ago

> @AdelRefaat Have you reached out to the billing team about this issue? I think if this is a verified bug, you can ask for some refund on this.

@JoeyC-Dev Thanks, but how many others are still paying for this right now? The company I work for has about 4 clusters; I discovered this while reviewing them 😟

JoeyC-Dev commented 6 months ago

> @AdelRefaat Have you reached out to the billing team about this issue? I think if this is a verified bug, you can ask for some refund on this.
>
> @JoeyC-Dev Thanks, but how many others are still paying for this right now? The company I work for has about 4 clusters; I discovered this while reviewing them 😟

I guess we will never know, because a situation like this depends on the specific scenario. For a company that never uses autoscaling, this would not be an extra cost anyway.

vdiec commented 6 months ago

@AdelRefaat thank you for the feedback! We are looking to have this addressed in June and we will update this thread if anything changes

AdelRefaat commented 6 months ago

> @AdelRefaat thank you for the feedback! We are looking to have this addressed in June and we will update this thread if anything changes

Thanks for the update @vdiec

sbhattach commented 5 months ago

I don't know if anyone else has noticed restarts in ama-logs-windows, but we have seen multiple restarts where the addon-token-adapter-win container crashes with the following information in the logs:

2024/05/13 10:21:53 helpers.go:88: received event type ADDED
2024/05/13 10:21:54 cmd.go:132: error setting up port proxy rule: failed to assign IP to veth interface when executing command
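
To capture the crash reason after a restart, something like the following should work (the pod name is a placeholder):

```
# Locate the restarting pods, then read the previous addon-token-adapter-win container's logs
kubectl get pods -n kube-system -o wide | grep ama-logs-windows
kubectl logs <ama-logs-windows-pod-name> -n kube-system -c addon-token-adapter-win --previous
```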

tgoutham20 commented 5 months ago

Do we have any ETA for the fix?

vdiec commented 5 months ago

@tgoutham20 this will be included in our next release and ETA is end of June. I will update this thread once the fix is released

SebSa commented 4 months ago

Thank you for the update @vdiec

david-garcia-garcia commented 4 months ago

I also noticed huge resource consumption for the AMA Metrics pods.

vdiec commented 4 months ago

@david-garcia-garcia can you please create a separate issue for ama metrics pods?

The ama-logs-windows change is rolling out this month.

jason-berk-k1x commented 3 months ago

I"m not running windows nodes, but I am running perfect scale and my cluster CPU request is 4x more than it should be because of these daemonsets:

is there any way to adjust the cpu and memory requests and limits of those daemonsets....or the replica count (which seems to be set to 8?

vdiec commented 3 months ago

@jason-berk-k1x For ama-logs, the replica count is only 1. The CPU and memory requests and limits are not adjustable, but we are working on integrating VPA (Vertical Pod Autoscaler) to address this (a sketch of the general VPA approach follows below).

@aritraghosh can you address the other daemonsets?
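
For reference, a minimal sketch of the general VPA approach mentioned above, assuming the VPA CRDs are installed in the cluster; this is illustrative only, since AKS reconciles managed addon manifests and the actual integration may look different:

```
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ama-logs-windows-vpa        # hypothetical name
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: ama-logs-windows
  updatePolicy:
    updateMode: "Auto"              # let VPA set requests based on observed usage
EOF
```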

ganga1980 commented 1 month ago

This has been fully rolled out.