AdelRefaat opened 7 months ago
Thanks, @AdelRefaat for the feedback. We will triage and address this.
Just to put this in perspective: when this issue is resolved, it will mean huge cost and energy savings for everyone using Azure Monitor on AKS with Windows node pools.
If we assume the CPU requests are adjusted to ≈ 150m, the potential savings are about 900m - 150m = 0.75 CPU per VM.
Example: for a node pool with 20 DS2_v2 VMs (2 CPU each):
Potential savings = 20 VM * 0.75 CPU/VM = 15 CPU ≈ 7 virtual machines (i.e. a 35% saving)
New node pool size ≈ 13 VMs instead of 20
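For anyone who wants to rerun this estimate for their own pool size, here is a rough sketch of the same arithmetic (the 900m and 150m figures are the assumptions from this comment, not official numbers):

```sh
# Rough estimate of reclaimable capacity if the AMA CPU requests were right-sized.
# All numbers are assumptions taken from the comment above; adjust for your pool.
CURRENT_REQ_M=900   # current AMA CPU requests per Windows node, in millicores
TARGET_REQ_M=150    # assumed right-sized requests, in millicores
NODES=20            # node pool size
CPU_PER_NODE=2      # vCPUs per node (DS2_v2)

SAVED_CPU=$(( (CURRENT_REQ_M - TARGET_REQ_M) * NODES / 1000 ))  # 15 vCPUs
SAVED_NODES=$(( SAVED_CPU / CPU_PER_NODE ))                     # ~7 nodes
echo "Reclaimable: ${SAVED_CPU} vCPUs, roughly ${SAVED_NODES} nodes"
```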
Hope this can be addressed soon.
Is there any ETA for this fix?
In the absence of an update on this issue, I will share what I know: I already opened a ticket with Microsoft Support about it, and after a lot of communication they finally told me a fix is supposed to be rolled out by the end of May. Of course I am not sure about this, but it is all I know.
Again, I would say this is an issue that wastes a lot of resources and money, and I hope it will be addressed with proper priority.
All AKS users with Windows nodes and Azure Monitor enabled are paying for unused resources! See my earlier comment: https://github.com/Azure/AKS/issues/4187#issuecomment-2036503312
@AdelRefaat Have you reached out to the billing team about this issue? If this is a verified bug, I think you can ask for some refund.
@JoeyC-Dev Thanks, but how many others are still paying for this right now? The company I work for has about 4 clusters; I discovered this while reviewing them 😟
Guess we'll never know, because a situation like this really depends on the scenario. For a company that never uses autoscaling, this wouldn't be an extra cost anyway.
@AdelRefaat thank you for the feedback! We are looking to have this addressed in June and we will update this thread if anything changes
Thanks for the update @vdiec
I don't know if anyone else has noticed these restarts in ama-logs-windows, but we have seen multiple restarts where the addon-token-adapter-win container crashes with the following information in the logs:
2024/05/13 10:21:53 helpers.go:88: received event type ADDED
2024/05/13 10:21:54 cmd.go:132: error setting up port proxy rule: failed to assign IP to veth interface when executing command
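If anyone else wants to confirm whether they are hitting the same crashes, here is a minimal sketch of the kind of commands to inspect the restarts (the pod name is a placeholder; the addon pods normally run in kube-system):

```sh
# List the ama-logs-windows pods and their restart counts
kubectl get pods -n kube-system | grep ama-logs-windows

# Logs of the addon-token-adapter-win container from its previous (crashed) run
# (replace the pod name with one from the listing above)
kubectl logs ama-logs-windows-xxxxx -n kube-system -c addon-token-adapter-win --previous

# Last termination reason and exit code for each container in the pod
kubectl describe pod ama-logs-windows-xxxxx -n kube-system
```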
Do we have any ETA for the fix?
@tgoutham20 this will be included in our next release and ETA is end of June. I will update this thread once the fix is released
Thank you for the update @vdiec
Also noticed huge resource consumption for the AMA Metrics pods.
@david-garcia-garcia can you please create a separate issue for ama metrics pods?
The ama-logs-windows change is rolling out this month.
I"m not running windows nodes, but I am running perfect scale and my cluster CPU request is 4x more than it should be because of these daemonsets:
is there any way to adjust the cpu and memory requests and limits of those daemonsets....or the replica count (which seems to be set to 8?
@jason-berk-k1x For ama-logs, there is only 1 replica. The CPU and memory requests and limits are not adjustable, but we are working on integrating VPA to address this.
@aritraghosh can you address the other daemonsets?
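In case it helps others quantify how much these add-on daemonsets reserve on their own clusters, here is a rough sketch for dumping the declared CPU requests and limits (this assumes they run in kube-system; names can vary with the addon version):

```sh
# CPU requests/limits declared by each daemonset in kube-system
kubectl get ds -n kube-system -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.template.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.template.spec.containers[*].resources.limits.cpu'

# Then drill into any one of them (the name is a placeholder)
kubectl describe ds <daemonset-name> -n kube-system
```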
This has been fully rolled out.
Describe the bug
Enabling Azure Monitor on AKS with Windows node pools creates AMA containers with high CPU requests (900m).

To Reproduce
Enable Azure Monitor on an AKS cluster with a Windows node pool. The ama-logs-windows DaemonSet is created with the containers ama-logs-windows and addon-token-adapter-win:
- ama-logs-windows requests => 500m
- addon-token-adapter-win requests => 400m
Actual CPU usage is far lower:
- ama-logs-windows is about 40m (far from the 500m request)
- addon-token-adapter-win is about 106m (far from the 400m request)

Expected behavior
More reasonable CPU requests for ama-logs-windows and addon-token-adapter-win, including an explicit request for addon-token-adapter-win, as it currently does not define a request but only a limit, which is used as the request by k8s by design.

Screenshots
This is from the default ama-logs-windows DaemonSet yaml.
This is from Insights, looking at the max CPU usage/utilization.
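For anyone reproducing this, a rough sketch of how to compare the declared requests with the actual usage (this assumes metrics-server is available, which AKS ships by default):

```sh
# Actual per-container CPU usage of the Windows monitoring pods
kubectl top pods -n kube-system --containers | grep ama-logs-windows

# Declared CPU requests for the same containers, for comparison
# (addon-token-adapter-win may print nothing here because it only sets a limit,
#  which Kubernetes then also uses as the effective request)
kubectl get ds ama-logs-windows -n kube-system \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.requests.cpu}{"\n"}{end}'
```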
Environment (please complete the following information):
Additional context
This is happening only on Windows node pools; the Linux AMA containers request around 170m CPU, which is reasonable.
Side note (may not be related)
The Cluster Autoscaler was not able to scale down empty nodes running only kube-system pods while Azure Monitor was enabled and the ama-logs-windows containers were present, but once Azure Monitor was disabled and ama-logs-windows was removed, the autoscaler scaled down as expected. I don't know whether this is related or not.
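If someone wants to check whether the autoscaler is actually skipping those nodes, one rough way to look is via the events the cluster autoscaler emits; this is a generic sketch, and the managed AKS autoscaler may surface less detail than a self-hosted one:

```sh
# Recent autoscaler-related events across the cluster
kubectl get events -A --sort-by=.lastTimestamp | grep -i autoscaler

# Look at the Events section for a node that refuses to scale down
# (<node-name> is a placeholder for one of the idle Windows nodes)
kubectl describe node <node-name>
```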