smartaquarius10 opened this issue 1 year ago
Hello, we have the same problem with version 1.25.4 in our company AKS.
We are trying to upgrade an app to OpenJDK 17 to check whether this new LTS Java version mitigates the problem.
Edit: In our case, the .NET apps needed to change the NuGet package for Application Insights.
Greets,
Hello, we are facing the same problem of memory spikes after moving from v1.23.5 to v1.25.4. We had to increase the memory limits of most of the containers.
@miwithro @ritazh @Karishma-Tiwari-MSFT @CocoWang-wql @jackfrancis @mainred
Hello,
Extremely sorry for tagging you, but our whole non-prod environment is not working. We haven't upgraded our prod environment yet, but engineers are unable to work on their applications.
A few days back we approached customer support about node performance issues but did not get a good response.
I would be really grateful for help and support on this, as it seems to be a global problem.
I need to share one finding. I have just created 2 different AKS clusters, with v1.24.9 and v1.25.4, each with 1 node of Standard B2s.
These are the metrics. In the case of v1.25.4 there is a huge spike after enabling monitoring.
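For anyone who wants to reproduce the comparison, the two test clusters can be created roughly like this (a sketch; the resource group and cluster names are placeholders, not the exact commands I ran):

```bash
# Two single-node Standard_B2s clusters on the two AKS versions, with monitoring enabled
az aks create -g my-rg -n aks-test-124 --kubernetes-version 1.24.9 \
  --node-count 1 --node-vm-size Standard_B2s --enable-addons monitoring
az aks create -g my-rg -n aks-test-125 --kubernetes-version 1.25.4 \
  --node-count 1 --node-vm-size Standard_B2s --enable-addons monitoring

# Compare the baseline node memory usage on each cluster
az aks get-credentials -g my-rg -n aks-test-124 && kubectl top nodes
az aks get-credentials -g my-rg -n aks-test-125 && kubectl top nodes
```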
We've got the same problem with memory after upgrading AKS from version 1.24.6 to 1.25.4:
In the memory monitoring for the last month of one of our deployments, we can clearly see the memory usage increase after the update (01/23):
Hello, our cluster has D4s_v3 machines. We still haven't found any pattern, across all our Java and .NET pods, between the apps whose memory demand increased and those that didn't. One alternative to upgrading Java from 8 to 17, suggested by one of our providers, is to upgrade our VMs from D4s_v3 to D4s_v5, and we are studying the impact of this change.
Greets,
@xuanra , I think in that case B2s are totally out of the picture for this upgrade. The most they are capable of supporting is AKS 1.24.x.
@xuanra , My major pain point is these 2 pods out of the 9:
- ama-logs
- ama-logs-rs
They always take more than 400 Mi of memory. It's very difficult to accommodate them on B2s nodes.
My other pain point is these 16 pods (8 each):
- csi-azuredisk-node
- csi-azurefile-node
They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, or to suggest when or why we should keep them.
Still looking for a better solution to handle the non-prod environment...
Hi, @smartaquarius10 , thanks for the feedback. We have work planned to reduce the ama-logs agent memory footprint, and we will share the exact timelines and additional details of the improvements in early March. cc: @pfrcks
@ganga1980 @pfrcks
Thank you so much, Ganga. We are heavily impacted because of this. Up to AKS 1.24.x we were running 3 environments within our AKS, but after upgrading to 1.25.x we are unable to manage even 1 environment.
Each environment has 11 pods.
I would be grateful for your support on this. I have already disabled the CSI pods as we are not using any storage. For now, should we disable these ama monitoring pods as well?
If yes, then once your team resolves these issues, should we upgrade our AKS again to a specific version, or will Microsoft resolve it from the back end in every version of the AKS infrastructure?
Thank you
Kind Regards, Tanul
Hello @ganga1980 @pfrcks ,
Hope you are doing well. By any chance, is it possible to speed up the process a little? Two of our environments (22 microservices in total) are down because of this.
Appreciate your help and support in this matter. Thank you. Have a great day.
Hello @xuanra @cedricfortin @lsavini-orienteed, did you find any workaround for this? Thanks :)
Kind Regards, Tanul
Hi @smartaquarius10, we updated the k8s version of AKS to 1.25.5 this week and started suffering from the same issue.
In our case, we identified a problem with the JRE version when dealing with cgroups v2. Here I share my findings:
Kubernetes cgroups v2 support reached GA in version 1.25, and with this change AKS moved the node OS from Ubuntu 18.04 to Ubuntu 22.04, which uses cgroups v2 by default.
The problem with our containerized apps was related to a bug in JRE 11.0.14: this JRE had no support for cgroups v2 container awareness, which means the JVM was not able to respect the memory quotas defined in the deployment descriptor.
Oracle and OpenJDK addressed this by supporting cgroups v2 natively in JRE 17 and backporting the fix to JRE 15 and JRE 11.0.16+.
I've updated the base image to use a fixed JRE version (11.0.18) and the memory exhaustion was solved.
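If it helps, a quick way to check whether the JRE inside a pod is actually container-aware under cgroups v2 (the pod name is a placeholder) is:

```bash
# Print the JVM's view of the operating system / container.
# A cgroups-v2-aware JRE (17, or 11.0.16+) reports the pod's memory limit here;
# an older JRE reports the full node memory instead.
kubectl exec -it <java-pod> -- java -XshowSettings:system -version
```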
Regarding the AMA pods, I've compared the pods running on k8s 1.25.x with the pods running on 1.24.x, and in my opinion they look stable, as the memory footprint is essentially the same.
Hope this helps!
@gonpinho , Thanks a lot for sharing the details. But the problem is that our containerized apps are not taking extra memory; they are still occupying the same amount as they did before with 1.24.x.
What I realized is that when I create fresh 1.24.x and 1.25.x clusters, the default memory occupancy is approx. 30% higher on 1.25.x.
One of my environments takes only 1 GB of memory across 11 pods. With AKS 1.24.x I was running 3 environments in total. The moment I shifted to 1.25.x, I had to disable 2 environments, along with the Microsoft CSI add-ons, just to accommodate the 11 custom pods, because the node memory consumption is already high.
@gonpinho , If by any chance I could downgrade the OS back to Ubuntu 18.04, that would be my first preference. I know that the Ubuntu OS upgrade is what's killing the machines; no idea how to handle this.
Hi, we are facing the same problem after upgrading our dev AKS cluster from 1.23.12 to 1.25.5. Our company develops C/C++ and C# services, so we don't suffer from the JRE cgroups v2 issues. We see that memory usage increases over time, even though nothing but the kube-system pods is running on the cluster. The symptom is that kubectl top no shows much higher memory consumption than free on the host OS (Ubuntu 22.04). If we force the host OS to drop cached memory with sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches', the used memory doesn't change, but some of the buff/cache memory moves to free, and after that kubectl top no shows a memory usage drop on that node. We came to the conclusion that k8s counts buff/cache memory as used memory, which is wrong, because Linux will use free memory to buffer IO and other things, and that is completely normal operation.
kubectl top no before cache drop:
free before / after cache drop:
kubectl top no after cache drop:
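For reference, the sequence above looks roughly like this (the node name is a placeholder; free and the cache drop have to be run on the node itself, e.g. over SSH or a node debug session):

```bash
# Node memory usage as reported by the kubelet
kubectl top no <node-name>

# Memory as reported by the host OS (run on the node)
free -m

# Drop the page cache on the node, then compare both views again
sudo sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
free -m
kubectl top no <node-name>
```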
Team, we are seeing the same behaviour after upgrading the cluster from 1.23.12 to 1.25.5. All the microservices running in the clusters are .NET 3.1. On raising a support request, we got to know that the cgroup version has been changed to v2. Does anyone have a similar scenario? How do we identify whether cgroup v1 is used by .NET 3.1, and can it be the cause of the high memory consumption?
Hello @ganga1980, any update on this please? Thank you
> Hello @ganga1980, Any update on this please.. Thank you

@smartaquarius10 , We are working on rolling out our March agent release, which will bring down the memory usage of the ama-logs daemonset (Linux) by 80 to 100 MB. I don't have your cluster name or cluster resource ID to investigate, and we can't repro the issue you have reported. Please create a support ticket with the clusterResourceId details so that we can investigate. As a workaround, you can try applying the default configmap: kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml
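For reference, applying and then verifying that default configmap would look roughly like this (the configmap name is the one defined in that YAML, so treat it as an assumption):

```bash
# Apply the default ama-logs / Container Insights agent configuration
kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml

# Confirm it landed in kube-system
kubectl get configmap container-azm-ms-agentconfig -n kube-system
```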
@ganga1980 , Thank you for the reply. Just a quick question: after raising the support ticket, should I send a mail to your Microsoft ID with the support ticket details? Otherwise it will be assigned to L1 support, which will take a lot of time to reach a resolution.
Or else, if you allow, I can ping you my cluster details on MS Teams.
Whichever way you like 😃
Currently, the ama pods are taking approx. 326 Mi of memory per node.
@ganga1980, We already have this config map
@ganga1980 for the CSI driver resource usage: if you don't need the CSI drivers, you can disable them by following https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers#disable-csi-storage-drivers-on-a-new-or-existing-cluster
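Per that doc, disabling the CSI storage drivers on an existing cluster is roughly the following (cluster and resource group names are placeholders; check the linked page for the current flags):

```bash
# Disable the disk, file and snapshot CSI drivers on an existing cluster
az aks update -g my-rg -n my-cluster \
  --disable-disk-driver \
  --disable-file-driver \
  --disable-snapshot-controller
```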
Hi! It seems we are facing the same issue on 1.25.5. We upgraded a few weeks ago (24.02) and the memory usage (container working set memory) jumped from the moment of the upgrade, according to the metrics tab:
We are using Standard_B2s VMs, as this is an internal development cluster, and the CSI drivers are not enabled. Has the issue been identified, or is it still being investigated?
Same issue here after upgrading to 1.25.5. We are using FS2_v2 and we were not able to get the working set memory below 100%, no matter how many nodes we added to the cluster.
Very disappointing that all the memory on the node is used and reserved by Azure pods.
We had to disable Azure Insights in the cluster.
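For anyone else going that route, disabling the monitoring add-on (which removes the ama-logs pods) can be done with something like this (names are placeholders):

```bash
# Remove Container Insights / the ama-logs agents from the cluster
az aks disable-addons -g my-rg -n my-cluster --addons monitoring
```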
@vishiy, @saaror would you be able to assist?
Author: smartaquarius10
Assignees: -
Labels: `bug`, `azure/oms`, `addon/container-insights`
Milestone: -
@codigoespagueti @Marchelune Yeah, we are also planning to disable Azure Insights (the ama agent pods). However, we are first trying a few steps to get at least one more environment running, since not having at least 2 environments was badly hurting my team's productivity. For now, 2 of our 3 environments are working.
sync;echo 1 > /proc/sys/vm/drop_caches
Now we are waiting till the end of March, as @ganga1980's team is working on the ama agent pods. If it works, cool; otherwise we will disable the monitoring pods as well.
Kind Regards, Tanul
Same problem here. This is a single pod before and after the update, with the same codebase:
This might help some of you: Kubernetes 1.25 included an update to use the cgroups v2 API (cgroups is basically how Kubernetes passes resource settings to the containers).
When this happened on docker-desktop for me, the memory limits on containers simply stopped having any effect: if you asked the container about its memory, it would basically report the amount of system memory on the host.
My solution was to re-enable the deprecated cgroups v1 API and it all magically worked again...
As long as you are using a new enough Linux kernel, I believe cgroups v2 should work, but it didn't work for me and I'm yet to work out exactly why. I strongly suspect all these issues are related to the cgroups change; it DOESN'T only affect Java, as some people seem to believe, it's a Linux kernel thing.
Here is a link about the change: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/
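If you want to confirm which cgroup version your AKS nodes are actually on, one way is a node debug pod (a sketch; the node name and image are placeholders, and the node's filesystem is mounted under /host):

```bash
# Check the filesystem type of the cgroup mount on the node:
# "cgroup2fs" means cgroups v2 (Ubuntu 22.04 nodes), "tmpfs" means cgroups v1.
kubectl debug node/<node-name> -it --image=ubuntu -- \
  stat -fc %T /host/sys/fs/cgroup/
```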
@unluckypixie , Thanks for sharing. How do we enable that in AKS? Could you please share the details. Thank you.
Hi Team, we are also seeing high memory consumption after the AKS upgrade. Do we have any resolutions yet?
@unluckypixie, Could you please share the process of re-enabling cgroups v1?
Hey @ganga1980, Hope you are doing well.
Did you get any updates on the ama pods memory usage issue? Can we expect the memory footprint fix by the end of March?
Thank you
Kind Regards, Tanul
Can anyone confirm how much memory the ama-logs pods are taking on your AKS nodes? In my case it's 2911 Mi across 8 nodes, after excluding the logs of the ingress-controller namespace using the ama-logs configmap:
kubectl top pods -A|grep ama-|awk '{ print $4 }'|sed s/Mi//g|awk '{ sum+=$1 } END { print sum }'
@unluckypixie If possible, can you please describe the process to re-enable the cgroups v1 API?
@unluckypixie Please share how you managed to re-enable cgroups v1.
For those who are here because your Node.js application suffers from not being able to set its heap limit correctly with cgroup v2: you can work around it with --max-old-space-size. It will take a while before cgroup v2 support makes it into an LTS release, because it depends on a libuv release. For details see https://github.com/nodejs/node/issues/47259.
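For example, a rough sketch (the heap value is illustrative and should be sized against your container's memory limit):

```bash
# Cap the V8 heap explicitly instead of relying on cgroup detection,
# e.g. for a container with a 512Mi memory limit
node --max-old-space-size=384 server.js

# or via the environment, without changing the start command
export NODE_OPTIONS="--max-old-space-size=384"
node server.js
```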
@smartaquarius10 @PeterThomasAwen @maxkt
I might not have been clear: I've only managed to fix it in docker-desktop so far. If that is what you were asking about, you simply edit the settings.json and change:
"deprecatedCgroupv1": true
We are assessing updating our kubernetes to v1.25.5 and will let you know if we figure out the fix!
@unluckypixie.. Oh OK, thank you so much for sharing. Yeah, that might be possible in self-hosted Kubernetes; not sure though.
Even if it works there, I don't think it's possible in AKS, whose master nodes are managed by Microsoft. I'm just guessing, but it would be awesome if it could be done with AKS.
I'm also highly impacted by this issue and I wish I had seen this issue earlier.
@ganga1980, Any updates on this? So many people are impacted because of this Ubuntu upgrade. I would be grateful if you could expedite the ama-logs work; at least it will compensate to a certain extent. Thank you.
For Java, I've mitigated this by upgrading our images from JRE 11.0.13 to JRE 11.0.18. I'd say it is definitely related to https://bugs.openjdk.org/browse/JDK-8230305, which was backported to 11.0.16.
With JRE 11.0.13 (after 10 minutes running)
and with JRE 11.0.18 (after 10 minutes running)
Only the workload with the arrow was upgraded
@ganga1980 @pfrcks, Any updates on this ama pod memory footprint issue, please? Thank you.
Just FYI, as a remedy we added an extra JVM parameter, -Xmx, when starting the application (updated the Dockerfile and rebuilt the image). This solved our issue for the moment.
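If rebuilding the image is not convenient, a similar effect can be had by injecting the flag through an environment variable the JVM picks up on startup (a sketch; the deployment name and heap value are placeholders, and check that JAVA_TOOL_OPTIONS doesn't clash with how your image already builds its JVM arguments):

```bash
# Set an explicit heap cap on an existing deployment without rebuilding the image
kubectl set env deployment/<my-app> JAVA_TOOL_OPTIONS="-Xmx512m"
```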
I'm curious why some users are deploying Java to Kubernetes without setting a heap size.
Anyone mind to share their thinking?
> I'm curious why some users are deploying Java to Kubernetes without setting a heap size.
> Anyone mind to share their thinking?

We are setting both the Java heap and the resource limits in k8s. But after the upgrade to 1.25, the workloads started to use 2 or 3 times more memory. In our case, upgrading the JRE to version 11.0.18 (probably 11.0.16 would work as well) made our workloads use the same memory they used before the k8s upgrade.
> We are setting both, Java heap and resource limits in k8s. But, after the upgrade to 1.25, the workloads started to use 2 or 3 times more memory.

Can you share how you were setting the heap size before you upgraded to 11.0.18?
> We are setting both, Java heap and resource limits in k8s. But, after the upgrade to 1.25, the workloads started to use 2 or 3 times more memory.
>
> Can you share how you were setting the heap size before you upgraded to 11.0.18?

In the JAVA_OPTS environment variable, with "-Xmx".
So, what you are saying is that before 1.25, you were already setting heap size with -Xmx, and then you saw the memory consumption increase?
> So, what you are saying is that before 1.25, you were already setting heap size with -Xmx, and then you saw the memory consumption increase?

We always had the Xmx (before and with 1.25). When we upgraded k8s, a lot of workloads started to fail and restart because they wanted to use much more memory (2 or 3 times more). I saw the high memory consumption because we had to drastically increase the Xmx and the resource limits in k8s to make the workloads run.
After modifying their images to upgrade the JRE to 11.0.18, all of them are back to normal memory usage and working fine again (with the same Xmx and resource limits we had before the upgrade to 1.25).
@ganga1980 Any updates on the ama-logs pod memory footprint?
@shiva-appani-gep Maybe a bit late, but it seems .NET 3.1 uses cgroups v1, while any .NET > 5 uses cgroups v2 (.NET 6.0 being the current LTS).
I did a test with an app that was using .NET 3.1, changed it to .NET 6.0, and you can see the results.
In the image: above, the app with .NET 3.1 (memory consumption); below, the same app with .NET 6.0 (memory consumption).
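To check which cgroup version a given pod actually sees (a sketch; the pod name is a placeholder and the image is assumed to contain a shell):

```bash
# cgroups v2 exposes a unified hierarchy with a cgroup.controllers file;
# if it exists the container is on cgroups v2, otherwise it is on v1.
kubectl exec <dotnet-pod> -- sh -c \
  'test -f /sys/fs/cgroup/cgroup.controllers && echo "cgroup v2" || echo "cgroup v1"'
```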
Team,
Since the day I updated AKS to v1.25.2, I have been seeing huge spikes and node memory pressure issues.
Pods are going into the Evicted state and the nodes are always consuming 135 to 140% of memory. Until I was on 1.24.9, everything was working fine.
Just now I saw that portal.azure.com has removed the v1.25.2 version from the Create new -> Azure Kubernetes cluster section. Does this version of AKS have a problem? Should we immediately switch to v1.25.4 to resolve the memory issue?
I have also observed that AKS 1.24.x had `ubuntu 18` but AKS 1.25.x has `ubuntu 22`. Is this the reason behind the high memory consumption? Kindly suggest.
Regards, Tanul