smartaquarius10 opened this issue 1 year ago
Also running into this issue, I think, both for a Java Docker app and for a Debian Docker image. Has anyone already tried running the 1.26 version to see if it got fixed?
@anthonyAdhese cgroup v2 is enabled in any Kubernetes v1.25 and later. Is your application using Java? If so, you may need to upgrade the JRE. This is not something that will be fixed in Kubernetes, as it is only a problem when the application uses an old framework that is not compatible with cgroup v2.
Yeah, I was just going through this (huge) thread and saw some suggestions about the JRE, so I might give that a shot; I was mainly curious whether it was a bug only in 1.25 or not. Thanks for the info.
The funny thing is that we don't have this issue on 1.25.7 on GKE, so I was wondering whether it was an 'Azure'-exclusive bug or not.
Edit: I just saw that Google is still using cgroup v1 on that version, which might explain the difference.
Update for people coming here from the internet: bumping the JRE version did indeed fix the memory issues; we went from Java 11 to 17. In the end we switched out this Java component completely, but the fix did work in terms of memory.
Encountered the same issue.
Some have already pointed out that we can fix this by updating JVM parameters. So if your app is stuck on Java 8 (like mine), you can try these workarounds (a minimal sketch of applying them follows below):

- Set the `-Xmx` flag; this will prevent the JVM from continually increasing memory usage.
- Disable container support by adding `-XX:-UseContainerSupport` to your `JVM_ARGS` (notice the `-` sign).
- Remove from `JVM_ARGS` / `JAVA_TOOL_OPTIONS` flags like `-XX:+UseCGroupMemoryLimitForHeap` and `-XX:MaxRAMPercentage`, or any other parameter that relies on cgroups to set the heap size or CPU count.
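As a minimal sketch of applying one of these without rebuilding the image, the flags can be injected through `JAVA_TOOL_OPTIONS`. This is not an official fix; the deployment name and `-Xmx` value below are placeholders:

```bash
# Hypothetical example: cap the heap explicitly and disable container detection
# for a legacy Java 8 workload. Deployment name and memory value are placeholders.
kubectl set env deployment/my-legacy-java8-app \
  JAVA_TOOL_OPTIONS="-Xmx512m -XX:-UseContainerSupport"
```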
The comments in this ticket have turned into a Java containers issue, but it was created as a general problem affecting Azure core pods. In my case the main memory problem is with the pods used by Azure for generating Insights monitoring data. I had to disable it, and I still have to restart the nodes every week.
Has Microsoft fixed anything regarding it?
I also want to know if Microsoft has done anything about this - we are on an older AKS cluster with version 1.24.6, with small nodes, and cannot upgrade until this has been fixed.
@codigoespagueti Correct.
Please do not submit Java-related details. This issue is dedicated to the ama agent pods, which are consuming a lot of memory.
@ganga1980 @pfrcks Any updates on this?
@smartaquarius10 we have rolled out a couple of changes which help reduce the memory footprint of the ama-logs pods. Can you confirm which version of ama-logs you are on? `kubectl get pod`
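For anyone else checking this, one way to list the ama-logs image tags is sketched below; it assumes the pods run in `kube-system` with an `ama-logs` name prefix, which may differ per cluster:

```bash
# Print pod name and container image(s) for each ama-logs pod.
kubectl get pods -n kube-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep ama-logs
```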
@pfrcks This one mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.8
We are running on the same version of the container insights (ama-logs) - mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.8
@pfrcks - is there a changelog somewhere we can follow, to see what changes are being made? We are stuck on 1.24.6 at the moment, and cannot upgrade until this has been reliably fixed. The big issue with that is that the 1.24 version is going out of support soon.
So I actually found this page: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/
and upgrading our JDK 1.8 to update 372 seemed to have fixed it for us. Maybe it helps someone else. (Screenshot from the article omitted here.)
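If you want to verify which cgroup version a node is actually running before and after an upgrade, a quick check (plain Linux, nothing AKS-specific) is to look at the filesystem type mounted at `/sys/fs/cgroup`:

```bash
# Prints "cgroup2fs" on cgroup v2 nodes and "tmpfs" on cgroup v1 nodes.
stat -fc %T /sys/fs/cgroup/
```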
@smartaquarius10 @adejongh yes that is the version. Do you see any improvement in ama-logs pod resource usage?
Additionally, ama-logs pod resource consumption is a separate issue, unrelated to the JDK issue being discussed above. The ama-logs pod version is not dependent on the AKS version.
@pfrcks - I want to know whether I can safely upgrade our cluster to the 1.25.x versions with the new ama-logs version.
@adejongh as I mentioned above ama-logs version is independent of AKS version. My response was specifically for ama-logs pod resource consumption which is different from the AKS version issues.
We experience the same problem on AKS running Kubernetes 1.25.6. Note that the problem occurs in both a Microsoft SQL Server container and Prometheus, so it seems not to be related only to JVM-based applications. The cgroup settings match the specified Kubernetes resource limits. We use this image: kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202305.24.0
@pfrcks it's taking approximately 250MiB per pod. If that is normal then we can close the ticket.
Hello. I think I am running into a similar issue on a small cluster (v1.25.6) with one Standard_B2s node, which has 4GB RAM and 2 vCPUs.
The total memory currently consumed by all of my pods (`kubectl top pods --sum=true -A`) is 570Mi.
The total memory currently consumed by my node (`kubectl top no`) is 2164Mi, and this is reading as 100% of my available memory.
Per this article, even if I should only expect ~66% of my provisioned RAM to be usable (in this case, around 2.5GB), my pods are only consuming 570Mi, so what is consuming the other ~1.5GB of RAM here (out of the 2164Mi total)?
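One way to see where part of that gap goes is to compare the node's capacity and allocatable memory with what the kubelet reports as allocated; a rough sketch (where `<node-name>` is a placeholder) is:

```bash
# Total vs. schedulable memory on the node; the difference is the kube/system reservation.
kubectl get node <node-name> \
  -o jsonpath='{.status.capacity.memory}{"\t"}{.status.allocatable.memory}{"\n"}'
# Per-pod requests/limits and their totals as the scheduler accounts for them.
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
```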
Is there an ETA to address this issue? It's becoming very difficult to use AKS: 80% of the cluster's computing resources are used by support pods (logging and others) rather than the business APIs.
@pfrcks any update on the `ciprod` resource consumption fixes? We're in a similar boat to @chriscardillo: our nodes in `kubectl top no` are all reporting 120%+ memory consumption, with 2x A2s_v2 system pool nodes and 1 D2s_v2 worker node. The system pool nodes are the worst, with > 160% consumption.
This all appears to have started around the v1.25.6 upgrade and is constantly blowing up our alerts on node memory consumption.
@smartaquarius10 yes this is the expected usage at present. we are continuously working on driving down our resource usage.
Thanks everyone for your valuable feedback. As this thread has become quite lengthy, I have decided to close the issue and propose the following:

- For issues with `ama-logs` or other `kube-system` pods, please create a new GitHub issue with detailed information. This issue is likely unrelated to the upgrade, and we want to keep the issues separate.
- If none of the above suggestions work, please create a support case and provide all relevant details.

Thank you for your understanding.
So the solution is basically "pay us more monthly for VMs" after the upgrade to 1.25.x.
Got it.
@aritraghosh - how on earth can you close this issue? I need to know that I can safely upgrade our < 1.25 cluster to >= 1.25, and that it will still run. It is not really something I can downgrade once I find out it is no longer working. Our clusters are oldish and do not allow us to add node pools with larger nodes - so if we upgrade and this does not work... then it won't help to log an issue.
This is the memory usage in 1.24.6 - will it be the same in > 1.25?
@adejongh as mentioned above, ama-logs resource usage is not dependent on AKS version so yes it will be the same.
Moved my question to a separate issue: https://github.com/Azure/AKS/issues/3715
I am running very, very little on a single-node cluster and my memory is maxed out.
Closing the issue was a bad call. I do not think the main concern was the particular version of AKS, but rather that the system is using too many computing resources for the ama-logs, ama-logs-rs, and other support pods. This makes it impossible to use Standard_B2s nodes together with Container Insights.
As can be read in the first messages, the issues with the ama-logs pods started when updating to AKS 1.25. If nothing has been fixed in these Azure-owned pods, how can the issue be considered closed? Why does another issue need to be opened?
Closing valid issues is a bit weird. Please reopen.
Alright, we've reopened.
There are multiple issues being discussed here. Let me clarify:
1) [Known issue] If your application or its runtime depends on cgroups, then it might have been impacted by the cgroups change, resulting in OOM kills or higher memory usage. This is explained here: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/aks-memory-saturation-after-upgrade
2) There are also users reporting that AMA-logs is the culprit for the memory usage increase. I did look at our telemetry and I don't see a meaningful increase in AMA-logs memory usage across the AKS fleet or in the clusters that upgraded to 1.25. It is true that AMA-logs was using about 400MB by default a few months ago but there have been recent changes which have reduced the usage to about 250-300 MB. There is more work in progress to optimize this further.
3) It is possible that there are some changes in how memory is being reported by cadvisor now, because that data is coming from cgroups v2. We're still investigating, but we don't think this actually impacts usage; it just changes reporting.
If you indeed see an increase in memory usage of AMA-logs specifically, please post the details here. We suspect this may be something to do with memory usage data returned by cgroups v2 as opposed to v1, but we're happy to investigate.
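If you want to gather those details, a simple starting point (assuming the pods are named with an `ama-logs` prefix in `kube-system`) is per-container usage from the metrics API:

```bash
# Per-container memory usage of the ama-logs pods.
kubectl top pod -n kube-system --containers | grep ama-logs
```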
Just to note: we hit this issue and actually had our service pods crash-loop because they ran out of memory after the upgrade.
It is definitely worth noting this somewhere in the upgrade guidance, as it can cascade and cause real issues.
Are you talking about the first issue with Java and .NET? If so, where would you want us to document this best? I am guessing not everyone is aware of the tech doc we published, so open to feedback on where to put that.
To add to this, we didn't see this in the dev or stage environments we upgraded. I did some Java tracing and noticed a difference between two clusters on the same version:
Both clusters are on 1.25.6 and have matching kernel versions, OS images, and container runtimes. The only difference is that the working one is in the WestUS2 region, and the non-working one is in WestEurope2.
A cluster we upgraded today:
NOTE: Picked up JDK_JAVA_OPTIONS: -Xlog:os+container=trace -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=50.0 -XX:+UseG1GC
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][trace][os,container] No relevant cgroup controllers mounted.
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)
A cluster we upgraded last week:
NOTE: Picked up JDK_JAVA_OPTIONS: -Xlog:os+container=trace -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=50.0 -XX:+UseG1GC
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.000s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
[0.001s][trace][os,container] Raw value for CPU quota is: max
[0.001s][trace][os,container] CPU Quota is: -1
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
[0.001s][trace][os,container] CPU Period is: 100000
[0.001s][trace][os,container] OSContainer::active_processor_count: 8
[0.001s][trace][os,container] total physical memory: 33665449984
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
[0.001s][trace][os,container] Raw value for memory limit is: 536870912
[0.001s][trace][os,container] Memory Limit is: 536870912
[0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 8
[0.012s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 8
[0.015s][trace][os,container] Path to /memory.current is /sys/fs/cgroup//memory.current
[0.015s][trace][os,container] Memory Usage is: 226570240
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)
- There are also users reporting that AMA-logs is the culprit for the memory usage increase. I did look at our telemetry and I don't see a meaningful increase in AMA-logs memory usage across the AKS fleet or in the clusters that upgraded to 1.25. It is true that AMA-logs was using about 400MB by default a few months ago but there have been recent changes which have reduced the usage to about 250-300 MB. There is more work in progress to optimize this further.
@seguler - does this memory usage for 'ama-logs' seem normal?
This is actually on AKS 1.24.6 - here is the image info:
Yes, it looks normal (based on telemetry I am looking at). This pod captures metrics and logs for Container Insights. You can see it requests 325Mi in memory and uses about that much.
I've read through https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/aks-memory-saturation-after-upgrade but I'm still struggling to understand what actions to take and how I can make sure that the cgroups API is the cause. On a side note, I've upgraded our (internal) cluster to 1.26; the memory usage of each node seems to be down by around 8% (ish), but it is still above what 1.24 used to provide.
What I'd like to identify is where the memory overhead comes from. If I sum the memory footprint of all pods from `kubectl top pod --sum=true -A`, I get 3869Mi. But then `kubectl top node` returns:
| NAME | CPU(cores) | CPU% | MEMORY(bytes) | MEMORY% |
| --- | --- | --- | --- | --- |
| aks-akspool1-44018104-vmss00000r | 186m | 9% | 4607Mi | 74% |
| aks-akspool1-44018104-vmss00000s | 156m | 8% | 4472Mi | 72% |
| aks-akspool1-44018104-vmss00000t | 145m | 7% | 4426Mi | 71% |
Of course one can expect an overhead for each node, but I'm not K8s-skilled enough to identify when that overhead becomes unreasonable; one of my suspicions is that this overhead went up after the 1.25 upgrade.
To try to confirm, I've tested the following:
- `kubectl top` before the upgrade
- `kubectl top` after the upgrade
So, before the upgrade, e.g. in 1.24.10, I got
| NAME | CPU(cores) | CPU% | MEMORY(bytes) | MEMORY% |
| --- | --- | --- | --- | --- |
| aks-agentpool-54712384-vmss000000 | 378m | 19% | 1128Mi | 52% |
| aks-agentpool-54712384-vmss000001 | 118m | 6% | 1092Mi | 50% |
NAMESPACE  NAME  CPU(cores)  MEMORY(bytes)
default  azure-vote-back-7cd69cc96f-lbqmz  2m  14Mi
default  azure-vote-front-7c95676c68-b8f4d  2m  51Mi
kube-system  azure-ip-masq-agent-48p8k  1m  12Mi
kube-system  azure-ip-masq-agent-p5hkm  1m  13Mi
kube-system  cloud-node-manager-hh5r4  1m  14Mi
kube-system  cloud-node-manager-xkbq6  1m  14Mi
kube-system  coredns-589487654b-c94bh  2m  18Mi
kube-system  coredns-589487654b-qlsjq  2m  17Mi
kube-system  coredns-autoscaler-5866788c6c-gzs79  1m  7Mi
kube-system  csi-azuredisk-node-84ftc  2m  42Mi
kube-system  csi-azuredisk-node-rqd8c  2m  41Mi
kube-system  csi-azurefile-node-226g5  2m  40Mi
kube-system  csi-azurefile-node-pfqds  2m  39Mi
kube-system  konnectivity-agent-cdcdf754f-fxbtb  1m  11Mi
kube-system  konnectivity-agent-cdcdf754f-vzdfk  2m  12Mi
kube-system  kube-proxy-d8wqd  1m  20Mi
kube-system  kube-proxy-rdpwg  1m  18Mi
kube-system  metrics-server-564bfb87fd-dht67  3m  40Mi
kube-system  metrics-server-564bfb87fd-h78bf  3m  36Mi
________  ________
20m  467Mi
And after the upgrade, e.g. in 1.25.6:
| NAME | CPU(cores) | CPU% | MEMORY(bytes) | MEMORY% |
| --- | --- | --- | --- | --- |
| aks-agentpool-54712384-vmss000000 | 117m | 6% | 1695Mi | 78% |
| aks-agentpool-54712384-vmss000001 | 110m | 5% | 1227Mi | 57% |
NAMESPACE  NAME  CPU(cores)  MEMORY(bytes)
default  azure-vote-back-7cd69cc96f-56t2b  2m  7Mi
default  azure-vote-front-7c95676c68-2x8pc  1m  43Mi
kube-system  azure-ip-masq-agent-lfp5v  1m  13Mi
kube-system  azure-ip-masq-agent-zg2c7  1m  13Mi
kube-system  cloud-node-manager-2fnv2  1m  20Mi
kube-system  cloud-node-manager-rvxcj  1m  18Mi
kube-system  coredns-589487654b-p6sxv  2m  27Mi
kube-system  coredns-589487654b-smhj8  2m  16Mi
kube-system  coredns-autoscaler-5866788c6c-t8xnt  1m  3Mi
kube-system  csi-azuredisk-node-5ldsq  1m  47Mi
kube-system  csi-azuredisk-node-9d6gl  2m  51Mi
kube-system  csi-azurefile-node-rnr4z  1m  45Mi
kube-system  csi-azurefile-node-sjvkz  1m  49Mi
kube-system  konnectivity-agent-5b9f455564-f2b86  1m  13Mi
kube-system  konnectivity-agent-5b9f455564-g2b7h  2m  16Mi
kube-system  kube-proxy-b5jgk  1m  25Mi
kube-system  kube-proxy-c2bfn  1m  26Mi
kube-system  metrics-server-564bfb87fd-tm8c4  3m  26Mi
kube-system  metrics-server-564bfb87fd-v2l5x  3m  24Mi
________  ________
17m  493Mi
Given the nature of B2s VMs, it's finicky to rely on an instantaneous memory metric, so any pointers on how to analyse this better are welcome! I've also looked at the memory working set in the portal, and I can see the increased memory usage there as well.
I expect that B2s nodes are particularly affected by this memory issue, since they "only" offer 4GiB of RAM, so I guess it's easier to hit OOM limits. Our production workloads don't run on them, but B2s nodes are quite valuable for running internal clusters with very low usage at a reasonable cost.
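On the point above about instantaneous metrics being finicky, one low-tech way to smooth them out (the log file name is just an example) is to sample `kubectl top node` periodically and compare the trend before and after an upgrade:

```bash
# Record node-level usage every 60 seconds; stop with Ctrl-C.
while true; do
  date
  kubectl top node
  sleep 60
done | tee -a node-memory.log
```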
To build on my previous comment, I did another test where I created a "bare" 1.24.10 cluster and, after a while, upgraded it to 1.25 and then 1.26. I didn't deploy anything to it; I just looked at the reported memory working set in the portal.
Edit: this is running with 3 B2s VMs, and I am running the clusters in the West Europe region.
Thank you all for your contributions; here are my solutions.
I will divide this into 2 parts:
First part:
Since we updated the development environment to version 1.25.3 we have seen the memory increase problem, and we found that Java workloads were the most affected. We opened a support ticket with Microsoft and their response was: "We don't understand what is happening, in our environments everything worked very well." Hahaha, what a funny response.
Finally, after hours of searching we found the cgroup v2 issue, so we decided to move the machines back to cgroup v1 with the following procedure:
vi /etc/default/grub
Update these lines:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
Update the OS GRUB configuration:
update-grub
Restart.
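The steps above, collected into a single sketch (run on each affected node; as the note below says, modifying the node OS this way is unsupported):

```bash
# Force systemd back to the legacy cgroup v1 hierarchy, then reboot the node.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub
sudo sed -i 's/^GRUB_CMDLINE_LINUX=.*/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub
sudo update-grub
sudo reboot
```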
NOTE: Remember that by doing this, Microsoft will immediately tell you they cannot support the OS because it has been modified.
It is important to make clear that the best solution is to update your projects to a version of Java, .NET, or Node.js that supports cgroup v2. In our case we have more than a thousand microservices, and it is very complex to orchestrate all of that with the development team.
Second part.
When we updated to version 1.25.5 we already had the workaround for this problem, but we realized that in this version the first solution was not 100% effective.
The nodes gradually increased their memory usage until they died.
After another week with this problem, we found that containerd defaults to using cgroup v2, so we patched it so that it didn't use it.
The procedure is this:
Edit the /etc/containerd/config.toml file.
Change the parameter:
SystemdCgroup = true to SystemdCgroup = false
Restart containerd and the kubelet.
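A sketch of that second workaround as commands, assuming containerd's standard config path (`/etc/containerd/config.toml`); again, this modifies the node and is unsupported:

```bash
# Flip containerd's SystemdCgroup setting as described above, then restart services.
sudo sed -i 's/SystemdCgroup = true/SystemdCgroup = false/' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl restart kubelet
```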
This workaround helped us; our nodes are now working normally.
I hope these workarounds help you. I reiterate that the best solution, and the way to avoid the risk of losing support from Microsoft, is to update Java, .NET, and Node.js to a version supported by cgroup v2.
Greetings.
It seems we have the same issue. We cannot try the solution mentioned by @jjader11: we have non-prod clusters shutting down every day to save cost, so every restart would remove the settings, and updating the images to the latest versions might take some time. We have had a ticket open with Microsoft for 5 days, and it's a big mess-up on the AKS side.
https://github.com/Azure/AKS/issues/3715#issuecomment-1610339833
you could potentially confirm this is the issue by checking `/proc/meminfo` and comparing with my math in that comment - it would definitely help validate or reject the theory.
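If you don't have SSH access to the nodes, one way to grab `/proc/meminfo` from a node is an ephemeral debug pod; `<node-name>` and the busybox image here are placeholders:

```bash
# Runs a throwaway pod on the node; /proc/meminfo is node-wide, so this reflects the host.
kubectl debug node/<node-name> -it --image=busybox -- cat /proc/meminfo
```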
A cadvisor/libcontainer fix would likely be needed, if that theory is correct.
Trust me, after trying every path I finally created a new cluster with heavier E-series machines. Memory usage is still a problem, but what can you do? If the cloud service providers had added some validation, for example that whoever is using light machines like the B series should not be allowed to upgrade to the 1.25.x series, a lot of time and effort could have been saved.
I don't think it would be a difficult task for the cloud providers to figure out which customer is using which node SKU in AKS.
It's been 1.5 months that I have been working on a migration, which is an unnecessary effort caused purely by this issue.
In my scenario I have two node pools: one with heavy machines for the apps and the other, a B-series pool, for support workloads.
And it is exactly the B-series pool whose memory usage keeps going through the roof, even though in theory it isn't running anything other than the default AKS resources.
@motizukilucas trust me, if you want an immediate solution: add a new node pool with slightly larger machines, transfer the apps from the B-series pool to the new node pool, and then delete the old pool. But make sure to do that when no one is using the apps in that cluster, as SNAT exhaustion can happen unless you are using a NAT gateway.
A fix is being discussed here: https://github.com/kubernetes/kubernetes/issues/118916
We don't expect this to be resolved soon in the K8s 1.25 version. Users who upgrade to 1.25 will observe higher reported memory utilization (400-500MB in our tests on idle nodes). This shouldn't impact most users as this is an increase in what's reported. However, if your nodes were close to MemoryPressure, then you may likely observe pod evictions related to it because of the accounting problem. If you do, we recommend you increase requested memory and scale up your nodepools accordingly until a fix is delivered.
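For reference, the two mitigations mentioned above might look like this; resource group, cluster, node pool, deployment names, and values are all placeholders:

```bash
# Give the node pool more headroom until the accounting fix lands.
az aks nodepool scale --resource-group myRG --cluster-name myCluster \
  --name nodepool1 --node-count 4
# Raise memory requests on workloads that were sized close to the old reported usage.
kubectl set resources deployment my-app --requests=memory=512Mi
```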
@seguler scaling up the nodes is not possible for everyone, because it is highly dependent on subnet space.
But I would request Microsoft to add some validation to disable the use of B2s and D2s machines (i.e. 4 to 7 GB of RAM) if the user selects a version >= 1.25.x, especially if the subnet CIDR is /24.
I am facing this challenge with D2s in production with this CIDR. Just a suggestion. Thank you.
Can anyone confirm whether the 1.24.x version of AKS is still on cgroup v1, or whether it also has cgroup v2? I would be grateful if someone could confirm as early as possible, otherwise our production cluster will be impacted.
Hi,
No - cgroup v2 comes with Ubuntu 22.04, and as I remember Kubernetes 1.24.x runs on Ubuntu 20.04, so it is still on cgroup v1.
Thanks
@jjader11: I found an example supported by Microsoft itself: https://github.com/Azure/AKS/tree/master/examples/cgroups
Important note: please apply this to one node pool at a time (via node selectors or node affinity rules), as it will otherwise reboot all nodes at once if you don't specify how to roll it out.
Somebody mentioned "SNAT exhaustion" when creating a new node pool; I don't know if this is something we need to watch out for. I don't want to spend time understanding SNAT exhaustion, so I will apply the DaemonSet one node at a time, using kubectl taint, kubectl label, and restarting deployments to make sure nothing is running on the node I plan to roll the change out to (see the sketch below).
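A sketch of that one-node-at-a-time rollout; the taint and label keys here are made up for illustration, and the label the DaemonSet actually selects on depends on how you deploy the example above:

```bash
# Keep new pods off the node while it is flipped to cgroup v1.
kubectl taint nodes <node-name> cgroup-revert=pending:NoSchedule
# Restart deployments so their pods reschedule onto other nodes.
kubectl rollout restart deployment <deployment-name>
# Label the node so the cgroup-revert DaemonSet targets it (match its nodeSelector).
kubectl label nodes <node-name> cgroup-version=v1
# Once the node has rebooted into cgroup v1, remove the taint to allow scheduling again.
kubectl taint nodes <node-name> cgroup-revert=pending:NoSchedule-
```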
An alternative to forcing cgroup v1 via a DaemonSet is to create an agent pool with `osSKU` set to `AzureLinux` or `CBLMariner`.
CBL-Mariner v1 and v2 have `systemd.unified_cgroup_hierarchy=0` in `/boot/systemd.cfg`. See also the change comment here: https://github.com/microsoft/CBL-Mariner/blob/2.0/SPECS/systemd/systemd-bootstrap.spec#L292
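For reference, adding such a node pool with the Azure CLI might look like the sketch below; the names and count are placeholders, and the accepted `--os-sku` values should be checked against the current docs:

```bash
# Add an Azure Linux (CBL-Mariner based) node pool alongside the existing Ubuntu pools.
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myCluster \
  --name marinerpool \
  --os-sku AzureLinux \
  --node-count 2
```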
Team,
Since the day I updated AKS to v1.25.2, I have seen huge spikes and node memory pressure issues.
Pods are going into the Evicted state and nodes are constantly consuming 135 to 140% of memory. Until I was on 1.24.9, everything was working fine.
Just now, I saw that portal.azure.com has removed the v1.25.2 version from the Create new --> Azure Kubernetes cluster section. Does this version of AKS have a known problem? Should we immediately switch to v1.25.4 to resolve the memory issue?
I have also observed that the AKS 1.24.x versions had Ubuntu 18 but the AKS 1.25.x versions have Ubuntu 22. Is this the reason behind the high memory consumption? Kindly suggest.
Regards, Tanul