ChrisJD-VMC opened 3 months ago
Started at 24 MB for the config-reader 4 days ago; today it's at 600 MB:
We have a similar issue. The memory usage keeps growing and then drops suddenly.
Hi - this is a known issue that we will be rolling out a fix for.
Hi @vishiy Do we have any ETA for the fix? Thanks.
Hi @vishiy
Is there any way to free the memory of the ama-metrics-operator-targets pod, such as manually killing the pod?
Would killing the pod have any impact on the AKS cluster?
@akari-m, I opened a support ticket for this issue. The support engineer had me delete all the pods whose names start with ama-*. The pods were recreated automatically, we reclaimed over 2 GiB of memory, and there was no impact to our application pods…
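For anyone wanting to try the same workaround, here is a minimal sketch of what that amounts to (assuming the managed Prometheus pods live in kube-system and your kubectl context points at the affected cluster):

# List the ama-* pods first to see what will be restarted
kubectl get pods -n kube-system --no-headers -o custom-columns=':metadata.name' | grep '^ama-'
# Delete them; the owning Deployments/DaemonSets recreate them automatically
kubectl get pods -n kube-system --no-headers -o custom-columns=':metadata.name' | grep '^ama-' \
  | xargs kubectl delete pod -n kube-system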
Happened today to our production cluster as well. CC @vishiy.
In general it seems like Azure Monitor for AKS is not in good shape. It's a very convenient one-click deployment, but the stability and quality of the setup is pretty damn low for a paid product.
This started happening to us on 16th of August 2024 (see chart), both in WestEurope and WestUS clusters, all by itself. We are on 1.28.9 AKS version.
The leak causes node reboots for us, and considering this component is running on the system nodepool, it leads to all sorts of bad effects.
@vishiy is there any ETA for the fix you mentioned, or is there anything else we can do to resolve the issue?
EDIT: I also note that the targetallocator container in the ama-metrics-operator-targets deployment has an 8Gi memory limit and 5 cores. Surely these cannot be reasonable numbers?
containers:
  - name: targetallocator
    image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.9.0-main-07-22-2024-2e3dfb56-targetallocator
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "5"
        memory: 8Gi
      requests:
        cpu: 10m
        memory: 50Mi
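If you want to check what your own cluster has configured for that container, something like this should print it (a sketch; it assumes the deployment name and namespace match what is shown above):

kubectl get deployment ama-metrics-operator-targets -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="targetallocator")].resources}'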
We have the same issue.
Hello @vishiy. We have the same issue. Please assist.
This issue also causes OOMkills for us.
Our temporary solution was to create a cronjob that restarts the operator every day, so that it doesn't consume a lot of memory and cause issues on the nodes.
Please post an update here once it is resolved.
It has happened again, exactly 7 days after last time. I don't want this to become a weekly event in my job, so I've also created a cronjob to kill the ama-metrics-* pods.
Again, not what I'd expect from a commercial product.
Can someone post a 1-liner kubectl command to create this cron job as a temporary workaround? ;)
Cronjob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-operator-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kill-pod
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
                  kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure
If you want to check the Cronjob deployment:
kubectl create job test-job --from=cronjob/kill-ama-operator-cj -n kube-system
If your job fails with a kubectl error, you probably need to add a ServiceAccount.
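Since a literal one-liner was asked for: something along these lines should create roughly the same CronJob imperatively (an untested sketch; it relies on the default service account in kube-system being allowed to delete pods, so the RBAC objects in the next comment may still be needed):

kubectl create cronjob kill-ama-operator-cj -n kube-system \
  --image=bitnami/kubectl:latest --schedule="0 6 * * *" \
  -- /bin/sh -c 'kubectl delete pod -n kube-system -l rsName=ama-metrics-operator-targets'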
Can confirm this is happening on our clusters as well.
Hi - this is a known issue that we will be rolling out a fix for.
Is this ever going to be fixed, or will it be my fever dream forever?
Hi, same issue here, 3 weeks on now. Can we have an estimated date for the fix? @vishiy
@twanbeeren, I took the liberty of extending your version:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: kube-system
  name: kill-ama-metrics-operator-targets-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa-binding
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: kill-ama-metrics-operator-targets-cj-sa
    namespace: kube-system
roleRef:
  kind: Role
  name: kill-ama-metrics-operator-targets-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-metrics-operator-targets-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: kill-ama-metrics-operator-targets-cj-sa
          containers:
            - name: kill-pod
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
                  kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure
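To deploy and verify, something like this should work (a sketch; the file name kill-ama-operator-cronjob.yaml is just an example for the manifest above):

# Apply the ServiceAccount, Role, RoleBinding and CronJob
kubectl apply -f kill-ama-operator-cronjob.yaml
# Trigger a one-off run to confirm the RBAC and label selector work
kubectl create job test-job --from=cronjob/kill-ama-metrics-operator-targets-cj -n kube-system
kubectl logs -n kube-system job/test-job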
The fix for this is rolling out currently. It should roll out to all regions by 09/30.
The fix for this is rolling out currently. It should roll out to all regions by 09/30. I am still facing this issue.
@sivashankaran22 - Could you provide me your cluster id?
The fix for this is rolling out currently. It should roll out to all regions by 09/30.
What is the version of the fix that you are rolling out, so we can check that it has been deployed?
We still see very high memory use for all the "ama-" pods, which is really not acceptable.
Describe the bug
I don't know if this is the correct place for this; if it's not, please advise where to direct this issue. tl;dr: ama-metrics-operator-targets seems to have a memory leak (I assume it's not designed to slowly consume more and more RAM).
I got alerts from both AKS clusters I run this morning that a container in each had been OOM killed. Some investigation revealed that the containers in question were both the ama-metrics-operator-targets (Azure Managed Prometheus monitoring related, to my understanding).
Looking at the memory usage for those containers in Prometheus, I can see a ramp-up in memory usage over the course of probably a bit more than a week, followed by the containers being killed at about 2 GB of RAM usage. The memory use then drops back to 60-70 MB and starts climbing again.
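For a quick spot check outside Prometheus, something like this should show the current usage of the operator pod's containers (a sketch; it assumes metrics-server is available and uses the rsName label mentioned elsewhere in this thread):

kubectl top pod -n kube-system -l rsName=ama-metrics-operator-targets --containers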
This is the first time this has happened. We've been using Azure Managed Prometheus for about 3 months. Given the rate the RAM usage is increasing at, I assume some kind of new issue is causing this, probably introduced in the last couple of weeks. We have not made any changes to either cluster's configuration for several months, and one of the clusters hasn't had any container changes deployed by us for 3 months. Both are configured to auto-update for minor cluster versions.
To Reproduce
Steps to reproduce the behavior: I assume just having a cluster configured with Prometheus monitoring is enough.
Expected behavior
ama-metrics-operator-targets container RAM usage does not continuously grow over time.
Screenshots
[Screenshots omitted: 7 days ago, last night, after the OOM kill occurred, climbing again]
Environment (please complete the following information):
- CLI Version: 2.62.0
- Kubernetes version: 1.29.7 and 1.30.3
- Browser: Chrome
Additional Info
Clusters are in two different regions, connected using AMPLS to the same Azure Monitor Workspace. One Azure Managed Prometheus instance is connected to the workspace. Data still appears to be collected and can be viewed fine in Prometheus.