Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

ama-metrics-operator-targets consuming more and more cluster memory #4509

Open ChrisJD-VMC opened 3 months ago

ChrisJD-VMC commented 3 months ago

Describe the bug I don't know if this is the correct place for this; if it's not, please advise where to direct this issue. tl;dr: ama-metrics-operator-targets seems to have a memory leak (I assume it's not designed to slowly consume more and more RAM).

I got alerts from both AKS clusters I run this morning that a container in each had been OOM killed. Some investigation revealed that the containers in question were both the ama-metrics-operator-targets (Azure Managed Prometheus monitoring related, to my understanding).

Looking at those containers in Prometheus, I can see memory usage ramping up over the course of a bit more than a week, followed by the containers being killed at about 2 GB. Usage then drops back to 60-70 MB and starts climbing again.
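
For reference, current usage can also be spot-checked with kubectl (the rsName label here is an assumption based on the pod's ReplicaSet name, so adjust it if it differs on your cluster; metrics-server is available by default on AKS):

# Show per-container memory usage of the ama-metrics-operator-targets pod
kubectl top pod -n kube-system -l rsName=ama-metrics-operator-targets --containers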

This is the first time this has happened. We've been using Azure Managed Prometheus for about 3 months. Given the rate the RAM usage is increasing at, I assume some new issue, probably introduced in the last couple of weeks, is causing this. We have not made any changes to either cluster's configuration for several months, and one of the clusters hasn't had any container changes deployed by us for 3 months. Both are configured to auto-update for minor cluster versions.

To Reproduce Steps to reproduce the behavior: I assume just having a cluster configured with Prometheus monitoring is enough.

Expected behavior ama-metrics-operator-targets container RAM usage does not continuously grow over time.

Screenshots (omitted): 7 days ago; last night; after the OOM kill occurred; climbing again.

Environment (please complete the following information): CLI version - 2.62.0; Kubernetes version - 1.29.7 and 1.30.3; Browser - Chrome

Additional Info Clusters are in two different regions, connected using AMPLS to the same Azure Monitor Workspace, with one Azure Managed Prometheus instance connected to the workspace. Data still appears to be collected and can be viewed fine in Prometheus.

boyko11 commented 2 months ago

The config-reader container started at 24 MB four days ago; today it is at 600 MB. (screenshots omitted)

akari-m commented 2 months ago

We have a similar issue. The memory usage keeps growing and then drops suddenly.

vishiy commented 2 months ago

Hi - this is a known issue that we will be rolling out a fix for.

JoeyC-Dev commented 2 months ago

Hi @vishiy, do we have any ETA for the fix? Thanks.

akari-m commented 2 months ago

Hi @vishiy

Is there any way we could free the memory of the ama-metrics-operator-targets pod, such as manually killing it?

Will killing the pod have any impact on the AKS cluster?

boyko11 commented 2 months ago

Hi @vishiy

Is there any way we could free the memory of the ama-metrics-operator-targets pod, such as manually killing it?

Will killing the pod have any impact on the AKS cluster?

@akari-m, I opened a support ticket for this issue. The support engineer had me delete all the pods whose names start with ama-*. The pods were recreated automatically, we reclaimed over 2 GB of memory, and there was no impact to our application pods…
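
For reference, roughly what that looked like as a command (it just matches kube-system pods by the ama- name prefix, so double-check the list before deleting):

# List kube-system pods whose names start with ama- and delete them;
# the owning Deployments/DaemonSets recreate them immediately.
kubectl get pods -n kube-system --no-headers -o custom-columns=:metadata.name \
  | grep '^ama-' \
  | xargs -r -n1 kubectl delete pod -n kube-system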

iakovmarkov commented 2 months ago

Happened today to our production cluster as well. CC @vishiy.

In general it seems like Azure Monitor for AKS is not in good shape. It's a very convenient one-click deployment, but the stability and quality of the setup are pretty damn low for a paid product.

ppanyukov commented 2 months ago

This started happening to us on 16 August 2024 (see chart), in both our WestEurope and WestUS clusters, all by itself. We are on AKS version 1.28.9.

The leak causes node reboots for us, and since this component is running on the system node pool, that leads to all sorts of bad effects.

@vishiy is there any ETA for the fix you mentioned, or is there anything else we can do to resolve the issue?


EDIT: I also note that the targetallocator container in the ama-metrics-operator-targets deployment has an 8Gi memory limit and 5 CPU cores. Surely these cannot be reasonable numbers?

containers:
  - name: targetallocator
    image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.9.0-main-07-22-2024-2e3dfb56-targetallocator
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "5"
        memory: 8Gi
      requests:
        cpu: 10m
        memory: 50Mi

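If you want to confirm what is configured on your own cluster, the container's resources can be read straight off the deployment (deployment and container names as shown above):

# Print the requests/limits configured for the targetallocator container
kubectl get deployment ama-metrics-operator-targets -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="targetallocator")].resources}'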

adejongh commented 2 months ago

We have the same issue.

AkariH commented 2 months ago

Hello @vishiy. We have the same issue. Please assist.

twanbeeren commented 2 months ago

This issue also causes OOMkills for us.

Our temporary solution was to create a cronjob that restarts the operator every day so that it doesn't consume a lot of memory and cause issues on the nodes.

Please notify here once it is resolved.

iakovmarkov commented 2 months ago

It has happened again, exactly 7 days after last time. I don't want this to become a weekly event in my job, so I've also created a cronjob to kill the ama-metrics-* pods.

Again, not what I'd expect from a commercial product.

deyanp commented 2 months ago

Can someone post a 1-liner kubectl command to create this cron job as a temporary workaround? ;)

twanbeeren commented 2 months ago

Cronjob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-operator-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kill-pod
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
              kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure

If you want to test the CronJob manually: kubectl create job test-job --from=cronjob/kill-ama-operator-cj -n kube-system
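
To see whether that test run actually worked, something like this should do (test-job being the job name from the command above):

# Check the job status and the kubectl output/errors from its pod
kubectl get job test-job -n kube-system
kubectl logs job/test-job -n kube-system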

If your job fails with a kubectl error, you probably need to add a service account.
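
For completeness, the closest thing to the one-liner @deyanp asked for is probably kubectl create cronjob, but note it runs under the namespace's default service account, which normally cannot delete pods, so the service account/RBAC still has to be added separately:

# Rough sketch only: same daily schedule, deletes the operator pod by label
kubectl create cronjob kill-ama-operator-cj -n kube-system \
  --image=bitnami/kubectl:latest \
  --schedule="0 6 * * *" \
  -- /bin/sh -c 'kubectl delete pod -n kube-system -l rsName=ama-metrics-operator-targets'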

antiphon0 commented 2 months ago

Can confirm this is happening on our clusters as well.

shiroshiro14 commented 2 months ago

Hi - this is a known issue that we will be rolling out a fix for.

Is this ever going to be fixed, or will it be my fever dream forever?

martindruart commented 2 months ago

Hi, same issue for 3 weeks now. Can we have an estimated date for the fix? @vishiy

deyanp commented 2 months ago

@twanbeeren, I took the liberty of extending your version:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: kube-system
  name: kill-ama-metrics-operator-targets-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa-binding
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: kill-ama-metrics-operator-targets-cj-sa
    namespace: kube-system
roleRef:
  kind: Role
  name: kill-ama-metrics-operator-targets-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-metrics-operator-targets-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: kill-ama-metrics-operator-targets-cj-sa
          containers:
          - name: kill-pod
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
              kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
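
To use it, save the manifests to a file (the file name below is just an example) and apply them, then optionally trigger a run right away to confirm the RBAC works:

# Apply the ServiceAccount, Role, RoleBinding and CronJob, then run it once manually
# (the job name kill-ama-test is arbitrary)
kubectl apply -f kill-ama-metrics-operator-targets.yaml
kubectl create job kill-ama-test --from=cronjob/kill-ama-metrics-operator-targets-cj -n kube-system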

rashmichandrashekar commented 2 months ago

The fix for this is rolling out currently. It should roll out to all regions by 09/30.

sivashankaran22 commented 1 month ago

The fix for this is rolling out currently. It should roll out to all regions by 09/30.

I am still facing this issue.

rashmichandrashekar commented 1 month ago

The fix for this is rolling out currently. It should roll out to all regions by 09/30. I am still facing this issue.

@sivashankaran22 - Could you provide me your cluster id?

adejongh commented 1 month ago

The fix for this is rolling out currently. It should roll out to all regions by 09/30.

What is the version of the fix that you are rolling out, so we can check that it has been deployed?

We still see very high memory use for all the "ama-" pods, which is really not acceptable.
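
One way to check which image the operator pod is currently running, so we can compare once the fixed version number is known (the rsName label is the one used earlier in this thread; adjust if yours differs):

# Show the images currently running in the ama-metrics-operator-targets pod
kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets \
  -o jsonpath='{.items[*].spec.containers[*].image}'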
