elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Heartbeat] K8s autodiscovery memory leak #31283

Closed emilioalvap closed 2 months ago

emilioalvap commented 2 years ago

Summary

Related to #31115.

Initial investigation seems to point to a memory leak on the Heartbeat side of the K8s autodiscovery provider, even when no pods/services are matched by the provider.

$ kubectl --namespace elastic-stack top pods | grep beat
heartbeat-96f6fc9fd-rjsch                 110m         451Mi
metricbeat-6b84bddf96-bthlp               3m           52Mi

Memory allocation grows with each pod deployed until the container reaches the specified limit and is forcefully restarted. This issue prevents monitoring even a small number of replicas in a cluster with constant activity.
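For reference, the forced restart described above happens because the container exceeds its configured memory limit and is OOM-killed. A minimal sketch of the kind of resources block involved, with placeholder values since the actual limits used in this deployment are not shown in the issue:

resources:
  limits:
    memory: 512Mi    # placeholder; once heap growth pushes usage past this, the container is OOM-killed and restarted
  requests:
    cpu: 100m
    memory: 128Mi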

Steps to Reproduce:

  1. Create an initial deployment with:

    • 1 heartbeat pod with an autodiscovery provider configured.
    • N "dummy" pods. These will just be used to simulate K8s autodiscovery load by being created/torn down.

    Here's an example:

    To deploy:

    $ kubectl create -f heartbeat-deployment.yml

    Initially there will be no monitors reported in Kibana, and heartbeat pod memory will be minimal and stable:

    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   1m           36Mi

    This is because we have specified a condition in the autodiscovery provider that will not match any of the containers we will be creating afterwards, so no monitors are generated (a fuller sketch of this provider configuration follows the steps below):

    templates:
      - condition:
          contains:
            kubernetes.io/description: "YYYY"
  2. Start generating K8s autodiscovery load by deploying dummy-deployment.yml continuously (a minimal loop is sketched after these steps) and checking reported memory usage after each deployment:

    $ kubectl replace --force -f dummy-deployment.yml
    configmap/heartbeat-low replaced
    deployment.apps/elastic-deployment-low replaced
    
    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   21m          39Mi
    
    ....
    
    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   71m          194Mi

    The amount of memory allocated after each deployment is directly influenced by two factors:

    1. The number of replicas specified in dummy-deployment.yml: replicas: 50
    2. The number of Heartbeat autodiscovery providers configured; the provided example uses only one, but multiple providers cause even greater memory usage.
  3. Eventually, once the container reaches the maximum memory allowed, it will be forcefully restarted.
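For reference, a minimal sketch of what the autodiscover section of the heartbeat.yml in such a deployment could look like; apart from the non-matching condition shown in step 1, the provider and monitor settings below are assumptions, not the exact configuration attached to this issue:

heartbeat.autodiscover:
  providers:
    - type: kubernetes
      resource: pod
      templates:
        - condition:
            contains:
              kubernetes.io/description: "YYYY"   # deliberately never matches, so no monitors are created
          config:
            - type: tcp
              hosts: ["${data.host}:${data.port}"]
              schedule: "@every 10s"

The redeploy churn from step 2 can be driven with a simple shell loop; the interval below is arbitrary and only needs to give the 50 dummy replicas time to be created and torn down:

while true; do
  kubectl replace --force -f dummy-deployment.yml
  sleep 120
  kubectl --namespace elastic-stack top pods | grep beat
done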

Tip: The provided example configuration has --httpprof enabled for the container, so we can check memory allocation graphs while it's running by forwarding ports and using the pprof tool:

$ kubectl --namespace elastic-stack port-forward heartbeat-7f566cd9fb-9rx9b 6060:6060
Forwarding from 127.0.0.1:6060 -> 6060
Forwarding from [::1]:6060 -> 6060

(Separate session)
$ go tool pprof -http=:8080 http://cc88-81-35-33-226.ngrok.io/debug/pprof/heap
Fetching profile over HTTP from http://cc88-81-35-33-226.ngrok.io/debug/pprof/heap
Saved profile in /home/jumpeax/pprof/pprof.heartbeat.alloc_objects.alloc_space.inuse_objects.inuse_space.010.pb.gz
Serving web UI on http://localhost:8080
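Since the port-forward above already exposes the profiling endpoint on 127.0.0.1:6060, the tunnel is optional; for example, heap data can be pulled and compared locally (the snapshot file names below are placeholders, not the ones saved above):

$ go tool pprof -top http://localhost:6060/debug/pprof/heap
$ go tool pprof -top -base heap_before.pb.gz heap_after.pb.gz   # diff two snapshots taken a few deployments apart to isolate the growth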

elasticmachine commented 2 years ago

Pinging @elastic/uptime (Team:Uptime)

ChrsMark commented 2 years ago

Pinging @gizas @mlunadia @rameshelastic for awareness.

emilioalvap commented 2 years ago

@ChrsMark, I've recorded the heap profiles you requested; it's all in the zip file. CC @andrewvc

Files:

emilioalvap commented 2 years ago

I'll be removing the Team:Uptime assignment since we have agreed that a potential solution for this issue is not on the board for us right now. I'll leave it to Team:Cloudnative-Monitoring to prioritise it.

botelastic[bot] commented 8 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!