elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Heartbeat] K8s autodiscovery memory leak #31283

Closed emilioalvap closed 2 months ago

emilioalvap commented 2 years ago

Summary

Related to #31115.

Initial investigation seems to point to a memory leak on the Heartbeat side of the K8s autodiscovery provider, even when no pods/services are matched by the provider.

$ kubectl --namespace elastic-stack top pods | grep beat
heartbeat-96f6fc9fd-rjsch                 110m         451Mi
metricbeat-6b84bddf96-bthlp               3m           52Mi

Memory allocation grows with each pod deployed until the container reaches the specified limit and is forcefully restarted. This issue prevents monitoring even a small number of replicas in a cluster with constant activity.
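For reference, the forced restart described above happens because the container exceeds its configured memory limit and is OOM-killed. A minimal sketch of the kind of resources block involved, with placeholder values since the actual limits used in this deployment are not shown in the issue:

resources:
  limits:
    memory: 512Mi    # placeholder; once heap growth pushes usage past this, the container is OOM-killed and restarted
  requests:
    cpu: 100m
    memory: 128Mi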

Steps to Reproduce:

  1. Create an initial deployment with:

    • 1 heartbeat pod with an autodiscovery provider configured.
    • N "dummy" pods. These will just be used to simulate K8s autodiscovery load by being created/torn down.

    Here's an example:

    To deploy:

    $ kubectl create -f heartbeat-deployment.yml

    Initially there will be no monitors reported in Kibana, and heartbeat pod memory will be minimal and stable:

    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   1m           36Mi

    This is because we have specified a condition in the autodiscovery provider that will not match any of the containers we will be creating afterwards, so no monitors are generated (a fuller sketch of this provider configuration follows the steps below):

    templates:
      - condition:
          contains:
            kubernetes.io/description: "YYYY"
  2. Start generating K8s autodiscovery load by deploying dummy-deployment.yml continuously (a minimal loop is sketched after these steps) and checking reported memory usage after each deployment:

    $ kubectl replace --force -f dummy-deployment.yml
    configmap/heartbeat-low replaced
    deployment.apps/elastic-deployment-low replaced
    
    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   21m          39Mi
    
    ....
    
    $ kubectl --namespace elastic-stack top pods | grep beat
    heartbeat-7f566cd9fb-9rx9b   71m          194Mi

    The amount of memory allocated after each deployment is directly influenced by two factors:

    1. The number of replicas specified in dummy-deployment.yml: replicas: 50
    2. The number of Heartbeat autodiscovery providers configured; the provided example uses only one, but multiple providers cause even greater memory usage.
  3. Eventually, once the container reaches the maximum memory allowed, it will be forcefully restarted.
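For reference, a minimal sketch of what the autodiscover section of the heartbeat.yml in such a deployment could look like; apart from the non-matching condition shown in step 1, the provider and monitor settings below are assumptions, not the exact configuration attached to this issue:

heartbeat.autodiscover:
  providers:
    - type: kubernetes
      resource: pod
      templates:
        - condition:
            contains:
              kubernetes.io/description: "YYYY"   # deliberately never matches, so no monitors are created
          config:
            - type: tcp
              hosts: ["${data.host}:${data.port}"]
              schedule: "@every 10s"

The redeploy churn from step 2 can be driven with a simple shell loop; the interval below is arbitrary and only needs to give the 50 dummy replicas time to be created and torn down:

while true; do
  kubectl replace --force -f dummy-deployment.yml
  sleep 120
  kubectl --namespace elastic-stack top pods | grep beat
done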

Tip: The provided example configuration has --httpprof enabled for the container, so we can check memory allocation graphs while it's running by forwarding ports and using the pprof tool:

$ kubectl --namespace elastic-stack port-forward heartbeat-7f566cd9fb-9rx9b 6060:6060
Forwarding from 127.0.0.1:6060 -> 6060
Forwarding from [::1]:6060 -> 6060

(Separate session)
$ go tool pprof -http=:8080 http://cc88-81-35-33-226.ngrok.io/debug/pprof/heap
Fetching profile over HTTP from http://cc88-81-35-33-226.ngrok.io/debug/pprof/heap
Saved profile in /home/jumpeax/pprof/pprof.heartbeat.alloc_objects.alloc_space.inuse_objects.inuse_space.010.pb.gz
Serving web UI on http://localhost:8080
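Since the port-forward above already exposes the profiling endpoint on 127.0.0.1:6060, the tunnel is optional; for example, heap data can be pulled and compared locally (the snapshot file names below are placeholders, not the ones saved above):

$ go tool pprof -top http://localhost:6060/debug/pprof/heap
$ go tool pprof -top -base heap_before.pb.gz heap_after.pb.gz   # diff two snapshots taken a few deployments apart to isolate the growth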

elasticmachine commented 2 years ago

Pinging @elastic/uptime (Team:Uptime)

ChrsMark commented 2 years ago

Pinging @gizas @mlunadia @rameshelastic for awareness.

emilioalvap commented 2 years ago

@ChrsMark, I've recorded the heap profiles you requested; it's all in the zip file. CC @andrewvc

Files:

emilioalvap commented 2 years ago

I'll be removing the Team:Uptime assignment since we have agreed that a potential solution for this issue is not on the board for us right now. I'll leave it to Team:Cloudnative-Monitoring to prioritise it.

botelastic[bot] commented 8 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!