coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0

OOM kill agent #11

Closed kkopanidis closed 1 year ago

kkopanidis commented 1 year ago

Hello coroot team, I have another issue, now centered on the node-agent. Attached are metrics screenshots from Lens.

We are reaching the 1 GB default memory limit on 3 nodes. We have coroot running in 3-4 different clusters, and this behaviour appears in only one of them. We have looked into differences between the nodes but didn't find anything. Do you have any idea why these spikes might be happening and how they could be mitigated?
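
For reference, this is roughly how we have been checking the configured limit and the live usage on the affected nodes. Our install is a plain DaemonSet named coroot-node-agent in the coroot namespace; the names and label are specific to our setup, and kubectl top needs metrics-server:

# Configured resources on the agent container (names are install-specific)
kubectl -n coroot get daemonset coroot-node-agent -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Live per-container usage of the agent pods (requires metrics-server)
kubectl -n coroot top pods -l app=coroot-node-agent --containers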

def commented 1 year ago

@kkopanidis thank you for the report! Can you please share the CPU and Memory profiles of an affected agent?

curl -o mem_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/heap'
curl -o cpu_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/profile?seconds=60'
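
Once downloaded, the profiles can be inspected locally with the standard Go pprof tooling (assuming a Go toolchain is installed; the file names just mirror the curl commands above):

# Top memory consumers from the heap profile
go tool pprof -top mem_profile.tgz
# Interactive web UI for the 60-second CPU profile (port is arbitrary)
go tool pprof -http=:8080 cpu_profile.tgz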
kkopanidis commented 1 year ago

cpu_profile.tgz mem_profile.tgz There you go!

def commented 1 year ago

No, something went wrong. These files contain only "404 page not found".

kkopanidis commented 1 year ago

cpu_profile.tgz mem_profile.tgz Yes, my bad, these seem correct.

def commented 1 year ago

@kkopanidis could you please update coroot-node-agent to version 1.7.1 and check its memory and CPU consumption?
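
If you are running the agent as a plain DaemonSet, updating is a matter of bumping the image tag. A sketch, assuming the DaemonSet is named coroot-node-agent in the coroot namespace, the container is named node-agent, and the image is pulled from ghcr.io/coroot/coroot-node-agent (a Helm-based install would instead bump the image tag in the chart values):

kubectl -n coroot set image daemonset/coroot-node-agent node-agent=ghcr.io/coroot/coroot-node-agent:1.7.1
kubectl -n coroot rollout status daemonset/coroot-node-agent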

kkopanidis commented 1 year ago

Works much better: CPU at ~10% and memory at ~700 MB (for the pod that had issues before). Will continue monitoring and let you know if it starts misbehaving again. Thanks for the assistance!
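
In case it helps anyone else watching for regressions, we will keep an eye on OOM kills of the agent pods with something like the following (namespace and label match our install, adjust as needed):

# Print each agent pod and the reason of its last container termination (OOMKilled would show up here)
kubectl -n coroot get pods -l app=coroot-node-agent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'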