coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0

The metrics interface of some nodes cannot respond normally #44

Closed wenhuwang closed 9 months ago

wenhuwang commented 9 months ago

Env

# k get node -owide
NAME          STATUS   ROLES    AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.165.6.25   Ready    node     584d     v1.19.4   10.165.6.25   <none>        Ubuntu 18.04.6 LTS      5.4.187-0504187-generic       docker://19.3.13
10.165.6.26   Ready    node     581d     v1.19.4   10.165.6.26   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.6.27   Ready    node     581d     v1.19.4   10.165.6.27   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.8.23   Ready    node     109d     v1.19.4   10.165.8.23   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
....

# helm -n coroot list
NAME    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART           APP VERSION
coroot  coroot      1           2023-10-25 15:48:12.919318 +0800 CST    deployed    coroot-0.5.1    0.21.0

# k -n coroot get pods -owide | grep coroot-node-agent
coroot-node-agent-249ws                           1/1     Running   0          36m     10.165.208.69    10.165.6.27   <none>           <none>
coroot-node-agent-6bxlb                           1/1     Running   0          4h27m   10.165.204.252   10.165.8.23   <none>           <none>
coroot-node-agent-tfhdw                           1/1     Running   6          4h27m   10.165.210.2     10.165.6.26   <none>           <none>
coroot-node-agent-89xqp                           1/1     Running   7          4h26m   10.165.202.98    10.165.6.25   <none>           <none>

Description

On some nodes, the coroot-node-agent is reported as down: its metrics endpoint does not respond normally.

image

The CPU profile shows that the netlink.AddrList function accounts for more than 70% of the CPU time.

image

CPU usage of the abnormal coroot-node-agent pods is about 2.5 cores, while a normal pod uses about 0.2 cores.

image

All nodes have similar configurations and pod counts. Please help me troubleshoot this problem.

apetruhin commented 9 months ago

@wenhuwang, could you please share the CPU and Memory profiles of an affected agent?

curl -o mem_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/heap'
curl -o cpu_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/profile?seconds=60'
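
Before sharing the downloaded files, it can help to confirm that each one is really a profile and not an error page. The following is a minimal sketch in Python, assuming the agent serves these endpoints through Go's standard net/http/pprof handlers, which normally return gzip-compressed data. `looks_like_pprof` is a hypothetical helper name, not part of coroot-node-agent.

```python
import gzip
import zlib


def looks_like_pprof(data: bytes) -> bool:
    """Return True if data is a non-empty, valid gzip stream.

    Assumption: Go pprof profiles are gzip-compressed protobuf.
    This only checks the gzip layer; it does not validate the protobuf.
    """
    if not data.startswith(b"\x1f\x8b"):  # gzip magic bytes
        return False
    try:
        return len(gzip.decompress(data)) > 0
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupted download.
        return False


if __name__ == "__main__":
    import sys

    # Usage: python check_profile.py cpu_profile.tgz mem_profile.tgz
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            ok = looks_like_pprof(f.read())
        verdict = "looks like a pprof profile" if ok else "not a gzip-compressed profile"
        print(f"{path}: {verdict}")
```

A plain-text response such as an HTTP error body fails the magic-byte check, so a bad `<agent_ip>` or `<agent_port>` is caught before the file is attached to the issue.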

wenhuwang commented 9 months ago

cpu_profile.tgz mem_profile.tgz @apetruhin

wenhuwang commented 9 months ago

goroutine_profile.tgz Is this a goroutine leak?

image
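
One way to check a suspected goroutine leak is to sample the goroutine count several times and see whether it keeps growing. The sketch below assumes the agent also serves the standard Go text endpoint `/debug/pprof/goroutine?debug=1`, whose first line has the form `goroutine profile: total N`. The function names are hypothetical and not part of coroot-node-agent.

```python
import re
from typing import Sequence

# First line of Go's text-format goroutine profile (debug=1).
_TOTAL_RE = re.compile(r"^goroutine profile: total (\d+)")


def parse_goroutine_total(text: str) -> int:
    """Extract the total goroutine count from a debug=1 goroutine profile."""
    match = _TOTAL_RE.match(text)
    if match is None:
        raise ValueError("not a goroutine profile in debug=1 text format")
    return int(match.group(1))


def is_steadily_growing(samples: Sequence[int]) -> bool:
    """Return True if there are at least 3 samples and each exceeds the previous one."""
    return len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:]))
```

The samples could come from saving the output of `curl 'http://<agent_ip>:<agent_port>/debug/pprof/goroutine?debug=1'` a few minutes apart. Steady growth is only a hint of a leak, not proof; the stack traces in the profile are still needed to find where the goroutines are blocked.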

apetruhin commented 9 months ago

@wenhuwang, could you please update coroot-node-agent to the latest version, 1.14.1 (the Helm chart has also been updated), and verify whether this issue has been resolved?

wenhuwang commented 9 months ago

OK, I will try.

wenhuwang commented 9 months ago

@apetruhin, this issue was resolved after upgrading to the latest version. Thank you.

apetruhin commented 9 months ago

@wenhuwang, thank you for providing such comprehensive details on the issue.