flashcatcloud / categraf

one-stop telemetry collector for nightingale
https://flashcat.cloud/docs/
MIT License

Changing hostname causes a categraf memory leak #946

Closed fangpsh closed 4 months ago

fangpsh commented 4 months ago

Relevant config.toml

[global]
# whether print configs
print_configs = false

# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$sn" -> auto detect bios serial number
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = "$ip"

....
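For readers unfamiliar with the placeholders described in the comments above, here is a minimal Go sketch of how such a template could be expanded. This is an illustration only, not categraf's actual code: the function name `expandHostname`, the `vars` map, and the sample values are all hypothetical, and real detection of the hostname, IP, and BIOS serial number is replaced by fixed strings.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHostname substitutes the placeholders documented in config.toml.
// Hypothetical helper for illustration; categraf's real implementation may differ.
func expandHostname(tmpl string, vars map[string]string) string {
	if tmpl == "" {
		// An empty value means "auto detect hostname".
		return vars["$hostname"]
	}
	// NewReplacer scans the template once and swaps each placeholder
	// for its detected value; plain strings such as "xx" pass through.
	r := strings.NewReplacer(
		"$hostname", vars["$hostname"],
		"$ip", vars["$ip"],
		"$sn", vars["$sn"],
	)
	return r.Replace(tmpl)
}

func main() {
	// Assumed sample values standing in for auto-detected ones.
	vars := map[string]string{
		"$hostname": "web1",
		"$ip":       "10.0.0.5",
		"$sn":       "SN123",
	}
	for _, tmpl := range []string{"", "xx", "$ip", "$hostname-$ip"} {
		fmt.Printf("%q -> %q\n", tmpl, expandHostname(tmpl, vars))
	}
}
```

With the sample values, `"$hostname-$ip"` expands to `"web1-10.0.0.5"` and `"xx"` is used unchanged.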

Logs from categraf

2024/05/23 03:58:04 agent.go:49: I! agent started
2024/05/23 03:58:04 metrics_reader.go:54: D! local.net : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.netstat : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.mem : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.processes : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.linux_sysctl_fs : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.self_metrics : before gather once
2024/05/23 03:58:04 metrics_reader.go:54: D! local.nfsclient : before gather once
2024/05/23 03:58:04 metrics_reader.go:60: D! local.nfsclient : after gather once, duration: 121.99µs
2024/05/23 03:58:04 metrics_reader.go:54: D! local.system : before gather once
Killed

System info

categraf 0.3.66, debian12

Docker

No response

Steps to reproduce

  1. Set hostname= in config.toml to any value, for example xx or $hostname

Expected behavior

Actual behavior

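Memory grows continuously until the process is killed by the OOM killer.

This has only happened on this one machine; other CentOS and Ubuntu hosts are fine. After bisecting the configuration in various ways, I found that even with the default configuration, changing hostname in config.toml reproduces the problem 100% of the time. The OOM happens so fast that I could not capture a pprof profile.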

Additional info

top:

64819 root 20 0 5765364 3.9g 70692 R 398.7 50.1 0:31.46 categraf
58593 root 20 0 14300 2676 556 R 0.3 0.0 2:15.66 top

dmesg:

[1881908.835578] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-6.scope,task=categraf,pid=64819,uid=0
[1881908.837467] Out of memory: Killed process 64819 (categraf) total-vm:11096884kB, anon-rss:7433276kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:15212kB oom_score_adj:0

fangpsh commented 4 months ago

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Linux 6.1.0-20-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux

wudihechao commented 4 months ago

Please try version 3.67; this issue should be fixed there.

fangpsh commented 4 months ago

> Please try version 3.67; this issue should be fixed there.

Confirmed: this is fixed in v3.67.