coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0

Fails to start with docker run #21

Open snoby opened 1 year ago

snoby commented 1 year ago

Given the documented docker run line

docker run -it --name coroot-node-agent \
    --privileged --pid host \
    -v /sys/kernel/debug:/sys/kernel/debug:rw \
    -v /sys/fs/cgroup:/host/sys/fs/cgroup:ro \
    ghcr.io/coroot/coroot-node-agent --cgroupfs-root=/host/sys/fs/cgroup

on an Ubuntu 20.04 host (not running in k8s), the container starts and immediately exits:

I0609 20:47:54.413530  101556 cilium.go:29] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
I0609 20:47:54.413651  101556 cilium.go:35] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I0609 20:47:54.413676  101556 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I0609 20:47:54.413704  101556 cilium.go:42] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I0609 20:47:54.413730  101556 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I0609 20:47:54.413764  101556 cilium.go:51] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I0609 20:47:54.414233  101556 main.go:81] agent version: 1.8.6
I0609 20:47:54.414328  101556 main.go:87] hostname: alpha-pg-1
I0609 20:47:54.414356  101556 main.go:88] kernel version: 5.4.0-1092-kvm
I0609 20:47:54.414461  101556 main.go:71] machine-id:  5d42852a98ec471c9d4c9ee29536a7f6
I0609 20:47:54.414509  101556 tracing.go:29] no OpenTelemetry collector endpoint configured
I0609 20:47:54.414945  101556 metadata.go:66] cloud provider:
I0609 20:47:54.415018  101556 collector.go:157] instance metadata: <nil>
I0609 20:47:57.420953  101556 containerd.go:37] using /run/containerd/containerd.sock
F0609 20:47:57.531383  101556 main.go:112] failed to link program: trace event syscalls/sys_enter_read: file does not exist

I've verified that the paths are correct: both /sys/kernel/debug and /sys/fs/cgroup exist on the host.

On the host, the /proc/1/root/sys/fs/bpf/ directory is empty (checked with ls /proc/1/root/sys/fs/bpf/).

I'm using the latest Docker image.

def commented 1 year ago

@snoby, thank you for the report! I have successfully reproduced the issue on the 5.4.0-1092-kvm kernel. It appears that certain tracepoints are disabled in "-kvm" kernels.

grep CONFIG_FTRACE_SYSCALLS "/boot/config-$(uname -r)"
# CONFIG_FTRACE_SYSCALLS is not set

Unfortunately, the agent cannot function without the syscalls/* tracepoints, and currently, no workarounds come to mind.
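For anyone hitting the same error, the tracepoint can also be checked for directly in tracefs, without relying on the kernel config file. The sketch below assumes tracefs is mounted at /sys/kernel/tracing or at /sys/kernel/debug/tracing (the latter matches the /sys/kernel/debug mount in the docker run command above). The helper names has_tracepoint and check_syscall_tracepoints are hypothetical and are not part of the agent.

```shell
#!/bin/sh
# has_tracepoint ROOT CATEGORY/NAME
# Succeeds if the tracepoint directory exists under the given tracefs root.
# (Hypothetical helper for illustration; not part of coroot-node-agent.)
has_tracepoint() {
    [ -d "$1/events/$2" ]
}

# check_syscall_tracepoints ROOT...
# Tries each candidate tracefs root in order and reports whether
# syscalls/sys_enter_read (the tracepoint named in the fatal log line) exists.
check_syscall_tracepoints() {
    for root in "$@"; do
        if has_tracepoint "$root" "syscalls/sys_enter_read"; then
            echo "found: $root/events/syscalls/sys_enter_read"
            return 0
        fi
    done
    echo "missing: syscalls/sys_enter_read"
    return 1
}

# Example invocation (usually requires root, since tracefs is not world-readable):
# check_syscall_tracepoints /sys/kernel/tracing /sys/kernel/debug/tracing
```

If the function reports the tracepoint as missing on both roots, the kernel was most likely built without CONFIG_FTRACE_SYSCALLS, which matches the grep result above.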

snoby commented 1 year ago

grep CONFIG_FTRACE_SYSCALLS "/boot/config-$(uname -r)"

Thank you for following up on this. The Ubuntu kernels in AWS do indeed have this feature enabled. I will research whether there is an Ubuntu cloud image kernel with this feature enabled that I can install.
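When comparing candidate kernels, it may help to run the same grep across every kernel config installed on the host. This is a minimal sketch, assuming the configs follow the /boot/config-<version> naming used in the command above; the function name list_ftrace_syscalls is hypothetical.

```shell
#!/bin/sh
# list_ftrace_syscalls [BOOT_DIR]
# Prints one line per kernel config file found in BOOT_DIR (default: /boot),
# stating whether CONFIG_FTRACE_SYSCALLS=y is set in it.
# (Hypothetical helper for illustration.)
list_ftrace_syscalls() {
    boot_dir="${1:-/boot}"
    for cfg in "$boot_dir"/config-*; do
        [ -f "$cfg" ] || continue   # skip the literal pattern if the glob matched nothing
        if grep -q '^CONFIG_FTRACE_SYSCALLS=y' "$cfg"; then
            echo "${cfg##*/config-}: enabled"
        else
            echo "${cfg##*/config-}: disabled"
        fi
    done
}

# Example invocation:
# list_ftrace_syscalls /boot
```

A kernel listed as "enabled" still has to be the one that is actually booted, which uname -r confirms.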

Thanks for your help!

hsblhsn commented 4 months ago

Unfortunately, the agent cannot function without the syscalls/* tracepoints, and currently, no workarounds come to mind.

Can we at least add a feature to skip those tracepoints, such as a --disable-syscall-tracepoints flag? I believe in that case we would only lose the service-map, right?

EDIT: I changed the code to ignore the error, and the agent is working fine. Surprisingly, I can even see the service-map.