coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0
332 stars 61 forks

Support for 4.x kernels has been dropped? #106

Open FutureMatt opened 4 months ago

FutureMatt commented 4 months ago

I can't see anything obvious in the changelogs, but it looks like support for Linux 4.x kernels was dropped at some point after 1.18.9. We currently run clusters with a mix of 4.19.0-19 and 5.10.0-29 kernels, and the clusters with 4.x kernels now fail to deploy the node agent with the following log output:

I0705 09:18:07.156568   85825 net.go:20] whitelisted public IPs: [0.0.0.0/0]
I0705 09:18:07.156905   85825 net.go:32] ephemeral-port-range: 32768-60999
I0705 09:18:07.164387   85825 cilium.go:30] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct4_global: no such file or directory
I0705 09:18:07.164448   85825 cilium.go:36] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_ct6_global: no such file or directory
I0705 09:18:07.164460   85825 cilium.go:43] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v2: no such file or directory
I0705 09:18:07.164472   85825 cilium.go:43] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb4_backends_v3: no such file or directory
I0705 09:18:07.164483   85825 cilium.go:52] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v2: no such file or directory
I0705 09:18:07.164491   85825 cilium.go:52] Unable to get object /proc/1/root/sys/fs/bpf/tc/globals/cilium_lb6_backends_v3: no such file or directory
I0705 09:18:07.167570   85825 main.go:111] agent version: 1.20.3
I0705 09:18:07.167635   85825 main.go:117] hostname: xxxxxxxxx-worker-1
I0705 09:18:07.167644   85825 main.go:118] kernel version: 4.19.0-18-amd64
I0705 09:18:07.169872   85825 main.go:75] machine-id:  xxxxxxxxxxxxxxxxx
I0705 09:18:07.169971   85825 tracing.go:37] OpenTelemetry traces collector endpoint: http://coroot:8080/v1/traces
I0705 09:18:07.170090   85825 otel.go:29] OpenTelemetry logs collector endpoint: http://coroot:8080/v1/logs
I0705 09:18:07.170401   85825 metadata.go:67] cloud provider:
I0705 09:18:07.170419   85825 collector.go:157] instance metadata: <nil>
I0705 09:18:07.170670   85825 profiling.go:52] profiles endpoint: http://coroot:8080/v1/profiles
E0705 09:18:07.198354   85825 profiling.go:100] load bpf objects: field DisassociateCtty: program disassociate_ctty: apply CO-RE relocations: load kernel spec: no BTF found for kernel version 4.19.0-18-amd64: not supported
I0705 09:18:10.202542   85825 containerd.go:38] using /run/containerd/containerd.sock
W0705 09:18:10.202604   85825 registry.go:85] stat /proc/1/root/var/run/crio/crio.sock: no such file or directory
E0705 09:18:10.234982   85825 tracer.go:191] load program: argument list too long:
F0705 09:18:10.235037   85825 main.go:149] failed to load collection: program sys_enter_sendmmsg: load program: argument list too long
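
The `no BTF found` errors in the log are reported at error level by the profiling loader, and the agent keeps running until the fatal `sys_enter_sendmmsg` line. CO-RE relocations need the running kernel's BTF type information, which BTF-enabled kernels expose at `/sys/kernel/btf/vmlinux`. A minimal sketch of checking for that file is shown here; the helper name `btf_available` is hypothetical and is not part of the agent.

```c
#include <assert.h>
#include <unistd.h>

/* Standard sysfs location where BTF-enabled kernels expose type information. */
#define KERNEL_BTF_PATH "/sys/kernel/btf/vmlinux"

/* Hypothetical helper: returns 1 if a readable entry exists at `path`, else 0. */
int btf_available(const char *path) {
    assert(path != NULL);
    return access(path, R_OK) == 0;
}
```

Calling `btf_available(KERNEL_BTF_PATH)` on a node gives a quick indication of whether CO-RE based programs have a chance of loading there.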

def commented 4 months ago

It wasn't intentional. We added an eBPF program with more instructions than the others, and kernel 4.19 has a lower limit on the number of instructions in an eBPF program.
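
For context, mainline kernels before 5.2 reject any eBPF program longer than 4096 instructions (`BPF_MAXINSNS`) with `E2BIG`, which is printed as "argument list too long". From 5.2 onward, privileged loaders get a verifier budget of one million instructions. The sketch below maps a `uname -r` style release string to the limit that would apply. The helper name is hypothetical, and distribution kernels with backports may behave differently.

```c
#include <assert.h>
#include <stdio.h>

/* Per-program instruction cap enforced by kernels before 5.2 (BPF_MAXINSNS). */
#define LEGACY_INSN_LIMIT 4096L
/* Verifier budget for privileged loaders on 5.2 and later. */
#define MODERN_INSN_LIMIT 1000000L

/* Hypothetical helper: maps a release string such as "4.19.0-18-amd64" to the
 * instruction limit. Returns -1 if the string has no "major.minor" prefix. */
long bpf_insn_limit_for_release(const char *release) {
    assert(release != NULL);
    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2) {
        return -1;
    }
    if (major > 5 || (major == 5 && minor >= 2)) {
        return MODERN_INSN_LIMIT;
    }
    return LEGACY_INSN_LIMIT;
}
```

For the two kernels mentioned in this issue, `4.19.0-18-amd64` falls under the 4096-instruction cap and `5.10.0-29` does not.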

FutureMatt commented 4 months ago

Are there plans to support 4.x kernels again, or should the minimum requirements listed in the readme be updated? It currently says:

"It uses eBPF to track container related events such as TCP connects, so the minimum supported Linux kernel version is 4.16."

guolifu commented 4 months ago

It seems to be caused by this code: the loop is unrolled, so its very long body is emitted twice.


SEC("tracepoint/syscalls/sys_enter_sendmmsg")
int sys_enter_sendmmsg(struct trace_event_raw_sys_enter_rw__stub* ctx) {
    __u64 offset = 0;
    // Fully unrolled: the loop body is emitted once per iteration (twice here).
    #pragma unroll
    for (int i = 0; i <= 1; i++) {
        // ctx->size bounds how many headers are read.
        if (i >= ctx->size) {
            break;
        }
        // Copy the next struct mmsghdr from ctx->buf + offset.
        struct mmsghdr h = {};
        if (bpf_probe_read(&h, sizeof(h), (void *)(ctx->buf + offset))) {
            return 0;
        }
        offset += sizeof(h);
        trace_enter_write(ctx, ctx->fd, 0, (char*)h.msg_hdr.msg_iov, 0, h.msg_hdr.msg_iovlen);
    }
    return 0;
}
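
To illustrate why unrolling matters here, the following sketch does the instruction arithmetic for a fully unrolled loop. The setup and body sizes used in the example are hypothetical and were not measured from the agent; only the 4096-instruction cap of pre-5.2 kernels is a known value.

```c
#include <assert.h>

/* Per-program instruction cap on kernels before 5.2 (BPF_MAXINSNS). */
#define LEGACY_INSN_LIMIT 4096L

/* Rough size of a fully unrolled loop: fixed setup plus one copy of the
 * body per iteration. All inputs are caller-supplied estimates. */
long unrolled_insn_count(long setup, long body, int iterations) {
    assert(setup >= 0 && body >= 0 && iterations >= 0);
    return setup + body * (long)iterations;
}

/* Returns 1 if a program of `insn_count` instructions fits the legacy cap. */
int fits_legacy_limit(long insn_count) {
    return insn_count <= LEGACY_INSN_LIMIT;
}
```

With an assumed setup of 50 instructions and an assumed body of 2500, one iteration gives 2550 instructions and fits, while two iterations give 5050 and exceed the cap. That matches the pattern described in this thread, where a program that unrolls a long body twice fails to load on 4.19.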