Closed by dashpole 7 months ago
I was able to remove privileged: true in https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/pull/80. I also tried removing the SYS_ADMIN capability, but that resulted in this error:
time=2024-02-20T19:52:30.855Z level=ERROR msg="Beyla couldn't find target process" error="couldn't start Process Finder: can't instantiate discovery.ProcessFinder pipeline: instantiating terminal instance \"TraceAttacher\": can't mount BPF filesystem: operation not permitted"
That was when I ran with:
capabilities:
  add:
  - all
  drop:
  - SYS_ADMIN
That would seem to imply that, even with all other capabilities, Beyla can't mount the BPF filesystem without SYS_ADMIN.
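To double-check which capabilities the container really ends up with after the add/drop above, something like the following could be run inside it. This is only a minimal sketch and not part of Beyla: parseCapEff and hasCap are hypothetical helper names, and it assumes the CapEff line in /proc/self/status is the hex bitmask of effective capabilities, with bit numbers taken from linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Capability bit numbers from linux/capability.h.
const (
	capNetAdmin = 12
	capSysAdmin = 21
	capPerfmon  = 38
	capBPF      = 39
)

// parseCapEff extracts the effective capability bitmask from the text of a
// /proc/<pid>/status file. (Hypothetical helper, for illustration only.)
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(hex, 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff line not found")
}

// hasCap reports whether capability number c is set in mask.
func hasCap(mask uint64, c uint) bool {
	return mask&(uint64(1)<<c) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Println("cannot parse capabilities:", err)
		return
	}
	fmt.Printf("CapEff=%#x\n", mask)
	fmt.Println("CAP_SYS_ADMIN:", hasCap(mask, capSysAdmin))
	fmt.Println("CAP_BPF:      ", hasCap(mask, capBPF))
	fmt.Println("CAP_PERFMON:  ", hasCap(mask, capPerfmon))
	fmt.Println("CAP_NET_ADMIN:", hasCap(mask, capNetAdmin))
}
```

With the securityContext above, I would expect CAP_SYS_ADMIN to be reported as false and the others as true, which would at least confirm that the drop took effect.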
I also tried to work around this limitation by having a privileged init container mount the BPF filesystem, similar to https://github.com/cilium/cilium/pull/14446/files#diff-264b5e646aa5ad3db682a4a0a9cd4b4cbbae238d88b033d340e901060c89394aR447, but I still got the same error.
initContainers:
# Mount the bpf fs if it is not mounted. We will perform this task
# from a privileged container because the mount propagation bidirectional
# only works from privileged containers.
- name: mount-bpf-fs
  image: grafana/beyla:1.2.0
  args:
  - 'mount | grep "/sys/fs/bpf type bpf" || mount -t bpf bpf /sys/fs/bpf'
  command:
  - /bin/bash
  - -c
  - --
  securityContext:
    privileged: true
  volumeMounts:
  - name: bpffs
    mountPath: /sys/fs/bpf
    mountPropagation: Bidirectional
I also set the mountPropagation to HostToContainer in the bpffs mount for the main beyla container.
I still got this error:
time=2024-02-20T19:46:34.036Z level=ERROR msg="Beyla couldn't find target process" error="couldn't start Process Finder: can't instantiate discovery.ProcessFinder pipeline: instantiating terminal instance \"TraceAttacher\": can't mount BPF filesystem: operation not permitted"
I suspect this might be because Cilium always calls unix.Mount() on /sys/fs/bpf itself, whereas Beyla is trying to mount a sub-directory that is unique to its process:
time=2024-02-20T21:18:06.495Z level=DEBUG msg="mounting BPF map pinning" component=discover.TraceAttacher path=/sys/fs/bpf/beyla-1327249
That might be to allow multiple Beyla instances to run on the same host, but that isn't as useful with a DaemonSet, which runs one instance per node.
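To illustrate the suspicion: the init container only makes /sys/fs/bpf itself a bpf mount point, so a per-process sub-directory would not show up as one in the mount table. Below is a minimal sketch of that check, roughly the Go equivalent of the `mount | grep` in the init container. isBPFMountPoint is a hypothetical helper and not Beyla's actual logic; I have not confirmed whether Beyla checks for an existing mount before mounting.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isBPFMountPoint reports whether path is listed in the mount table as a
// mount of type bpf. mounts is the text of /proc/mounts, where each line is
// "device mountpoint fstype options dump pass".
// (Hypothetical helper, for illustration only.)
func isBPFMountPoint(mounts, path string) bool {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) >= 3 && f[1] == path && f[2] == "bpf" {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("cannot read /proc/mounts:", err)
		return
	}
	// The second path is the per-process pin path from the debug log above.
	for _, p := range []string{"/sys/fs/bpf", "/sys/fs/bpf/beyla-1327249"} {
		fmt.Printf("%s is a bpf mount point: %v\n", p, isBPFMountPoint(string(data), p))
	}
}
```

If the sub-directory is never a mount point of its own, then any code that insists on mounting there would still need the mount privilege, no matter what the init container did to the parent directory.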
I'm seeing a new error when upgrading to Beyla 1.3.3 without privileged: true:
https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/pull/86/files#r1513123975
To be clear, no telemetry is produced when this error occurs. I added a TODO to address the problem: https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/1b98711162e6ce66f9d5b4b73e001451349f2b2a/recipes/beyla-golden-signals/beyla-daemonset.yaml#L46-L47
Here are the debug logs around the error:
time=2024-03-05T16:27:06.863Z level=INFO msg="system wide instrumentation. Creating a single instrumenter" component=discover.TraceAttacher
time=2024-03-05T16:27:06.863Z level=DEBUG msg="running tracer for new process" component=beyla.Instrumenter inode=287607 pid=3377 exec=/fluent-bit/bin/fluent-bit
time=2024-03-05T16:27:06.863Z level=DEBUG msg="starting process tracer" component=ebpf.ProcessTracer path=/fluent-bit/bin/fluent-bit pid=3377
time=2024-03-05T16:27:06.863Z level=DEBUG msg="loading eBPF program" component=ebpf.ProcessTracer program=*httpfltr.Tracer PinPath=/sys/fs/bpf/beyla-256592 pid=3377 cmd=/fluent-bit/bin/fluent-bit
time=2024-03-05T16:27:06.923Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_accept probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_accept4)#79}"
time=2024-03-05T16:27:06.955Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_accept4 probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_accept4)#79}"
time=2024-03-05T16:27:06.992Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sock_alloc probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sock_alloc)#78}"
time=2024-03-05T16:27:07.015Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_connect probes="{Required:true Start:Kprobe(kprobe_tcp_connect)#60 End:<nil>}"
time=2024-03-05T16:27:07.029Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_recvmsg probes="{Required:true Start:Kprobe(kprobe_tcp_recvmsg)#65 End:Kprobe(kretprobe_tcp_recvmsg)#82}"
time=2024-03-05T16:27:07.064Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_clone3 probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_clone)#80}"
time=2024-03-05T16:27:07.084Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_exit probes="{Required:true Start:Kprobe(kprobe_sys_exit)#58 End:<nil>}"
time=2024-03-05T16:27:07.101Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_rcv_established probes="{Required:true Start:Kprobe(kprobe_tcp_rcv_established)#63 End:<nil>}"
time=2024-03-05T16:27:07.114Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_connect probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_connect)#81}"
time=2024-03-05T16:27:07.150Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_sendmsg probes="{Required:true Start:Kprobe(kprobe_tcp_sendmsg)#76 End:<nil>}"
time=2024-03-05T16:27:07.164Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_clone probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_clone)#80}"
time=2024-03-05T16:27:07.186Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/fluent-bit/bin/fluent-bit pid=3377 error="attaching socket filter: operation not permitted"
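Since the failing step is easy to miss among the DEBUG lines, here is a minimal sketch for pulling only the ERROR entries out of logs in this key=value format. It is a hand-rolled parser with a hypothetical parseLogfmt helper, not anything shipped with Beyla, and it assumes values are either bare tokens or double-quoted strings with backslash escapes, as in the lines above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseLogfmt splits one key=value log line into a map. Values may be bare
// tokens or double-quoted strings with backslash escapes. If a key repeats,
// the last value wins. (Hypothetical helper, for illustration only.)
func parseLogfmt(line string) map[string]string {
	fields := map[string]string{}
	i := 0
	for i < len(line) {
		// Skip separating spaces.
		for i < len(line) && line[i] == ' ' {
			i++
		}
		start := i
		for i < len(line) && line[i] != '=' && line[i] != ' ' {
			i++
		}
		key := line[start:i]
		if i >= len(line) || line[i] != '=' {
			// Bare key without a value.
			if key != "" {
				fields[key] = ""
			}
			continue
		}
		i++ // skip '='
		if i < len(line) && line[i] == '"' {
			i++ // skip opening quote
			var sb strings.Builder
			for i < len(line) && line[i] != '"' {
				if line[i] == '\\' && i+1 < len(line) {
					i++ // keep the escaped character, drop the backslash
				}
				sb.WriteByte(line[i])
				i++
			}
			i++ // skip closing quote
			fields[key] = sb.String()
		} else {
			start = i
			for i < len(line) && line[i] != ' ' {
				i++
			}
			fields[key] = line[start:i]
		}
	}
	return fields
}

func main() {
	// Reads log lines from stdin and prints only the ERROR entries.
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		f := parseLogfmt(sc.Text())
		if f["level"] == "ERROR" {
			fmt.Printf("%s [%s] %s: %s\n", f["time"], f["component"], f["msg"], f["error"])
		}
	}
}
```

Piping the debug output above through this should leave just the "attaching socket filter: operation not permitted" entry from ebpf.ProcessTracer.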
Prototype to remove SYS_ADMIN: https://github.com/dashpole/beyla/pull/1
We currently have this securityContext for the beyla daemonset:
https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/b159122a1b3ac396ada5ccc5ac07c2a6545b9790/recipes/beyla/beyla-daemonset.yaml#L40-L42
We should try to reduce privileges in a way that still works on GKE.
Some potentially helpful links: