GoogleCloudPlatform / opentelemetry-operator-sample

Toolbox of recipes, samples, and guides for using the OpenTelemetry Operator on Google Cloud
Apache License 2.0

Reduce privileges of beyla sample #78

Closed by dashpole 7 months ago

dashpole commented 9 months ago

We currently have this securityContext for the beyla daemonset:

https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/b159122a1b3ac396ada5ccc5ac07c2a6545b9790/recipes/beyla/beyla-daemonset.yaml#L40-L42

We should try to reduce privileges in a way that still works on GKE.

Some potentially helpful links:

dashpole commented 9 months ago

I was able to remove privileged: true in https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/pull/80. I also tried removing the SYS_ADMIN capability, but that resulted in this error:

time=2024-02-20T19:52:30.855Z level=ERROR msg="Beyla couldn't find target process" error="couldn't start Process Finder: can't instantiate discovery.ProcessFinder pipeline: instantiating terminal instance \"TraceAttacher\": can't mount BPF filesystem: operation not permitted"

That was when I ran with:

            capabilities:
              add:
                - all
              drop:
                - SYS_ADMIN

That seems to imply that, even with every other capability, Beyla can't mount the BPF filesystem without SYS_ADMIN.

dashpole commented 9 months ago

I also tried to work around this limitation by having a privileged init container mount the BPF filesystem, similar to https://github.com/cilium/cilium/pull/14446/files#diff-264b5e646aa5ad3db682a4a0a9cd4b4cbbae238d88b033d340e901060c89394aR447, but I still got the same error.

       initContainers:
         # Mount the bpf fs if it is not mounted. We will perform this task
         # from a privileged container because the mount propagation bidirectional
         # only works from privileged containers.
         - name: mount-bpf-fs
           image: grafana/beyla:1.2.0
           args:
           - 'mount | grep "/sys/fs/bpf type bpf" || mount -t bpf bpf /sys/fs/bpf'
           command:
           - /bin/bash
           - -c
           - --
           securityContext:
             privileged: true
           volumeMounts:
           - name: bpffs
             mountPath: /sys/fs/bpf
             mountPropagation: Bidirectional

I also set the mountPropagation to HostToContainer in the bpffs mount for the main beyla container.
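
For reference, that change to the main container would look roughly like the fragment below. This is a sketch rather than the manifest that was actually applied: the container name beyla is an assumption, and the bpffs volume is assumed to be the same one the init container above mounts at /sys/fs/bpf.

```yaml
      containers:
        - name: beyla              # assumed container name, not shown in this thread
          volumeMounts:
            - name: bpffs          # same volume the mount-bpf-fs init container uses
              mountPath: /sys/fs/bpf
              # HostToContainer lets this container see mounts created on the
              # host under this path, without propagating its own mounts back.
              # Unlike Bidirectional, it does not require a privileged container.
              mountPropagation: HostToContainer
```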

I still got this error:

time=2024-02-20T19:46:34.036Z level=ERROR msg="Beyla couldn't find target process" error="couldn't start Process Finder: can't instantiate discovery.ProcessFinder pipeline: instantiating terminal instance \"TraceAttacher\": can't mount BPF filesystem: operation not permitted"

dashpole commented 9 months ago

I suspect this might be because Cilium always calls unix.Mount() on /sys/fs/bpf itself, whereas Beyla tries to mount a sub-directory that is unique to its process:

time=2024-02-20T21:18:06.495Z level=DEBUG msg="mounting BPF map pinning" component=discover.TraceAttacher path=/sys/fs/bpf/beyla-1327249

That might be to allow multiple Beyla instances to run on the same host, but that isn't as useful with a DaemonSet, which already runs one instance per node.

aabmass commented 8 months ago

I'm seeing a new error when upgrading to Beyla 1.3.3 without privileged: true: https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/pull/86/files#r1513123975

aabmass commented 8 months ago

To be clear, no telemetry is produced. I added a TODO to address the problem: https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/1b98711162e6ce66f9d5b4b73e001451349f2b2a/recipes/beyla-golden-signals/beyla-daemonset.yaml#L46-L47

Here are the debug logs around the error:

time=2024-03-05T16:27:06.863Z level=INFO msg="system wide instrumentation. Creating a single instrumenter" component=discover.TraceAttacher
time=2024-03-05T16:27:06.863Z level=DEBUG msg="running tracer for new process" component=beyla.Instrumenter inode=287607 pid=3377 exec=/fluent-bit/bin/fluent-bit 
time=2024-03-05T16:27:06.863Z level=DEBUG msg="starting process tracer" component=ebpf.ProcessTracer path=/fluent-bit/bin/fluent-bit pid=3377
time=2024-03-05T16:27:06.863Z level=DEBUG msg="loading eBPF program" component=ebpf.ProcessTracer program=*httpfltr.Tracer PinPath=/sys/fs/bpf/beyla-256592 pid=3377 cmd=/fluent-bit/bin/fluent-bit                 
time=2024-03-05T16:27:06.923Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_accept probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_accept4)#79}"
time=2024-03-05T16:27:06.955Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_accept4 probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_accept4)#79}"
time=2024-03-05T16:27:06.992Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sock_alloc probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sock_alloc)#78}"
time=2024-03-05T16:27:07.015Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_connect probes="{Required:true Start:Kprobe(kprobe_tcp_connect)#60 End:<nil>}"
time=2024-03-05T16:27:07.029Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_recvmsg probes="{Required:true Start:Kprobe(kprobe_tcp_recvmsg)#65 End:Kprobe(kretprobe_tcp_recvmsg)#82}"
time=2024-03-05T16:27:07.064Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_clone3 probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_clone)#80}"
time=2024-03-05T16:27:07.084Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_exit probes="{Required:true Start:Kprobe(kprobe_sys_exit)#58 End:<nil>}"
time=2024-03-05T16:27:07.101Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_rcv_established probes="{Required:true Start:Kprobe(kprobe_tcp_rcv_established)#63 End:<nil>}"
time=2024-03-05T16:27:07.114Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_connect probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_connect)#81}"
time=2024-03-05T16:27:07.150Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=tcp_sendmsg probes="{Required:true Start:Kprobe(kprobe_tcp_sendmsg)#76 End:<nil>}"
time=2024-03-05T16:27:07.164Z level=DEBUG msg="going to add kprobe to function" component=ebpf.Instrumenter probes=kprobes function=sys_clone probes="{Required:true Start:<nil> End:Kprobe(kretprobe_sys_clone)#80}"
time=2024-03-05T16:27:07.186Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/fluent-bit/bin/fluent-bit pid=3377 error="attaching socket filter: operation not permitted"

dashpole commented 7 months ago

Prototype to remove SYS_ADMIN: https://github.com/dashpole/beyla/pull/1