cilium / tetragon

eBPF-based Security Observability and Runtime Enforcement
https://tetragon.io
Apache License 2.0
3.56k stars 348 forks source link

Tetragon not getting further than "Loading registered BPF probe" #1788

Open Nabsku opened 10 months ago

Nabsku commented 10 months ago

What happened?

  1. Running Raspberry Pis with k3s, the pods don't completely come up
  2. I recompiled the default kernel from the RaspiFoundation with BTF support enabled
  3. Pods start loading "/var/lib/tetragon/bpf_execve_event_v61.o" and then just get killed by the livenessProbe
  4. Increasing the time for the livenessProbe yields no result (thought the Pis were too slow)

Config is the default in the helm chart, just enabled the ServiceMonitor.

Tetragon Version

1.0.0

Kernel Version

Linux node2 6.1.63-v8+ #3 SMP PREEMPT Thu Nov 23 16:25:54 UTC 2023 aarch64 GNU/Linux

Kubernetes Version

v1.28.3+k3s2

Bugtool

bugreport.tar.gz

Relevant log output

time="2023-11-23T20:13:49Z" level=info msg="Loaded config from directory" config-dir=/etc/tetragon/tetragon.conf.d/
time="2023-11-23T20:13:49Z" level=info msg="Starting gops server" addr="localhost:8118"
time="2023-11-23T20:13:49Z" level=info msg="Starting tetragon" version=v1.0.0
time="2023-11-23T20:13:49Z" level=info msg="config settings" config="map[bpf-lib:/var/lib/tetragon/ btf: config-dir:/etc/tetragon/tetragon.conf.d/ cpuprofile: data-cache-size:1024 debug:false disable-kprobe-multi:false enable-export-aggregation:false enable-k8s-api:true enable-msg-handling-latency:false enable-pid-set-filter:false enable-pod-info:false enable-policy-filter:true enable-policy-filter-debug:false enable-process-ancestors:true enable-process-cred:false enable-process-ns:false event-queue-size:10000 export-aggregation-buffer-size:10000 export-aggregation-window-size:15s export-allowlist:{\"event_set\":[\"PROCESS_EXEC\", \"PROCESS_EXIT\", \"PROCESS_KPROBE\", \"PROCESS_UPROBE\", \"PROCESS_TRACEPOINT\"]} export-denylist:{\"health_check\":true}\n{\"namespace\":[\"\", \"cilium\", \"kube-system\"]} export-file-compress:false export-file-max-backups:5 export-file-max-size-mb:10 export-file-perm:600 export-file-rotation-interval:0s export-filename:/var/run/cilium/tetragon/tetragon.log export-rate-limit:-1 expose-kernel-addresses:false field-filters:{} force-large-progs:false force-small-progs:false gops-address:localhost:8118 k8s-kubeconfig-path: kernel: kmods:[] log-format:text log-level:debug memprofile: metrics-label-filter:namespace,workload,pod,binary metrics-server::2112 netns-dir:/var/run/docker/netns/ pprof-addr: process-cache-size:65536 procfs:/procRoot rb-queue-size:65535 rb-size:0 rb-size-total:0 release-pinned-bpf:true server-address:localhost:54321 tracing-policy: tracing-policy-dir:/etc/tetragon/tetragon.tp.d verbose:0]"
time="2023-11-23T20:13:50Z" level=info msg="BPF detected features: override_return: false, buildid: true, kprobe_multi: false, fmodret: false"
time="2023-11-23T20:13:50Z" level=debug msg="Detected mounted BPF filesystem at /sys/fs/bpf"
time="2023-11-23T20:13:50Z" level=info msg="BPF: successfully released pinned BPF programs and maps" bpf-dir=/sys/fs/bpf/tetragon
time="2023-11-23T20:13:50Z" level=info msg="Enabling policy filtering"
time="2023-11-23T20:13:50Z" level=info msg="BTF discovery: default kernel btf file found" btf-file=/sys/kernel/btf/vmlinux
time="2023-11-23T20:13:50Z" level=info msg="sensor controller waiting on channel"
time="2023-11-23T20:13:50Z" level=info msg="Starting metrics server" addr=":2112"
time="2023-11-23T20:13:50Z" level=info msg="Registering pod delete handler for metrics"
time="2023-11-23T20:13:50Z" level=info msg="Cgroup mode detection succeeded" cgroup.fs=/sys/fs/cgroup cgroup.mode="Unified mode (Cgroupv2)"
time="2023-11-23T20:13:50Z" level=debug msg="Cgroup BPF helpers will run in Cgroupv2 mode or fallback to raw Cgroup on errors" cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=debug msg="Cgroup available controllers" cgroup.controllers="[cpuset cpu cpuacct blkio memory devices freezer net_cls perf_event net_prio pids]" cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Supported cgroup controller 'memory' is active on the system" cgroup.controller.hierarchyID=0 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Supported cgroup controller 'pids' is active on the system" cgroup.controller.hierarchyID=0 cgroup.controller.index=10 cgroup.controller.name=pids cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Supported cgroup controller 'cpuset' is active on the system" cgroup.controller.hierarchyID=0 cgroup.controller.index=0 cgroup.controller.name=cpuset cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Cgroupv2 supported controllers detected successfully" cgroup.controllers="[cpuset cpu io memory pids]" cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Cgroupv2 controller 'memory' will be used as a fallback for the default hierarchy" cgroup.controller.hierarchyID=0 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs=/sys/fs/cgroup
time="2023-11-23T20:13:50Z" level=info msg="Cgroupv2 hierarchy validated successfully" cgroup.fs=/sys/fs/cgroup cgroup.path=/sys/fs/cgroup/init.scope
time="2023-11-23T20:13:50Z" level=info msg="Deployment mode detection succeeded" cgroup.fs=/sys/fs/cgroup deployment.mode=Kubernetes
time="2023-11-23T20:13:50Z" level=info msg="Updated TetragonConf map successfully" NSPID=1 cgroup.controller.hierarchyID=0 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs.magic=Cgroupv2 confmap-update=tg_conf_map deployment.mode=Kubernetes log.level=debug
time="2023-11-23T20:13:50Z" level=info msg="Enabling Kubernetes API"
time="2023-11-23T20:13:50Z" level=info msg="Waiting for required CRDs" crds="map[tracingpolicies.cilium.io:{} tracingpoliciesnamespaced.cilium.io:{}]"
time="2023-11-23T20:13:51Z" level=info msg="Found CRD" crd=tracingpolicies.cilium.io
time="2023-11-23T20:13:51Z" level=info msg="Found CRD" crd=tracingpoliciesnamespaced.cilium.io
time="2023-11-23T20:13:51Z" level=info msg="Found all the required CRDs"
time="2023-11-23T20:13:51Z" level=info msg="registering policyfilter pod handlers"
time="2023-11-23T20:13:51Z" level=info msg="adding /procRoot/1/root/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice to cgroup pod directories"
time="2023-11-23T20:13:51Z" level=info msg="adding /procRoot/1/root/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice to cgroup pod directories"
time="2023-11-23T20:13:52Z" level=info msg="Initialized pod cache" num_pods=21
time="2023-11-23T20:13:52Z" level=info msg="Initialized service cache" num_services=64
time="2023-11-23T20:13:52Z" level=info msg="Initialized pod info cache" num_pod_info=0
time="2023-11-23T20:13:52Z" level=info msg="Disabling Cilium API"
time="2023-11-23T20:13:52Z" level=info msg="Starting process manager" enableCilium=false enableK8s=true enableProcessCred=false enableProcessNs=false
time="2023-11-23T20:13:52Z" level=info msg="Starting gRPC server" address="localhost:54321" protocol=tcp
time="2023-11-23T20:13:52Z" level=debug msg="Configured field filters" fieldFilters="[]"
time="2023-11-23T20:13:52Z" level=info msg="Starting JSON exporter" logger="&{/var/run/cilium/tetragon/tetragon.log 10 0 5 false false -rw------- 0 <nil> {0 0} <nil> {0 {0 0}}}" request="allow_list:{event_set:PROCESS_EXEC  event_set:PROCESS_EXIT  event_set:PROCESS_KPROBE  event_set:PROCESS_UPROBE  event_set:PROCESS_TRACEPOINT}  deny_list:{health_check:{value:true}}  deny_list:{namespace:\"\"  namespace:\"cilium\"  namespace:\"kube-system\"}  field_filters:{}"
time="2023-11-23T20:13:52Z" level=debug msg="Received a GetEvents request" events.aggregation_options="<nil>" events.allow_list="[event_set:PROCESS_EXEC  event_set:PROCESS_EXIT  event_set:PROCESS_KPROBE  event_set:PROCESS_UPROBE  event_set:PROCESS_TRACEPOINT]" events.deny_list="[health_check:{value:true} namespace:\"\"  namespace:\"cilium\"  namespace:\"kube-system\"]" events.field_filters="[]"
time="2023-11-23T20:13:52Z" level=debug msg="Adding a getEventsListener" getEventsListener="&{0x400297d080}"
time="2023-11-23T20:13:52Z" level=info msg="Exporter configuration" enabled=true fileName=/var/run/cilium/tetragon/tetragon.log
time="2023-11-23T20:13:52Z" level=debug msg="Add listener" listener="&{node1 0x4001bb4190 {0 0} map[0x4000092ea8:{}]}"
time="2023-11-23T20:13:52Z" level=info msg="Successfully detected bpftool path" bpftool=/usr/bin/bpftool
time="2023-11-23T20:13:52Z" level=info msg="BPF: found active BPF resources" bpf-dir=/sys/fs/bpf/tetragon pinned-bpf="[policy_filter_maps tg_conf_map]"
time="2023-11-23T20:13:52Z" level=info msg="BTF file: using metadata file" metadata=/sys/kernel/btf/vmlinux
time="2023-11-23T20:13:52Z" level=info msg="Loading sensor" name=__base__
time="2023-11-23T20:13:52Z" level=info msg="Loading kernel version 6.1.63"
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=bpf_exit.o
time="2023-11-23T20:13:52Z" level=debug msg="Candidate bpf file does not exist" file=bpf_exit.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_exit.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=bpf_fork.o
time="2023-11-23T20:13:52Z" level=debug msg="Candidate bpf file does not exist" file=bpf_fork.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_fork.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Candidate bpf file does not exist" file=bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Candidate bpf file does not exist" file=bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_bprm_commit_creds.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Checking for bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=debug msg="Found bpf file" file=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=info msg="Started watching tracing policies"
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=execve_map path=/sys/fs/bpf/tetragon/execve_map sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=tg_execve_joined_info_map path=/sys/fs/bpf/tetragon/tg_execve_joined_info_map sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=execve_map_stats path=/sys/fs/bpf/tetragon/execve_map_stats sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=tg_execve_joined_info_map_stats path=/sys/fs/bpf/tetragon/tg_execve_joined_info_map_stats sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=execve_calls path=/sys/fs/bpf/tetragon/execve_calls sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=names_map path=/sys/fs/bpf/tetragon/names_map sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=tcpmon_map path=/sys/fs/bpf/tetragon/tcpmon_map sensor=__base__
time="2023-11-23T20:13:52Z" level=info msg="tetragon, map loaded." map=tg_conf_map path=/sys/fs/bpf/tetragon/tg_conf_map sensor=__base__
time="2023-11-23T20:13:52Z" level=debug msg=observerLoadInstance/var/lib/tetragon/bpf_exit.o393535 kern_version=393535 prog=/var/lib/tetragon/bpf_exit.o
time="2023-11-23T20:13:52Z" level=info msg="Loading BPF program" Program=/var/lib/tetragon/bpf_exit.o Type=kprobe
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'exit_heap_map' not found, map is not shared!\n" prog=kprobe/acct_process
time="2023-11-23T20:13:52Z" level=debug msg="Programs (/var/lib/tetragon/bpf_exit.o):"
time="2023-11-23T20:13:52Z" level=debug msg=" - 1248: event_exit - [4556 4569 4562 4558]"
time="2023-11-23T20:13:52Z" level=debug msg="Maps (/var/lib/tetragon/bpf_exit.o):"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4556: execve_map"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4558: execve_map_stats"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4562: tcpmon_map"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4569: exit_heap_map"
time="2023-11-23T20:13:52Z" level=debug msg="BPF prog was loaded" label=kprobe/acct_process prog=/var/lib/tetragon/bpf_exit.o
time="2023-11-23T20:13:52Z" level=debug msg=observerLoadInstance/var/lib/tetragon/bpf_fork.o393535 kern_version=393535 prog=/var/lib/tetragon/bpf_fork.o
time="2023-11-23T20:13:52Z" level=info msg="Loading BPF program" Program=/var/lib/tetragon/bpf_fork.o Type=kprobe
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'execve_val' not found, map is not shared!\n" prog=kprobe/wake_up_new_task
time="2023-11-23T20:13:52Z" level=debug msg="Programs (/var/lib/tetragon/bpf_fork.o):"
time="2023-11-23T20:13:52Z" level=debug msg=" - 1250: event_wake_up_new_task - [4556 4577 4558 4562]"
time="2023-11-23T20:13:52Z" level=debug msg="Maps (/var/lib/tetragon/bpf_fork.o):"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4556: execve_map"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4558: execve_map_stats"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4562: tcpmon_map"
time="2023-11-23T20:13:52Z" level=debug msg=" - 4577: execve_val"
time="2023-11-23T20:13:52Z" level=debug msg="BPF prog was loaded" label=kprobe/wake_up_new_task prog=/var/lib/tetragon/bpf_fork.o
time="2023-11-23T20:13:52Z" level=debug msg=observerLoadInstance/var/lib/tetragon/bpf_execve_event_v61.o393535 kern_version=393535 prog=/var/lib/tetragon/bpf_execve_event_v61.o
time="2023-11-23T20:13:52Z" level=info msg="Loading registered BPF probe" Program=/var/lib/tetragon/bpf_execve_event_v61.o Type=execve
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'cwd_read_v61' not found, map is not shared!\n" prog=tracepoint/sys_execve
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'execve_msg_heap_map' not found, map is not shared!\n" prog=tracepoint/sys_execve
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'data_heap' not found, map is not shared!\n" prog=tracepoint/sys_execve
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'execve_heap' not found, map is not shared!\n" prog=tracepoint/sys_execve
time="2023-11-23T20:13:52Z" level=debug msg="pin file for map 'buffer_heap_map' not found, map is not shared!\n" prog=tracepoint/sys_execve
time="2023-11-23T20:13:52Z" level=debug msg="Removing a getEventsListener" getEventsListener="&{0x400297d080}"
time="2023-11-23T20:17:49Z" level=info msg="Received signal terminated, shutting down..."

Anything else?

Willing to help! Not in a hurry to get this working but I want to understand whats going wrong here and potentially help with documenting my steps for other people.

kkourt commented 9 months ago

Thanks for the report.

Are you deploying this via helm? If so, could you provide the values?

Nabsku commented 9 months ago

Thanks for the report.

Are you deploying this via helm? If so, could you provide the values?

Hi! As I wrote above, I am using the default values, just enabling the ServiceMonitor. I believe it is an issue with kernel settings but I can't trace it further down.

Running bpftool feature probe full yields the following: probe.txt

kkourt commented 9 months ago

Thanks for the report. Are you deploying this via helm? If so, could you provide the values?

Hi! As I wrote above, I am using the default values, just enabling the ServiceMonitor. I believe it is an issue with kernel settings but I can't trace it further down.

Running bpftool feature probe full yields the following: probe.txt

Does the same thing happen if the ServiceMonitor is not enabled?

I'm not seeing any kernel issues in the log. Maybe I'm missing something, but everything looks to work as expected.

The livenessProbe just ensures that the agent is running: https://github.com/cilium/tetragon/blob/d2d216d61d3e8c434b355602dd5da06f071520a7/install/kubernetes/templates/_container_tetragon.tpl#L70-L79

Not sure why it does not work though.

kkourt commented 9 months ago

Can you please also take a sysdump as described in https://tetragon.io/docs/troubleshooting/#automatic-log--state-collection? It provides some additional information compared to bugtool alone.

Nabsku commented 9 months ago

Can you please also take a sysdump as described in https://tetragon.io/docs/troubleshooting/#automatic-log--state-collection? It provides some additional information compared to bugtool alone.

I have created the sysdump - is there any place I can upload it to? Don't want to share that in public. :) Installing via Helm with default values and no ServiceMonitor sadly changed nothing. I believe it has to be kernel related but I can't find a hint anywhere (another node, non-pi based, works fine).