Falco 0.32.1: `Runtime error: failed to open event raw_syscalls/sys_enter. Exiting.`

PhilipSchmid commented 2 years ago

Hi all

Describe the bug

We tried upgrading from Falco 0.30.0 (Helm chart: 1.16.3) to 0.32.1 (Helm chart: 2.0.4). Unfortunately, the new Falco pods now crash when eBPF is used (with the kernel module it works just fine): Runtime error: failed to open event raw_syscalls/sys_enter. Exiting.

Startup of Falco 0.30.0 (was always working):

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.30.0, driver version=3aa7a83bf7b9e6229a3824e3fd1f4452d1e95cb4
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
* Trying to download a prebuilt eBPF probe from https://download.falco.org/driver/3aa7a83bf7b9e6229a3824e3fd1f4452d1e95cb4/falco_ubuntu-generic_4.15.0-176-generic_185.o
curl: (6) Could not resolve host: download.falco.org
Unable to find a prebuilt falco eBPF probe
* Trying to compile the eBPF probe (falco_ubuntu-generic_4.15.0-176-generic_185.o)
* eBPF probe located in /root/.falco/falco_ubuntu-generic_4.15.0-176-generic_185.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
2022-06-15T14:06:25+0000: Falco version 0.30.0 (driver version 3aa7a83bf7b9e6229a3824e3fd1f4452d1e95cb4)
2022-06-15T14:06:25+0000: Falco initialized with configuration file /etc/falco/falco.yaml
2022-06-15T14:06:25+0000: Loading rules from file /etc/falco/default_rules/falco_rules.yaml:
2022-06-15T14:06:25+0000: Loading rules from file /etc/falco/default_k8s_rules/k8s_audit_rules.yaml:
....
2022-06-15T14:06:36+0000: Starting internal webserver, listening on port 8765
2022-06-15T14:06:36+0000: gRPC server threadiness equals to 4
2022-06-15T14:06:36+0000: Starting gRPC server at unix:///var/run/falco/falco.sock

Startup with Falco 0.32.1 (not working) - falco container (image: falcosecurity/falco-no-driver:0.32.1):

2022-07-25T12:46:11+0000: Falco version 0.32.1
2022-07-25T12:46:11+0000: Falco initialized with configuration file /etc/falco/falco.yaml
2022-07-25T12:46:11+0000: Loading rules from file /etc/falco/default_rules/falco_rules.yaml:
....
2022-07-25T12:46:17+0000: gRPC server threadiness equals to 4
2022-07-25T12:46:17+0000: Starting internal webserver, listening on port 8765
2022-07-25T12:46:17+0000: Starting gRPC server at unix:///var/run/falco/falco.sock
2022-07-25T12:46:17+0000: Unable to load the driver.
2022-07-25T12:46:17+0000: Runtime error: failed to open event raw_syscalls/sys_enter. Exiting.
terminate called without an active exception

Startup with Falco 0.32.1 (looks kind of OK I guess, at least quite similar to the old Falco 0.30.0 version) - falco-driver-loader init container (image: falcosecurity/falco-driver-loader:0.32.1):

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
mount: /sys/kernel/debug: cannot mount nodev read-only.
* Trying to download a prebuilt eBPF probe from https://download.falco.org/driver/2.0.0%2Bdriver/falco_ubuntu-generic_4.15.0-188-generic_199.o
curl: (6) Could not resolve host: download.falco.org
Unable to find a prebuilt falco eBPF probe
* Trying to compile the eBPF probe (falco_ubuntu-generic_4.15.0-188-generic_199.o)
* eBPF probe located in /root/.falco/falco_ubuntu-generic_4.15.0-188-generic_199.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o

How to reproduce it

Relevant Falco 0.30.0 (Helm chart: 1.16.3) values (working):

containerd:
  socket: /var/vcap/sys/run/containerd/containerd.sock
docker:
  socket: /var/vcap/sys/run/docker/docker.sock
ebpf:
  enabled: true
extraArgs:
- --disable-cri-async
image:
  pullPolicy: Always
  registry: registry.example.com/falco-project

Relevant Falco 0.32.1 (Helm chart: 2.0.4) values (not working):

controller:
  kind: daemonset
collectors:
  docker:
    enabled: true
    socket: /var/vcap/sys/run/docker/docker.sock
  containerd:
    enabled: true
    socket: /var/vcap/sys/run/containerd/containerd.sock
  crio:
    enabled: false
driver:
  kind: ebpf
  ebpf:
    leastPrivileged: false
  loader:
    enabled: true
    initContainer:
      enabled: true
      image:
        pullPolicy: Always
        registry: registry.example.com/falco-project
extra:
  args:
  - --disable-cri-async
image:
  pullPolicy: Always
  registry: registry.example.com/falco-project

Expected behaviour

Falco version 0.32.1 should start normally with eBPF enabled just like it did in version 0.30.0.

Environment

Falco version: 0.32.1
K8s version: 1.23.7
K8s distribution: TKGI (1.14.1-build.16)
OS: Ubuntu 16.04.7 LTS
Kernel: 4.15.0-188-generic
Installation method: Helm chart

Additional information

A quick ls -la /root/.falco on the affected Falco pod shows that the /root/.falco/falco-bpf.o symlink and the /root/.falco/falco_ubuntu-generic_4.15.0-188-generic_199.o probe file, both created by the falco-driver-loader initContainer, are there:

$ k exec -it falco-syscall-49hqg -- ls -la /root/.falco
Defaulted container "falco" out of: falco, falco-driver-loader (init)
total 3952
drwxrwxrwx 2 root root    4096 Jul 25 13:42 .
drwx------ 1 root root    4096 Jul 25 13:42 ..
lrwxrwxrwx 1 root root      58 Jul 25 13:42 falco-bpf.o -> /root/.falco/falco_ubuntu-generic_4.15.0-188-generic_199.o
-rw-r--r-- 1 root root 4037808 Jul 25 13:42 falco_ubuntu-generic_4.15.0-188-generic_199.o

What I also noticed in the Pod spec so far is that both Falco Pod specs have FALCO_BPF_PROBE set to "".

The old Falco 0.30.0 Pod spec:

    env:
    - name: FALCO_K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: FALCO_BPF_PROBE
    - name: TZ
      value: utc

The new Falco 0.32.1 Pod spec:

    env:
    - name: FALCO_K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: SKIP_DRIVER_LOADER
    - name: FALCO_BPF_PROBE

Interestingly, when I set driver.ebpf.path to /root/.falco/falco-bpf.o, the environment variable FALCO_BPF_PROBE is assigned this value, but the Falco Pod still crashes with the same error as mentioned above, and the falco-driver-loader initContainer still shows the same log.
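
To rule out a dangling symlink or a wrong probe path, the lookup can be sketched as a small shell helper. This is only a sketch: `resolve_bpf_probe` is a hypothetical name, and the fallback to `$HOME/.falco/falco-bpf.o` for an empty `FALCO_BPF_PROBE` is an assumption based on where falco-driver-loader places the symlink in the logs above.

```shell
# Hypothetical helper: print the real file behind the eBPF probe path.
# Assumption: an empty or unset FALCO_BPF_PROBE falls back to
# $HOME/.falco/falco-bpf.o, the symlink created by falco-driver-loader.
resolve_bpf_probe() {
    probe="${FALCO_BPF_PROBE:-$HOME/.falco/falco-bpf.o}"
    # -e follows symlinks, so a dangling link is reported as missing.
    if [ ! -e "$probe" ]; then
        echo "probe not found: $probe" >&2
        return 1
    fi
    # Print the canonical path of the symlink target.
    readlink -f "$probe"
}
```

The `ls -la` output above already shows that the symlink target exists, so a check like this would be expected to pass here, which points away from the probe file itself.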

Also, when I override the default image falcosecurity/falco-no-driver with falcosecurity/falco, the Falco Pod starts (the initContainer falco-driver-loader still shows the same log):

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
Detected an unsupported target system, please get in touch with the Falco community
2022-07-25T13:25:26+0000: Falco version 0.32.1
2022-07-25T13:25:26+0000: Falco initialized with configuration file /etc/falco/falco.yaml
2022-07-25T13:25:26+0000: Loading rules from file /etc/falco/default_rules/falco_rules.yaml:
....
2022-07-25T13:25:32+0000: gRPC server threadiness equals to 4
2022-07-25T13:25:32+0000: Starting internal webserver, listening on port 8765
2022-07-25T13:25:32+0000: Starting gRPC server at unix:///var/run/falco/falco.sock

In this last scenario it does not make any difference whether I specify driver.ebpf.path or not.

Thanks & regards, Philip

Edit: The falco container has securityContext set to privileged: true. So I don't think this is a permission issue.

FedeDP commented 2 years ago

Hi! Thanks for opening this issue! On which architecture are you running it? I guess x86_64 of course :)

/cc @Andreagit97

PhilipSchmid commented 2 years ago

Hi @FedeDP, yes, it's x86_64 based on VMware VMs.

FedeDP commented 2 years ago

From the above output, it seems like your kernel is not offering the sys_enter tracepoint:

2022-07-25T12:46:17+0000: Runtime error: failed to open event raw_syscalls/sys_enter. Exiting.

It should be located at /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/; can you confirm it is missing?

PhilipSchmid commented 2 years ago

On the VM, this tracepoint seems to be available:

# ls -la /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/
total 0
drwxr-x--- 2 root root 0 Jul 12 13:02 .
drwxr-x--- 4 root root 0 Jul 12 13:02 ..
-rw-r--r-- 1 root root 0 Jul 12 13:02 enable
-rw-r--r-- 1 root root 0 Jul 12 13:02 filter
-r--r--r-- 1 root root 0 Jul 12 13:02 format
-r--r--r-- 1 root root 0 Jul 12 13:02 hist
-r--r--r-- 1 root root 0 Jul 12 13:02 id
-rw-r--r-- 1 root root 0 Jul 12 13:02 trigger

Nevertheless, thanks to your hint and the issue https://github.com/falcosecurity/falco/issues/1071, I just found that debugfs was not mounted... 😅 When I add the following values to values.yaml, it magically works:

mounts:
  volumes:
  - hostPath:
      path: "/sys/kernel/debug"
    name: debugfs
  volumeMounts:
  - name: debugfs
    mountPath: "/sys/kernel/debug"
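
One way to verify that the mount took effect is to look for a debugfs entry in the container's mounts table. A minimal sketch, assuming the table follows the usual /proc/mounts layout (device, mount point, filesystem type, options); `debugfs_mounted` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if a debugfs filesystem is mounted at the
# given mount point, according to a mounts table in /proc/mounts format.
# Both arguments are optional so the check can be pointed at a test file.
debugfs_mounted() {
    mounts_file="${1:-/proc/mounts}"
    mount_point="${2:-/sys/kernel/debug}"
    # Field 2 is the mount point, field 3 the filesystem type.
    awk -v mp="$mount_point" \
        '$2 == mp && $3 == "debugfs" { found = 1 } END { exit !found }' \
        "$mounts_file"
}
```

Inside the falco container, `debugfs_mounted && echo mounted` should succeed once the hostPath volume above is in place, and fail without it.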

What exactly is debugfs, and what does Falco need it for? I didn't even find such a mount in the old Falco Helm chart versions. Was it once used and then removed?

If it's really important, we should probably consider creating a PR in the charts repo to extend the mounts of the Falco Pod spec to also get this volume mounted. (If you agree and tell me the conditions, I could create such a PR.)

Thanks & regards, Philip

Andreagit97 commented 2 years ago

Hi @PhilipSchmid, thank you for finding and actually solving the issue :) We need debugfs to load our tracepoint in the BPF probe, as you can see here. As for the fix, we will think a little bit about the best way to address this issue and come back to you with a solution :)
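
The link between debugfs and the error message can be illustrated with a small sketch. It assumes the loader looks up each tracepoint through an `id` file under /sys/kernel/debug/tracing/events/ and fails when that file cannot be opened. The `/id` suffix, the function name, and the error wording are assumptions modeled on the log in this issue, not a copy of Falco's code:

```shell
# Sketch only, not Falco's code: print the numeric id of a tracepoint by
# reading its `id` file below the tracing events root.
tracepoint_id() {
    event="$1"                                    # e.g. raw_syscalls/sys_enter
    root="${2:-/sys/kernel/debug/tracing/events}" # only present if debugfs is mounted
    id_file="$root/$event/id"
    if [ ! -r "$id_file" ]; then
        # Without debugfs the whole tree is absent, so the lookup fails here.
        echo "failed to open event $event" >&2
        return 1
    fi
    cat "$id_file"
}
```

With debugfs mounted, `tracepoint_id raw_syscalls/sys_enter` would print a number. In a container without the mount it would fail in the same way as the Falco log above, even though the tracepoint exists on the host.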

PhilipSchmid commented 2 years ago

Hi @Andreagit97, thanks for your quick response!

I have a question that just came to mind: how has Falco with eBPF ever worked like this? I mean, the code snippet you mentioned, strcpy(buf, "/sys/kernel/debug/tracing/events/");, is 4 years old according to git blame, and I also didn't find any mount of /sys/kernel/debug in past Falco Helm chart versions. What has actually changed? What am I missing?

Regards, Philip

Andreagit97 commented 2 years ago

Hey @PhilipSchmid, considering your initial statement:

We tried upgrading from Falco 0.30.0 (Helm chart: 1.16.3) to 0.32.1 (Helm chart: 2.0.4).

I think the issue is that with Falco 0.32.1 we use the falco-no-driver image by default, and this image doesn't mount debugfs, whereas with Falco 0.30.0 the default image was falcosecurity/falco, where falco-driver-loader correctly mounts debugfs.
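
The mount step that falco-driver-loader logs as "* Mounting debugfs" can be sketched as follows. This is a sketch under assumptions rather than the script's actual code: the `mount -t debugfs nodev` invocation is inferred from the `cannot mount nodev` message in the init container log, and `ensure_debugfs` and `MOUNT_CMD` are hypothetical names (the latter exists only to allow a dry run).

```shell
# Sketch only: mount debugfs at the given mount point unless the mounts
# table already lists it. Needs a privileged container to actually mount.
ensure_debugfs() {
    mount_point="${1:-/sys/kernel/debug}"
    mounts_file="${2:-/proc/mounts}"
    # -F: match the "<mount point> debugfs" fields as a fixed string.
    if grep -qF " $mount_point debugfs " "$mounts_file"; then
        echo "debugfs already mounted at $mount_point"
    else
        echo "* Mounting debugfs"
        # MOUNT_CMD is a hypothetical override, e.g. MOUNT_CMD=true for a dry run.
        ${MOUNT_CMD:-mount} -t debugfs nodev "$mount_point"
    fi
}
```

A mount made this way lives in the mount namespace of the container that ran it, which is consistent with the observation in this thread that the falco-no-driver container does not see debugfs unless the Pod spec mounts it explicitly.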

PhilipSchmid commented 2 years ago

Ahh, got it. Thanks for the clarification. Basically, falcosecurity/falco mounted debugfs directly (privileged Pod), and that's why I didn't find such a mount inside the Pod spec 😄.

Thanks & regards, Philip

PhilipSchmid commented 2 years ago

Hi @Andreagit97,

You mentioned you will address this issue and think about a solution. How long do you think it will take? How complex is it? Please don't get me wrong, I don't want to put pressure on you. I'm just wondering because, if I know that, we can decide internally whether we should roll out the new Falco version with the mentioned workaround within the next few days (which is totally fine for us) or wait for your proper fix 😀.

Many thanks in advance!

Regards, Philip

Andreagit97 commented 2 years ago

Hey @PhilipSchmid, don't worry, I understood what you mean :) If we can fix it directly in the Helm chart, it will be released with Falco 0.33, but we have not discussed it yet. I will come back to you in the next few days with more info :)

P.S. Falco 0.33 will be released around the end of September :)

Andreagit97 commented 1 year ago

Hey @PhilipSchmid, we have talked a little bit about possible solutions, and we agreed that your solution is the best one 👇

If it's really important, we should probably consider creating a PR in the charts repo to extend the mounts of the Falco Pod spec to also get this volume mounted. (If you agree and tell me the conditions, I could create such a PR.)

It would be great if you could create this PR! Thank you again for having spotted the bug!

PS: this fix will be included in Falco 0.33 :) !

falcosecurity / falco

Falco 0.32.1: `Runtime error: failed to open event raw_syscalls/sys_enter. Exiting.` #2145