falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

[Legacy eBPF] GKE libscap: bpf_load_program() Operation not permitted #2964

Closed Maximebb closed 9 months ago

Maximebb commented 9 months ago

Describe the bug

We are running Falco on GKE clusters, deployed through the Helm chart. It had been running successfully until last week, when all nodes were patched to the latest 1.25 patch release (1.25.13-gke.200). Since then, all Falco pods have been failing with a permission error:

2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Falco version: 0.36.2 (x86_64)
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Falco initialized with configuration file: /etc/falco/falco.yaml
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Loading rules from file /etc/falco/falco_rules.yaml
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Starting health webserver with threadiness 4, listening on port 8765
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Loaded event sources: syscall
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Enabled event sources: syscall
2023-12-11T10:15:46-05:00   Mon Dec 11 15:15:46 2023: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-bpf.o
2023-12-11T10:15:47-05:00   -- BEGIN PROG LOAD LOG --
2023-12-11T10:15:47-05:00   processed 43798 insns (limit 1000000) max_states_per_insn 1 total_states 4061 peak_states 4061 mark_read 1921
2023-12-11T10:15:47-05:00   
2023-12-11T10:15:47-05:00   -- END PROG LOAD LOG --
2023-12-11T10:15:47-05:00   Mon Dec 11 15:15:47 2023: An error occurred in an event source, forcing termination...
2023-12-11T10:15:47-05:00   Events detected: 0
2023-12-11T10:15:47-05:00   Rule counts by severity:
2023-12-11T10:15:47-05:00   Triggered rules by rule name:
2023-12-11T10:15:47-05:00   Error: libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

How to reproduce it

  1. Deploy a GKE cluster using 1.25.13-gke.200
  2. Deploy falco using the helm chart using these values:
    priorityClassName: system-node-critical
    driver:
      enabled: true
      kind: ebpf
    falcosidekick:
      config:
        customfields: <...>
        slack:
          minimumpriority: warning # emergency|alert|critical|error|warning|notice|informational|debug
          webhookurl: <...>
      enabled: true
    tty: true

Expected behaviour

We expected Falco to be able to run in a privileged context. We confirmed that the process inside the container has the expected capabilities documented here.

Environment

values.yaml
---
priorityClassName: system-node-critical
driver:
  enabled: true
  kind: ebpf
falcosidekick:
  config:
    customfields: <...>
    slack:
      minimumpriority: warning # emergency|alert|critical|error|warning|notice|informational|debug
      webhookurl: <...>
  enabled: true
tty: true

Troubleshooting

Since Falco is supposed to run in a privileged context, we checked the process capabilities in case some of them were somehow not available to Kubernetes.

~ $ cat /proc/1/status | grep Cap
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

It seems to be granting ample permissions, since this decodes to

~ $ sudo capsh --decode=000001ffffffffff
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore

I can spot cap_sys_ptrace, cap_sys_resource, cap_bpf and cap_perfmon
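
For anyone who wants to repeat this check without capsh, here is a minimal Python sketch that decodes the same hex mask. The name table simply follows the order capsh printed above (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore). The REQUIRED set is only the four capabilities called out in this report, not an authoritative list of what Falco needs, and decode_caps is a hypothetical helper name.

```python
# Capability names by bit position, in the order capsh printed them above.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]

# Assumption: the capabilities this report looks for, not an official list.
REQUIRED = {"cap_sys_ptrace", "cap_sys_resource", "cap_bpf", "cap_perfmon"}


def decode_caps(hex_mask):
    """Return the set of capability names whose bit is set in hex_mask."""
    mask = int(hex_mask, 16)
    return {name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)}


# CapEff value taken from /proc/1/status above.
effective = decode_caps("000001ffffffffff")
missing = REQUIRED - effective
print("missing capabilities:", sorted(missing) if missing else "none")
```

With the mask from this report nothing is missing, which is consistent with the capability set not being the cause of the "Operation not permitted" error.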

Maximebb commented 9 months ago

Maybe to add as context, I considered moving to the modern-ebpf driver instead, but it fails at the driver loader looking to download the pre-built module at https://download.falco.org/driver/6.0.1+driver/x86_64/falco_cos_5.15.120+_1.ko (404 not found)

Maximebb commented 9 months ago

One last piece of info: I downgraded a lab environment to test out the previous version. I noticed the upgrade didn't change the OS version, but it did bump up the kernel version from 5.15.107 to 5.15.120.

Previous working nodes

~ $ uname -a
Linux <...> 5.15.107+ #1 SMP Thu Jun 29 09:19:06 UTC 2023 x86_64 AMD EPYC 7B12 AuthenticAMD GNU/Linux
~ $
~ $
~ $ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=b15e582c1dbbf0e6f06747082754e5c5a71ea426
GOOGLE_CRASH_ID=Lakitu
VERSION=101
VERSION_ID=101
BUILD_ID=17162.210.48

Upgraded non-functional nodes

~ $ uname -a
Linux <...> 5.15.120+ #1 SMP Sat Aug 19 09:23:05 UTC 2023 x86_64 AMD EPYC 7B13 AuthenticAMD GNU/Linux
~ $
~ $
~ $ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=ca9810d05350e5d91be95056f0e5a75dd8e727ac
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=101
VERSION_ID=101
BUILD_ID=17162.279.24

Andreagit97 commented 9 months ago

Maybe this issue could help: https://github.com/falcosecurity/falco/issues/2874. I see that the verifier error is the same:

-- BEGIN PROG LOAD LOG --
processed 43798 insns (limit 1000000) max_states_per_insn 1 total_states 4061 peak_states 4061 mark_read 1921

-- END PROG LOAD LOG --
Mon Oct 16 09:06:37 2023: An error occurred in an event source, forcing termination...
Mon Oct 16 09:06:37 2023: Closing event source 'syscall'
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Error: libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

The modern probe should work out of the box as reported here https://github.com/falcosecurity/falco/issues/2874#issuecomment-1771234766

> Maybe to add as context, I considered moving to the modern-ebpf driver instead, but it fails at the driver loader looking to download the pre-built module at https://download.falco.org/driver/6.0.1+driver/x86_64/falco_cos_5.15.120+_1.ko (404 not found)

This is strange, because the modern probe doesn't use the driver-loader. These are the changes required to run the modern-bpf driver: https://github.com/falcosecurity/falco/issues/2874#issuecomment-1771234766

Maximebb commented 9 months ago

I feel bad, I definitely entered the modern-bpf with a typo (extra e). It works with the modern driver perfectly. I suspect the behavior with a typo was to default to the kernel module.
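
For anyone else who hits the same typo, here is a minimal Python sketch of a pre-deploy check on the chart values. The KNOWN_KINDS list is an assumption about what the chart version in use accepts: "ebpf" and "modern-bpf" appear in this thread, while "module" is my guess at the kernel module name. check_driver_kind is a hypothetical helper and not part of Falco or the chart.

```python
import difflib

# Assumption: driver kinds accepted by the chart version in use.
# "ebpf" and "modern-bpf" come from this thread; "module" is a guess.
KNOWN_KINDS = ["module", "ebpf", "modern-bpf"]


def check_driver_kind(values):
    """Return a short message about driver.kind in a parsed values dict."""
    kind = values.get("driver", {}).get("kind")
    if kind is None:
        return "driver.kind is not set"
    if kind in KNOWN_KINDS:
        return f"ok: {kind}"
    # Suggest the closest known value, if any is similar enough.
    hint = difflib.get_close_matches(kind, KNOWN_KINDS, n=1)
    suggestion = f", did you mean '{hint[0]}'?" if hint else ""
    return f"unknown driver.kind '{kind}'{suggestion}"


# The misspelled value from this report.
print(check_driver_kind({"driver": {"enabled": True, "kind": "modern-ebpf"}}))
```

Running a check like this before `helm upgrade` would flag the misspelled value instead of leaving the fallback behavior to be discovered at runtime.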

I'm actually unblocked on my end, but I'll let you decide whether to keep the issue open to track the legacy eBPF driver issue. I did some light reading of the kernel release notes, and 5.15.111 had a couple of eBPF-related changes. I suspect that was the version that introduced the breaking change.
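
In case it helps with triage, here is a rough Python sketch that places a node's kernel release relative to the data points in this report. Only 5.15.107 (working) and 5.15.120 (failing) were actually observed. The 5.15.111 boundary is just my suspicion from the release notes and is unconfirmed. The function names are hypothetical.

```python
import re


def parse_kernel_release(release):
    """Parse the leading 'X.Y.Z' of a release string such as '5.15.120+'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


LAST_KNOWN_GOOD = parse_kernel_release("5.15.107+")  # observed working
FIRST_KNOWN_BAD = parse_kernel_release("5.15.120+")  # observed failing
SUSPECTED_CHANGE = (5, 15, 111)  # unconfirmed guess from release notes


def classify(release):
    """Describe where a kernel release sits relative to this report."""
    version = parse_kernel_release(release)
    if version == LAST_KNOWN_GOOD:
        return "reported working"
    if version == FIRST_KNOWN_BAD:
        return "reported failing"
    if LAST_KNOWN_GOOD < version < FIRST_KNOWN_BAD:
        if version >= SUSPECTED_CHANGE:
            return "untested, possibly affected"
        return "untested, possibly unaffected"
    return "no data in this report"


for release in ("5.15.107+", "5.15.109+", "5.15.115+", "5.15.120+"):
    print(release, "->", classify(release))
```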

Andreagit97 commented 9 months ago

> I feel bad, I definitely entered the modern-bpf with a typo (extra e). It works with the modern driver perfectly. I suspect the behavior with a typo was to default to the kernel module.

Don't worry, this is a common error. We are working on renaming it for the next release to improve the user experience!

> I'm actually unblocked on my end, but I'll let you decide whether to keep the issue open to track the legacy eBPF driver issue. I did some light reading of the kernel release notes, and 5.15.111 had a couple of eBPF-related changes. I suspect that was the version that introduced the breaking change.

Yes, unfortunately this is a known issue. Keeping a probe compatible with all kernel versions is really hard, but we will see if we can fix this. Since we are already tracking the verifier issue at https://github.com/falcosecurity/libs/issues/1521, I will close this one if that is OK with you. Feel free to reopen it if you run into other issues related to this.