falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Falco 0.37.1 modern_ebpf crashes server #3181

Closed: apsega closed this issue 3 months ago

apsega commented 4 months ago

Describe the bug

After upgrading Falco from 0.36.2 to 0.37.1 and switching the driver from ebpf to modern_ebpf, physical servers under higher load crash.

How to reproduce it

The crash occurs randomly over time on more heavily loaded physical servers.

Environment

Additional context

Crashdump:

[17284898.905756] IPv6: ADDRCONF(NETDEV_CHANGE): cali841dc279d4d: link becomes ready
[17285388.370981] IPv6: ADDRCONF(NETDEV_CHANGE): cali6a7f0dad2a8: link becomes ready
[17285491.259227] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[17285491.259283] IPv6: ADDRCONF(NETDEV_CHANGE): cali5d758ecb513: link becomes ready
[17285552.983963] BUG: unable to handle page fault for address: ffffffffff6000c7
[17285552.987818] #PF: supervisor read access in kernel mode
[17285552.991552] #PF: error_code(0x0000) - not-present page
[17285552.995304] PGD 6a0e067 P4D 6a0e067 PUD 6a10067 PMD 6a12067 PTE 0
[17285552.999051] Oops: 0000 [#1] PREEMPT SMP NOPTI
[17285553.002776] CPU: 31 PID: 95831 Comm: kube-proxy Kdump: loaded Not tainted 6.1.42-1.el8.x86_64 #1
[17285553.006737] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.11.4 03/22/2023
[17285553.010774] RIP: 0010:copy_from_kernel_nofault+0x6d/0x120
[17285553.014852] Code: f8 4c 89 e7 4b 8d 14 2c 31 f6 48 c1 e8 03 4d 8d 44 c4 08 eb 13 48 83 c7 08 48 89 d1 48 83 c3 08 48 29 f9 4c 39 c7 74 34 89 f1 <48> 8b 03 48 89 07 85 c9 74 e1 65 48 8b 04 25 c0 bb 01 00 83 a8 18
[17285553.023657] RSP: 0018:ffffc90003be7d80 EFLAGS: 00010256
[17285553.028208] RAX: 0000000000000000 RBX: ffffffffff6000c7 RCX: 0000000000000000
[17285553.033957] RDX: ffffc90003be7e18 RSI: 0000000000000000 RDI: ffffc90003be7e10
[17285553.038745] RBP: ffffc90003be7d98 R08: ffffc90003be7e18 R09: 0000000000000000
[17285553.043381] R10: 0000000000000001 R11: ffff88826a519990 R12: ffffc90003be7e10
[17285553.048067] R13: 0000000000000008 R14: 0000000000000000 R15: ffffc90003be7e98
[17285553.052769] FS:  000000c000d90890(0000) GS:ffff88fe7d9c0000(0000) knlGS:0000000000000000
[17285553.057962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17285553.062947] CR2: ffffffffff6000c7 CR3: 000000153ab3c000 CR4: 0000000000350ee0
[17285553.068504] Call Trace:
[17285553.074274]  <TASK>
[17285553.079218]  ? show_regs.cold.14+0x1a/0x1f
[17285553.084320]  ? __die_body+0x1f/0x70
[17285553.089309]  ? __die+0x2a/0x35
[17285553.094284]  ? _end+0x7b5da0c7/0x0
[17285553.099340]  ? page_fault_oops+0xaf/0x270
[17285553.104379]  ? bpf_probe_read_kernel+0x1d/0x50
[17285553.109575]  ? bpf_ringbuf_submit+0x10/0x20
[17285553.115044]  ? bpf_prog_182d4293644cc965_pf_kernel+0x549/0x558
[17285553.121418]  ? _end+0x7b5da0c7/0x0
[17285553.127468]  ? do_user_addr_fault+0x30b/0x590
[17285553.132943]  ? _end+0x7b5da0c7/0x0
[17285553.138381]  ? exc_page_fault+0x6f/0x160
[17285553.143782]  ? asm_exc_page_fault+0x27/0x30
[17285553.149265]  ? _end+0x7b5da0c7/0x0
[17285553.154742]  ? copy_from_kernel_nofault+0x6d/0x120
[17285553.160220]  bpf_probe_read_kernel+0x1d/0x50
[17285553.166254]  bpf_prog_3a9838b3cf5001f5_accept4_x+0x2e6/0x1589
[17285553.172566]  ? bpf_probe_read_kernel+0x1d/0x50
[17285553.178263]  ? bpf_prog_c5b1b737d5cb01c5_sys_exit+0x28f/0x50c
[17285553.184115]  bpf_trace_run2+0x54/0xd0
[17285553.189977]  __bpf_trace_sys_exit+0x9/0x10
[17285553.195917]  syscall_exit_to_user_mode_prepare+0x171/0x1d0
[17285553.202015]  syscall_exit_to_user_mode+0xd/0x40
[17285553.207926]  do_syscall_64+0x46/0x90
[17285553.214281]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[17285553.221453] RIP: 0033:0x42130e
[17285553.228105] Code: 20 4c 89 44 24 38 e8 31 3d ff ff 48 85 f6 0f 84 97 00 00 00 48 8b 54 24 78 49 89 f1 48 8b 74 24 48 4d 89 c8 49 29 d0 4d 8b 09 <4d> 85 c9 74 b1 4d 89 ca 49 29 d1 4c 39 ce 77 a6 4c 89 44 24 70 48
[17285553.240964] RSP: 002b:000000c000e51e88 EFLAGS: 00000206
[17285553.247460] RAX: 000000c003f36f70 RBX: 00000000000000d0 RCX: 000000000002aaa0
[17285553.254109] RDX: 000000c003f36f70 RSI: 00000000000000d0 RDI: 0000000000000012
[17285553.260788] RBP: 000000c000e51f08 R08: 0000000000000018 R09: 0000000000000000
[17285553.267735] R10: 000000000002aaaa R11: 0000000000000002 R12: 000000c000e51f08
[17285553.274514] R13: 000000000000000e R14: 000000c0005c6ea0 R15: 0000000002f14f80
[17285553.280551]  </TASK>
[17285553.286380] Modules linked in: xt_CT xt_multiport ipt_rpfilter ip_set_hash_net veth ip6t_REJECT nf_reject_ipv6 nf_conntrack_netlink ipt_REJECT nf_reject_ipv4 xt_addrtype xt_set ip_set_hash_ipportnet ip_set_hash_ipport ip_set_hash_ipportip ip_set_hash_ip ip_set_bitmap_port dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr xt_MASQUERADE xt_mark nft_chain_nat nf_nat xt_conntrack xt_comment nft_compat overlay ip_vs_sed ip_vs_lc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tcp_diag inet_diag amd64_edac edac_mce_amd kvm_amd kvm irqbypass wmi_bmof pcspkr rapl nf_tables sp5100_tco acpi_ipmi i2c_piix4 k10temp nfnetlink ipmi_si acpi_power_meter vfat fat sch_fq_codel ipmi_devintf ipmi_msghandler xfs libcrc32c dm_crypt sd_mod t10_pi crc64_rocksoft crc64 crct10dif_pclmul crc32_pclmul crc32c_intel sg ghash_clmulni_intel sha512_ssse3 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci i2c_algo_bit aesni_intel drm_shmem_helper crypto_simd libahci cryptd tg3 i40e drm ptp libata ccp pps_core
[17285553.286445]  megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[17285553.350752] CR2: ffffffffff6000c7

Installation uses the official Helm chart version 0.4.2 with the following values:

services:
  - name: k8saudit-webhook
    type: ClusterIP
    ports:
      - port: 9765
        protocol: TCP

# -- Tolerations to allow Falco to run on Kubernetes masters.
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

driver:
  kind: modern_ebpf
  modernEbpf:
    bufSizePreset: 8
  loader:
    initContainer:
      resources:
        requests:
          cpu: 10m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 1Gi

falcoctl:
  config:
    indexes:
    - name: falcosecurity
      url: https://falcosecurity.github.io/falcoctl/index.yaml
    artifact:
      allowedTypes:
        - rulesfile
        - plugin
      install:
        refs: [k8saudit-rules:0.7]
      follow:
      # -- List of artifacts to be followed by the falcoctl sidecar container.
        refs: [k8saudit-rules:0.7]
        # -- How often the tool checks for new versions of the followed artifacts.
        every: 1h

falco:
  rules_file:
    - /etc/falco/falco_rules.local.yaml
    - /etc/falco/rules.d
  json_output: true
  json_include_output_property: true
  json_include_tags_property: true
  http_output:
    enabled: true
    url: "http://falcosecurity-falcosidekick:80/"
  grpc:
    enabled: true
    bind_address: "unix:///run/falco/falco.sock"
    threadiness: 0 # 0 means "auto"
  grpc_output:
    enabled: true
  plugins:
    - name: k8saudit
      library_path: libk8saudit.so
      init_config:
        maxEventSize: "125829120"
        webhookMaxBatchSize: "125829120"
      open_params: "http://:9765/k8s-audit"
    - name: json
      library_path: libjson.so
      init_config: ""
  buffered_outputs: true
  load_plugins: [k8saudit, json]
  syscall_event_drops:
    actions:
      - ignore
    rate: "0.03333"
    max_burst: 10
  log_level: notice

resources:
  requests:
    cpu: 1
    memory: 12Gi
  limits:
    cpu: 2
    memory: 16Gi

# Collectors for data enrichment (scenario requirement)
collectors:
  docker:
    enabled: false
  crio:
    enabled: false
  kubernetes:
    enabled: false
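
Until a fixed release is available, one possible workaround (not suggested in this thread; a sketch assuming the chart's supported driver.kind values) is to pin the driver back to the legacy eBPF probe in the same Helm values:

```yaml
# Hedged sketch: revert to the legacy eBPF driver until the fix ships.
# The falco chart's driver.kind accepts kmod, ebpf, and modern_ebpf;
# "ebpf" selects the legacy probe that was in use before the upgrade.
driver:
  kind: ebpf
```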

Andreagit97 commented 4 months ago

Hey @apsega, thank you for reporting! We will take a look ASAP!

Andreagit97 commented 4 months ago

This should be the cause of the failure: https://github.com/falcosecurity/libs/pull/1858. We will probably release the fix with Falco 0.38.0 by the end of the month!

Just a question: do you see this page fault sporadically, or is it something that always happens?

apsega commented 4 months ago

@Andreagit97 it happens occasionally; it probably depends on the server load.

Andreagit97 commented 4 months ago

OK, got it, thank you! I've seen that you have the page-fault eBPF programs enabled (bpf_prog_182d4293644cc965_pf_kernel in the stack trace). Do you use page_fault events in your rules, i.e. something like evt.type = page_fault?

I ask because it is unusual to see the page-fault programs enabled, and these programs are probably generating a lot of events...
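
For reference, a rule whose condition references page_fault events would cause Falco to attach the page-fault eBPF programs; a hypothetical example (not taken from the reporter's config) looks roughly like this:

```yaml
# Hypothetical rule for illustration only: the page_fault event type in the
# condition is what enables the page-fault eBPF programs in the driver.
- rule: Example Page Fault Rule
  desc: Illustrative only; matches kernel page fault events
  condition: evt.type = page_fault
  output: Page fault observed (process=%proc.name pid=%proc.pid)
  priority: NOTICE
```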

apsega commented 4 months ago

Sorry for the delay. Apparently we don't have any rules containing page_fault. I'm wondering whether it's a misconfiguration on my end.
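
One quick way to double-check the deployed rules for page_fault references is a recursive grep over the configured rules locations (a sketch; the paths mirror the rules_file entries in the Helm values above and may differ in your deployment):

```shell
# Search the configured rules locations for any page_fault reference.
# Paths mirror the rules_file entries in the Helm values; adjust as needed.
grep -R "page_fault" /etc/falco/falco_rules.local.yaml /etc/falco/rules.d 2>/dev/null \
  || echo "no page_fault references found"
```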

Andreagit97 commented 3 months ago

This should be solved in Falco 0.38.0!