falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

High CPU Usage when using Falco v0.28.0 and higher #2996

Closed · Junia18 closed this issue 1 month ago

Junia18 commented 10 months ago

I am encountering a critical issue while running Falco v0.36.2 as a DaemonSet on an OpenShift cluster (OCP v4.12) with three worker nodes. The problem manifests as a substantial syscall event drop rate and a roughly doubled CPU usage across all three worker nodes.

This behavior is particularly concerning as it impacts the effective functioning of Falco and the overall performance of the cluster. The kernel module driver is used in this configuration (see the logs below).

Steps to Reproduce:

CPU usage before (when using Falco versions up to v0.26.2)

> oc adm top nodes
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.sweden.cp.fyre.ibm.com   1577m        21%    10056Mi         69%
master1.sweden.cp.fyre.ibm.com   1301m        17%    12124Mi         83%
master2.sweden.cp.fyre.ibm.com   899m         11%    10133Mi         69%
worker0.sweden.cp.fyre.ibm.com   2022m        13%    12960Mi         42%
worker1.sweden.cp.fyre.ibm.com   2422m        15%    7887Mi          26%
worker2.sweden.cp.fyre.ibm.com   1844m        11%    11767Mi         38%

CPU usage after (when using Falco versions > v0.27.0)

> oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.normandy.cp.fyre.ibm.com   2014m        26%    8543Mi          58%
master1.normandy.cp.fyre.ibm.com   1090m        14%    7409Mi          51%
master2.normandy.cp.fyre.ibm.com   1506m        20%    10557Mi         72%
worker0.normandy.cp.fyre.ibm.com   4468m        28%    10971Mi         36%
worker1.normandy.cp.fyre.ibm.com   4059m        26%    12588Mi         41%
worker2.normandy.cp.fyre.ibm.com   4043m        26%    12829Mi         42%

Logs

Fri Jan  5 09:23:35 2024: Setting metadata download max size to 100 MB
Fri Jan  5 09:23:35 2024: Setting metadata download chunk wait time to 1000 μs
Fri Jan  5 09:23:35 2024: Setting metadata download watch frequency to 1 seconds
Fri Jan  5 09:23:35 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Fri Jan  5 09:23:35 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Fri Jan  5 09:23:35 2024: Loaded event sources: syscall
Fri Jan  5 09:23:35 2024: Enabled event sources: syscall
Fri Jan  5 09:23:35 2024: Opening event source 'syscall'
Fri Jan  5 09:23:35 2024: Opening 'syscall' source with Kernel module
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:06 2024: Syscall event drop but token bucket depleted, skipping actions

Environment

Andreagit97 commented 10 months ago

ei @Junia18 thank you for reporting! I see that in Falco 0.36.2 you have the k8s metadata enrichment enabled. This unfortunately can cause huge drops... I'm not sure which deployment method you are using, but the k8s enrichment can be disabled by removing the -k/-K flags if you are running the Falco binary manually, or with the following command if you are using the default Falco Helm chart:

helm install falco falcosecurity/falco  --set collectors.kubernetes.enabled=false 
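For reference, if Falco is deployed as a plain DaemonSet rather than via Helm, the same change amounts to dropping the two flags from the container args. A minimal sketch, assuming a typical Falco DaemonSet spec (the image tag and socket path are taken from this thread, not from your manifests):

containers:
  - name: falco
    image: falcosecurity/falco:0.36.2
    args:
      - /usr/bin/falco
      - --cri
      - /var/run/crio/crio.sock
      - -pk
      - -v
      # -k/-K (and their URL/token arguments) are removed here, which
      # disables the legacy in-process k8s metadata client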

Could you disable it and retry please?

Junia18 commented 10 months ago

@Andreagit97 I have removed the -k and -K flags, but the -pk flag is still needed since we consume Kubernetes metadata in the output. Despite these changes, CPU utilization remains high. We are running the Falco binary using the following command:

falco -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://$(KUBERNETES_SERVICE_HOST) --cri /var/run/crio/crio.sock -pk -v

Is there any other solution I can try?

alacuku commented 10 months ago

Hi @Junia18, could you try the new k8s-metacollector and k8smeta plugin? We are going to ship them with Falco 0.37, but you can already test them with Falco 0.36.2. Here you can find the modified chart: https://github.com/falcosecurity/charts/pull/598
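With the reworked chart from that PR, enabling the new collector should look roughly like the following (a sketch; note that in the reworked chart collectors.kubernetes.enabled switches on the new k8s-metacollector/k8smeta pair, whereas in the pre-rework chart the same key controlled the legacy -k/-K client, so verify the exact values against the PR):

helm install falco falcosecurity/falco \
  --set collectors.kubernetes.enabled=true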

For more info please see the following issue: https://github.com/falcosecurity/falco/issues/2973

Let me know if you need any help!

Junia18 commented 10 months ago

@alacuku I was wondering if there's any information available about the anticipated release date for Falco v0.37.0. I appreciate your time and assistance.

alacuku commented 10 months ago

The expected release date for Falco 0.37 is the end of January 2024.

poiana commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Junia18 commented 7 months ago

/remove-lifecycle stale

Junia18 commented 7 months ago

@Andreagit97 Despite upgrading to Falco v0.37.1, we continue to experience high CPU usage, accompanied by frequent syscall event drops in the logs.

CPU usage after upgrading Falco to v0.37.1

➜ oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master0.brussels.cp.fyre.ibm.com   1458m        19%    11325Mi         78%       
master1.brussels.cp.fyre.ibm.com   1123m        14%    11305Mi         78%       
master2.brussels.cp.fyre.ibm.com   1851m        24%    9500Mi          65%       
worker0.brussels.cp.fyre.ibm.com   6779m        43%    15696Mi         51%       
worker1.brussels.cp.fyre.ibm.com   5537m        35%    17222Mi         56%       
worker2.brussels.cp.fyre.ibm.com   4758m        30%    14445Mi         47%

Logs

Tue Apr 16 14:51:20 2024: Falco version: 0.37.1 (x86_64)
Tue Apr 16 14:51:20 2024: CLI args: /usr/bin/falco -c /config/falco_config.yaml --cri /var/run/crio/crio.sock -pk -v
Tue Apr 16 14:51:20 2024: Falco initialized with configuration file: /config/falco_config.yaml
Tue Apr 16 14:51:20 2024: System info: Linux version 5.14.0-284.55.1.el9_2.x86_64 (mockbuild@x86-64-02.build.eng.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024
Tue Apr 16 14:51:20 2024: Configured rules filenames:
Tue Apr 16 14:51:20 2024: /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Loading rules from file /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_config.yaml'
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_rules.yaml'
Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Tue Apr 16 14:51:20 2024: Loaded event sources: syscall
Tue Apr 16 14:51:20 2024: Enabled event sources: syscall
Tue Apr 16 14:51:20 2024: Opening event source 'syscall'
Tue Apr 16 14:51:20 2024: Opening 'syscall' source with Kernel module
Tue Apr 16 14:54:01 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:04 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:06 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:07 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:08 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:55:02 2024: Syscall event drop but token bucket depleted, skipping actions

Running the Falco binary with the following command:

falco -c /config/falco_config.yaml --cri /var/run/docker.sock -pk -v

New Environment:

Junia18 commented 6 months ago

@Andreagit97 Just a gentle reminder that I have been waiting for your input on this issue for over two weeks now. Any updates or insights you can provide would be greatly appreciated. Thanks!

incertum commented 6 months ago

Hi 👋 @Junia18, happy to also help, since Andrea is constantly working on many areas. A lot has changed between v0.28 and now.

I recommend following our troubleshooting guide on event drops, https://falco.org/docs/troubleshooting/dropping/, and exploring the base_syscalls option and other optimizations; the relevant falco.yaml knobs are sketched below.
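For reference, the knobs that guide walks through live in falco.yaml. A rough sketch of the relevant stanzas with their shipped defaults (key names as of roughly v0.36/v0.37; verify against the falco.yaml of your version):

# Per-CPU ring buffer size: preset 4 corresponds to the 8 MiB buffer seen in
# the logs above; raising the preset can absorb event bursts at a memory cost
syscall_buf_size_preset: 4

# What to do when the kernel driver reports dropped events; the token bucket
# behind the "token bucket depleted" messages is governed by rate/max_burst
syscall_event_drops:
  actions:
    - log
    - alert
  rate: 0.03333
  max_burst: 1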

In your original comment you are showing metrics from two different clusters (https://github.com/falcosecurity/falco/issues/2996#issue-2069963392), which unfortunately makes a truly fair comparison hard, and looking at the CPU usage the delta does not appear huge. Perhaps also check out this paragraph, https://falco.org/docs/metrics/performance/#server-load-and-falco-event-drops, which explains that Falco's resource usage is never constant. If you are interested in a fairer comparison: would you consider running the different Falco versions on exactly the same nodes, and also exporting our internal metrics (https://falco.org/docs/metrics/falco-metrics/), so we can check whether event rates etc. were similar?
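For the metrics part, a sketch of the falco.yaml stanza that turns on the internal metrics in v0.37 (key names as I recall them from the shipped falco.yaml; double-check your copy):

metrics:
  enabled: true
  interval: 15m          # how often a metrics snapshot is emitted
  output_rule: true      # emit each snapshot through the standard outputs
  resource_utilization_enabled: true    # CPU/memory usage of Falco itself
  kernel_event_counters_enabled: true   # kernel-side event and drop counters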

Please let us know if the guides are helpful.

incertum commented 6 months ago

^ Small correction to https://github.com/falcosecurity/falco/issues/2996#issuecomment-2096769592: the internal metrics did not exist yet in v0.28, so we can't make the more detailed comparison there.

Andreagit97 commented 6 months ago

ei @Junia18, as suggested by @incertum I would use the base_syscalls.repair option.

Looking at your current situation, you are tracing 70 syscalls:

Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork

With base_syscalls.repair=true you should obtain something like:

Tue May  7 11:12:06 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue May  7 11:12:06 2024: +(18) syscalls (Falco's state engine set of syscalls): capset, chdir, chroot, clone, clone3, close, execveat, fchdir, fork, prctl, procexit, setgid, setpgid, setresgid, setresuid, setsid, setuid, vfork
Tue May  7 11:12:06 2024: (30) syscalls selected in total (final set): capset, chdir, chroot, clone, clone3, close, execve, execveat, fchdir, fork, kill, link, linkat, mkdir, mkdirat, open, openat, prctl, procexit, rename, rmdir, setgid, setpgid, setresgid, setresuid, setsid, setuid, unlink, unlinkat, vfork

It's enough to start Falco with -o base_syscalls.repair=true, or to change the option directly in the config. This should improve the drop situation.
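In config form, the equivalent stanza is (a sketch matching the base_syscalls block in the shipped falco.yaml, to the best of my knowledge):

base_syscalls:
  custom_set: []   # empty: let Falco derive the set from the loaded rules
  repair: true     # keep only the state-engine syscalls those rules need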

poiana commented 3 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 2 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

johannespostler commented 2 months ago

We experience the same issue on a number of Ubuntu 22.04 VMs running Falco 0.37.1.

poiana commented 1 month ago

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.

/close

poiana commented 1 month ago

@poiana: Closing this issue.

In response to [this](https://github.com/falcosecurity/falco/issues/2996#issuecomment-2410691718):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue with `/reopen`.
>
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Provide feedback via https://github.com/falcosecurity/community.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.