falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

Introduce conditional kernel-side event filtering #1557

Open · stevenbrz opened this issue 11 months ago

stevenbrz commented 11 months ago

Motivation

Hi! We're deploying Falco on large, highly utilized instances. Despite allocating an entire CPU core to Falco, we experience a high percentage of event drops. A high volume of nearly identical, benign events comes through on these hosts, and each one consumes resources as it runs through the rule evaluation pipeline.

Feature

It would be excellent to be able to specify a set of filters for dropping events in kernel-space before they even get allocated on the ring buffer. For example, a filter could ignore all exec events with a specific proc.cmdline or, similarly, open events with a given fd.name.

We're looking to try and make a patch supporting this, and it would be great if we could do it in such a way that it could ultimately be beneficial to the upstream.

Diving into the code, it looks like it could potentially live here, where we could peek into ctx and filter out syscalls that match patterns defined in the config. I'm not sure of the best way to do this generically, but even just supporting exec* and open* would likely benefit us a lot.

Any thoughts on this approach or if there's potentially a better way to do this?

Alternatives

We've tried adjusting base_syscalls.custom_set in the config to the minimum set we need, in addition to adjusting the ring buffer parameters, with no perceptible improvement.

Additional context

We're running the latest 0.36.2 release on a mixture of ARM and x86 boxes running CentOS Stream and AlmaLinux 9 with kernel versions 5.15 and later.

incertum commented 11 months ago

Thanks @stevenbrz. In fact, I've been thinking about pushing more functionality like filtering into the kernel driver ever since I started contributing. However, reality set in quickly ...

Here are some challenges to be aware of:

Check out our recently added kernel testing framework proposal here. It highlights that anything you do in the kernel driver happens in the application context, so yes, you can slow down all your apps, which SREs won't like. That's why we still have to find the right balance. We're all painfully aware that pushing all events up to userspace won't scale, but you also don't want to go crazy in the driver either.

Maintaining our drivers and ensuring they remain compatible across our extensive kernel support matrix is a significant burden. The eBPF verifier rejecting probes is one of the most frustrating experiences for adopters, and this PR for example highlights this pain point.

We've tried adjusting base_syscalls.custom_set in the config to the minimum set we need, in addition to adjusting the ring buffer parameters, with no perceptible improvement.

I suppose you tried different configs. Out of curiosity, was just monitoring fork/execve* syscalls already a problem?

Do you use Falco with very fine-tuned rules or are you looking to use it for generous data collection (aka noisy rules, lots of outputs)? Asking because this could also cause increased back pressure.

Do your servers have 96+ CPUs?

We're looking to try and make a patch supporting this, and it would be great if we could do it in such a way that it could ultimately be beneficial to the upstream.

Amazing! Looking forward to it! By the way, for such a significant new feature we typically open proposals. It would also be great to quantify performance improvements using an MVP.

I'd prioritize filtering fd.name by prefixes or similar first. I'd be surprised if just monitoring fork/execve syscalls causes performance issues. On that note, be aware of the sophisticated userspace state engine, which features a thread table / process cache so that we can traverse the process tree and more. Therefore, I wouldn't recommend voluntarily dropping fork/execve syscalls.

A high volume of nearly identical, benign events comes through on these hosts, and each one consumes resources as it runs through the rule evaluation pipeline.

Tangentially, I responded yesterday to this issue https://github.com/falcosecurity/rules/issues/196, which is also around event counting and benign events / anomaly detection.

stevenbrz commented 11 months ago

Hi @incertum, thanks for the quick reply!

Check out our recently added kernel testing framework proposal here. It highlights that anything you do in the kernel driver happens in the application context, so yes, you can slow down all your apps, which SREs won't like. That's why we still have to find the right balance. We're all painfully aware that pushing all events up to userspace won't scale, but you also don't want to go crazy in the driver either.

Yeah, it's a tricky balance. If this feature is disabled by default and remains an advanced/experimental setting (with these tradeoffs documented), would it be easier to argue for its inclusion?

I suppose you tried different configs. Out of curiosity, was just monitoring fork/execve* syscalls already a problem?

fork, exec, and open comprise the vast majority of relevant syscalls on our systems, so I wanted to target filtering most of the noise from those.

Do you use Falco with very fine-tuned rules or are you looking to use it for generous data collection (aka noisy rules, lots of outputs)? Asking because this could also cause increased back pressure.

We use the new base set of rules included in the 2.0.0 release. But we also ran an experiment with only a single dummy rule and still experienced drops, indicating that it could be related to the state engine record keeping you mentioned.

Do your servers have 96+ CPUs?

Yeah, we have Falco deployed on servers with up to 128 CPUs.

On that note, be aware of the sophisticated userspace state engine featuring a thread table / process cache so that we can traverse the process tree and more. Therefore, I wouldn't recommend voluntarily dropping fork/execve* syscalls.

We'd be willing to accept some missing process lineage/metadata stemming from filtering the noisier syscalls if it meant lowering our drop rate of potentially useful signal. This is assuming the state engine can handle cleaning out stale data if events are filtered.

incertum commented 11 months ago

Yeah, it's a tricky balance. If this feature is disabled by default and remains an advanced/experimental setting (with these tradeoffs documented), would it be easier to argue for its inclusion?

Yes, you can see throughout the project that newer features are typically disabled by default. However, the maintenance burden and making sure the eBPF verifier doesn't complain would still be there. Btw, there are other discussions around allowing attaching custom probes. Perhaps that could be a path forward for this as well? Details TBD in a proposal down the road.

fork, exec, and open comprise the vast majority of relevant syscalls on our systems, so I wanted to target filtering most of the noise from those.

If you could try with just the absolute minimum fork/exec*-related syscalls, nuke all the other ones, and report back whether that at least works, I would appreciate it 😉 Thanks in advance!

We had similar discussions in the past (96 CPU machines); for example, check out https://github.com/falcosecurity/falco/issues/2296#issuecomment-1467069249. Did you try testing such a matrix? Of course now it's base_syscalls.custom_set, not base_syscalls, for the array of syscall names. Process exit events are always enabled for state cleanup. Dry-run with -o "log_level=debug" -o "log_stderr=true" --dry-run to print the final set of syscalls that are enabled.

up to 128 CPUs.

Huh, ok, here is the problem ... I think you are the first adopter I know of who tries Falco on such massive servers. libscap scans each CPU's buffer on every next() call because Falco requires events to be time-ordered in userspace. I'm almost certain that's the bottleneck that puts back pressure on the probe in your setup. Someone is looking into it already, but it's very tricky.

incertum commented 11 months ago

@stevenbrz it just occurred to me: Could you try using the modern BPF driver instead (you use kernel 5.15 and are therefore eligible for it)? Try increasing the number of CPUs per ring buffer to 4 or 6 via modern_bpf.cpus_for_each_syscall_buffer in falco.yaml. It says DEPRECATED in the master branch, but it will only be replaced with a new string config name.

As per a conversation we had with the kernel eBPF experts, there should be no contention concerns kernel-side. The default is 2 CPUs for each buffer for the modern BPF driver because the kernel accounts memory wrongly (twice) for the new BPF ring buffer compared to the older perf buffer.

stevenbrz commented 11 months ago

So I tried each set of syscalls, adding the previous set to the next cumulatively instead of replacing it outright. Surprisingly, no drops until the third iteration! I also tried raising the CPUs per ring buffer from 2 to 6 and doubled the buffer size in the last round, but that didn't appear to make a difference. As for the rules, the following is the only file I included:

- macro: spawned_process
  condition: (evt.type in (execve, execveat) and evt.dir=< and proc.name=iShouldNeverAlert)

- rule: TEST Simple Spawned Process
  desc: >
    Test base_syscalls config option, ref https://hackmd.io/-nwsFyySTEKsjmjGHCyPRg?view
  enabled: true
  condition: >
    spawned_process
  output: |
    command line: %proc.cmdline
  priority: WARNING

https://gist.github.com/stevenbrz/69391aa71b22d205b6add88ae10fb905

incertum commented 11 months ago

:exploding_head: scap event rates are over 21 million / second!!! The tracepoint invocation rates from the libbpf stats (which you hadn't turned on) must be through the roof.

I hate to break it, but Falco cannot handle this (yet).

Falco can probably handle scap event rates of up to 100K / second with acceptable drop rates. Anything below 60-70K / second should likely be no problem / no drops.

Your servers have low "process spawn rates", but are very network and file open heavy. Typically file opens are the biggest problem AFAIK.

This may not be the advice you are hoping for, but you may want to consider cutting your losses (for now): at least perform security monitoring around spawned processes, then crawl toward solutions for adding other syscalls.

For example, because of the TOCTOU attack vector we expanded the monitoring scope to enter events (see https://github.com/falcosecurity/libs/pull/235/files). You could manually cut that in half if it's an acceptable risk (push empty params or revert those changes altogether). Wrt network traffic, you could also try only pushing TCP up and skipping any other traffic.

However, with your event rates you may really need very aggressive IP and file-path prefix filtering kernel-side (be aware of the socket or bind syscall inter-dependencies for some network-related syscalls). Maybe try to hard-code such an approach to see if there may be hope. If you search for "memfd:" in the code base you will find example eBPF code around string comparisons. And I forgot to answer one of your previous questions: add these patches into the respective "fillers"; those are the tail-called programs that process each event type and push the extracted params onto the buffer.
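
For illustration, a minimal sketch of what such a hard-coded kernel-side prefix check could look like; the helper name, buffer size, and prefix here are assumptions for this example, not actual libs code:

/* Minimal sketch of a kernel-side path-prefix filter; names and sizes
 * are illustrative assumptions, not the actual libs drivers code. */
#include <bpf/bpf_helpers.h>

#define MAX_PATH_LEN 256

static const char drop_prefix[] = "/proc";

/* Return 1 if the raw user-space path argument starts with drop_prefix. */
static __always_inline int should_drop_path(const char *user_path)
{
    char buf[MAX_PATH_LEN] = {};

    /* Copy at most MAX_PATH_LEN bytes, NUL-terminated; a negative
     * return means the read failed, in which case we keep the event. */
    if (bpf_probe_read_user_str(buf, sizeof(buf), user_path) < 0)
        return 0;

    /* Fixed-bound loop so the eBPF verifier can prove termination;
     * any mismatch (including an early NUL in buf) means no drop. */
    for (int i = 0; i < (int)sizeof(drop_prefix) - 1; i++)
    {
        if (buf[i] != drop_prefix[i])
            return 0;
    }
    return 1;
}

A filler would call this on the raw pathname argument and return early, before pushing any params onto the ring buffer, when it matches.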

stevenbrz commented 11 months ago

This is all great information, thank you! I'll look into what you said about enter events.

Add these patches into the respective "fillers"; those are the tail-called programs that process each event type and push the extracted params onto the buffer.

So just so I understand correctly: you mean I could, for example, patch logic into fill_event_open_x to look for some prefix in the filename and, on a match, abort the function early?

incertum commented 11 months ago

This is all great information, thank you! I'll look into what you said about enter events.

You should be able to return and drop out early before making the enter tail call. Userspace should be resilient to missing events.

Add these patches into the respective "fillers"; those are the tail-called programs that process each event type and push the extracted params onto the buffer.

So just so I understand correctly: you mean I could, for example, patch logic into fill_event_open_x to look for some prefix in the filename and, on a match, abort the function early?

That's what I would try. Or maybe for early testing try the reverse, that is, only push onto the buffer on a match, e.g. the /etc dir. Be aware that in the kernel you don't have fd.name; you only have the raw arg, which is closer to fs.path.nameraw, because even for fd.nameraw we sanitize possible consecutive /. For fd.name, relative paths are resolved to absolute paths in userspace using the state engine (e.g. the process' current working directory + the open* raw arg).
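
A sketch of that reverse gate, in the same bounded-compare style as above (the names are illustrative, not actual driver symbols; note it sees only the raw argument, never the userspace-resolved fd.name):

/* Sketch of the "reverse" test: only push the event when the raw path
 * matches a keep-prefix such as /etc; names are illustrative only. */
#include <bpf/bpf_helpers.h>

#define MAX_PATH_LEN 256

static const char keep_prefix[] = "/etc";

/* Return 1 only if the raw open* path starts with keep_prefix; a
 * hypothetical open-exit filler would push params only in that case. */
static __always_inline int should_keep_path(const char *user_path)
{
    char buf[MAX_PATH_LEN] = {};

    /* On read failure, keep the event to err on the safe side. */
    if (bpf_probe_read_user_str(buf, sizeof(buf), user_path) < 0)
        return 1;

    for (int i = 0; i < (int)sizeof(keep_prefix) - 1; i++)
    {
        if (buf[i] != keep_prefix[i])
            return 0; /* No match: skip pushing onto the ring buffer. */
    }
    return 1;
}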

Tagging some of our eBPF experts who may have additional advice @Andreagit97 @FedeDP :wink:

Very curious to see if there is any hope for such beefy servers :upside_down_face: thanks for reaching out!

b1shan commented 11 months ago

FWIW, we have been experiencing similar issues on our busiest servers. We are in the process of evaluating Tetragon, primarily for its in-kernel filtering capabilities. HTH.

stevenbrz commented 10 months ago

I've thrown together a patch that allows you to specify a set of filters in the config, with each filter consisting of the syscall number, the arg number of the string to filter on, and finally a set of filter prefixes, e.g.:

filters:
  - syscall: 257 # syscall number for `openat` on `x86_64` 
    arg: 1 # file path
    prefixes: ["/proc", "/sys/fs/cgroup"]

It appears to work well on one of our higher-load systems, reducing memory usage by ~60% and the drop rate by ~30% while filtering only on openat. Hoping to test this widely in the coming week.
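
For illustration of one possible shape, such configured filters could be carried into the kernel via a BPF hash map keyed by syscall and argument; the struct layouts and names below are assumptions for this sketch, not necessarily what the patch does:

/* Sketch: config-driven drop filters in a BPF hash map; all names and
 * layouts here are illustrative assumptions. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_PREFIXES 4
#define PREFIX_LEN   32

struct filter_key {
    __u32 syscall_nr; /* e.g. 257 for openat on x86_64 */
    __u32 arg_index;  /* which syscall arg holds the path (1 for openat) */
};

struct filter_val {
    char prefixes[MAX_PREFIXES][PREFIX_LEN]; /* NUL-terminated; "" = unused */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __type(key, struct filter_key);
    __type(value, struct filter_val);
} drop_filters SEC(".maps");

/* Called from a filler with a path already copied into a bounded,
 * NUL-terminated buffer (e.g. via bpf_probe_read_user_str). */
static __always_inline int should_drop(__u32 syscall_nr, __u32 arg_index,
                                       const char *path)
{
    struct filter_key key = { .syscall_nr = syscall_nr,
                              .arg_index = arg_index };
    struct filter_val *val = bpf_map_lookup_elem(&drop_filters, &key);

    if (!val)
        return 0; /* No filter configured for this syscall/arg. */

    for (int p = 0; p < MAX_PREFIXES; p++)
    {
        const char *pre = val->prefixes[p];
        int match = 1;

        if (pre[0] == '\0')
            break; /* End of configured prefixes. */

        /* Fixed bounds keep the verifier happy. */
        for (int i = 0; i < PREFIX_LEN; i++)
        {
            if (pre[i] == '\0')
                break; /* Whole prefix matched. */
            if (path[i] != pre[i])
            {
                match = 0;
                break;
            }
        }
        if (match)
            return 1;
    }
    return 0;
}

Userspace would populate drop_filters from the config at startup, and the relevant fillers would consult it before pushing params onto the ring buffer.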

FedeDP commented 10 months ago

That's amazing! For which driver? All of them? Anyway, thanks, waiting for more data!

stevenbrz commented 10 months ago

Ah sorry, I only implemented it for the modern-ebpf driver.

incertum commented 9 months ago

prefixes: ["/proc", "/sys/fs/cgroup"]

Just acknowledging that kernel-side (where we only have the raw arg) we can see //////proc or path traversals; that's what I would do as an attacker to circumvent the filter right away, since the Linux kernel happily still opens the file I want that way. Same if I just cd into a directory and open a file without providing the absolute path. In userspace we are more robust against such tricks, and/or we can look at both the raw arg and the resolved paths.

Still thinking ... not sure yet what we could do about that to support most use cases ...

higher-load systems, reducing memory usage by ~60% and the drop rate by ~30%

This is extremely valuable feedback, thanks so much for that! Hmmm same thinking ...

stevenbrz commented 9 months ago

Just acknowledging that kernel-side (where we only have the raw arg) we can see //////proc or path traversals; that's what I would do as an attacker to circumvent the filter right away

Yeah great catch!

So if the concern is an attacker abusing the filters to, for example, open /tmp/evil by passing /proc/../tmp/evil, we can scan the path for .. before filtering and allow the event through if any are found. As for the //////proc example, that would just let through paths that we'd want to filter. Not ideal, but not introducing a vulnerability either, right?
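
For concreteness, that .. scan could look like the following sketch, assuming the path was already copied into a bounded, NUL-terminated buffer kernel-side (the helper name is hypothetical):

/* Sketch of the ".." escape hatch: if the raw path contains a dot-dot
 * sequence, skip prefix filtering and push the event as usual. */
#include <bpf/bpf_helpers.h>

#define MAX_PATH_LEN 256

static __always_inline int contains_dotdot(const char *buf)
{
    /* Fixed bound for the verifier; buf is NUL-terminated, e.g. by
     * bpf_probe_read_user_str. */
    for (int i = 0; i + 1 < MAX_PATH_LEN; i++)
    {
        if (buf[i] == '\0')
            return 0;
        if (buf[i] == '.' && buf[i + 1] == '.')
            return 1;
    }
    return 0;
}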

Another attack vector I can think of: you wouldn't want to filter any directories that an attacker has write access to, since they could then either place their payload under that directory or symlink from it. Given that falco.yaml has the same visibility on most systems as the rules files, it seems like an equivalent amount of risk, since the rules files can likewise be tampered with to circumvent detection.

incertum commented 9 months ago

Stepping back to first find out whether Falco can be usable at all when attempting to monitor files on your system, what would you think about the following approach?

The kernel driver has some sampling logic (not used in Falco! It's used in the original sysdig tool for system diagnosis purposes). Not sure if this is for you, but I would first like to find out what "kernel event rate" Falco can handle on your systems. We could find this out by adjusting the sampling in experiments, but I know it's gonna be some work to perform all these tests. WDYT?

After that we could come back and see how aggressive possible kernel side filtering would need to be or we find out there are more problems elsewhere.

stevenbrz commented 9 months ago

Hi @incertum, sorry for the late response. I agree it would be incredibly valuable to know the event rate Falco can support. I assume it would be pretty uniform across host types given the event processor is single-threaded.

Further, I think building some sort of benchmark across Falco or driver versions, or even rulesets, would be a nice way to quantify the effectiveness of the various tweaks we want to test.

Will try to throw something together soon - still working on testing and iterating on my filtering patch at the moment.

incertum commented 9 months ago

Thank you @stevenbrz - we now have a new repo https://github.com/falcosecurity/cncf-green-review-testing for benchmarking purposes. We are still developing it - also on the CNCF side - and we would love your contributions to help shape these efforts. We call it "kernel event rate" for now - I just have a suspicion that it's not just the pure event rate; it could also have to do with the nature of events and bursts. Lastly, another opportunity would be to help us shape some of the new guides: https://falco.org/docs/troubleshooting/dropping/

stevenbrz commented 7 months ago

Hi, we were able to test our patch more widely, so here are some of our results:

For event rate over the course of ~4 days, here's what we have (aggregating over scap.evts_rate_sec):

count      1022425.000000 (number of metrics payloads)
mean         78169.513609
std          56479.322696
min              0.000000
50%          71634.525317
90%         139310.344902
99%         260659.990578
99.9%       565055.582845
99.99%      734383.889699
99.999%     779872.462636
max         824091.991662

Now for drop rate (percentages from scap.n_drops_perc):

Falco version 0.36.2:

count      1022425.000000
mean             0.282009
std              3.849407
min              0.000000
50%              0.000000
90%              0.000016
99%              1.781471
99.9%           77.758152
99.99%          94.549621
99.999%         97.712998
max             99.290935

Falco version 0.37.0 with our kernel filter, filtering out the following path prefixes for file open events: ["/proc", "/mnt/hadoop-yarn", "/usr/lib/locale", "/usr/lib/hadoop", "/opt/iptables18-static", "/sys/fs/cgroup"]:

count      104192.000000
mean            0.472717
std             2.702165
min             0.000000
50%             0.000000
90%             0.593645
99%             9.937187
99.9%          46.086308
99.99%         71.471973
99.999%        76.030033
max            76.692583

It looks like we get drastically better worst-case performance at the cost of slightly worse 90-99th percentile performance, which is a win for us. The switch from 0.36.2 to 0.37.0 is another variable, but judging by the changelog it's likely not too significant.

incertum commented 7 months ago

Thank you very much for sharing these updates @stevenbrz. I wrote it earlier and still believe that kernel-side filtering needs to be part of Falco's future (one way or another), while of course we still have to find the best way(s) of doing it and it's going to be opt-in only for sure.

@falcosecurity/libs-maintainers proposing to move ahead with a formal proposal under https://github.com/falcosecurity/libs/tree/master/proposals to discuss details and timelines more concretely? WDYT @leogr? Realistically, such a new feature will take at least 2 releases from proposal to first implementation.

leogr commented 7 months ago

@stevenbrz

Falco Version 0.37.0 with our kernel filter filtering out

Is your patch publicly available?

@incertum

@falcosecurity/libs-maintainers proposing to move ahead with a formal proposal under https://github.com/falcosecurity/libs/tree/master/proposals to discuss details and timelines more concretely? WDYT @leogr? Realistically, such a new feature will take at least 2 releases from proposal to first implementation.

We can still discuss in this thread, but discussing on a proposal draft would work as well. I'd be very curious to learn about the current PoC first.

stevenbrz commented 7 months ago

Is your patch publicly available?

Sure, you can find the diff for falco here and falcosecurity-libs here. It's pretty rough around the edges, but seems to do the trick on our systems in the meantime before we can get a more polished solution.

incertum commented 7 months ago

@stevenbrz thanks for sharing the patches. Shall we tackle the proposal after the Falco 0.38.0 release (end of May)?

This would give all maintainers a bit more time to take a look at the current patch and comment here a bit more. At the same time, please feel free to go ahead and open the proposal already, as @leogr suggested.

[Just FYI: I'll be out quite a bit the upcoming weeks, I'll take a much closer look end of May.]

stevenbrz commented 7 months ago

Sure, I can work on opening a proposal summarizing the feature in the coming days.

poiana commented 4 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP commented 4 months ago

/remove-lifecycle stale

poiana commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

leogr commented 4 weeks ago

/remove-lifecycle stale