falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

[DISCUSSION] New `base_syscalls.exclude_enter_exit_set` config #2960

Open incertum opened 11 months ago

incertum commented 11 months ago

Motivation

The hardware landscape is evolving towards machines with 96, 128, or more CPUs. However, Falco currently faces usability challenges on such machines, particularly those handling heavy traffic, especially network- and file-related activity.

One potential solution could involve allowing end users to specify a subset of enter or exit syscall events they want to drop on the kernel side. This feature would be flagged as very risky to use, similar to the existing base_syscalls feature.

For instance, users might opt to drop enter syscall events for open* and connect syscalls, even though they are aware that doing so could expose them to TOCTOU attacks (mitigated by default via this PR). Nevertheless, this trade-off might be preferable to completely disabling Falco.

Feature

Introduce a new config `base_syscalls.exclude_enter_exit_set`, allowing exclusion of specific enter or exit events for syscalls that are part of the `custom_set`. Exclusion is limited to cases where dropping the enter or the exit event makes sense. The option needs thorough documentation.
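
To make the proposal concrete, here is a minimal Python sketch of the selection logic such an option implies. Everything in it is hypothetical: the function name `enabled_events`, the `(syscall, direction)` tuple format, and the example syscalls are illustrative assumptions, not Falco's actual config schema or implementation.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: (syscall name, "enter" or "exit").
Event = Tuple[str, str]


def enabled_events(custom_set: Iterable[str],
                   exclude_enter_exit_set: Iterable[Event]) -> Set[Event]:
    """Return the (syscall, direction) pairs that would still be captured.

    Assumed model: every syscall in custom_set produces an enter and an exit
    event unless that exact pair is listed in exclude_enter_exit_set.
    Exclusions naming a syscall outside custom_set are rejected, because the
    proposal limits exclusions to syscalls that are part of the custom set.
    """
    syscalls = set(custom_set)
    excluded = set(exclude_enter_exit_set)
    for name, direction in excluded:
        if direction not in ("enter", "exit"):
            raise ValueError(f"unknown direction: {direction!r}")
        if name not in syscalls:
            raise ValueError(f"{name!r} is not in the custom set")
    all_events = {(name, d) for name in syscalls for d in ("enter", "exit")}
    return all_events - excluded


# Example: drop enter events for open, openat, and connect only.
events = enabled_events(
    custom_set=["open", "openat", "connect", "execve"],
    exclude_enter_exit_set=[
        ("open", "enter"),
        ("openat", "enter"),
        ("connect", "enter"),
    ],
)
```

In this sketch the exit events for the excluded syscalls are still captured, which mirrors the trade-off described above: fewer kernel events in exchange for exposure to TOCTOU on the dropped enter events.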

Additional context

https://github.com/falcosecurity/libs/issues/1557

CC @falcosecurity/libs-maintainers

incertum commented 11 months ago

@stevenbrz let's see if the other maintainers are on board. If yes, it could be a great "warm up" contribution for you to take on :wink:

Andreagit97 commented 11 months ago

Yes, Falco doesn't scale on these huge servers, and we need to find a way to mitigate this. One idea could be:

  1. Adapt our sinsp state so that it is populated only by exit events; enter events are needed only to mitigate TOCTOU or on old kernel versions.
  2. Once sinsp can reconstruct the state from exit events alone, we can disable all enter events, informing our users that this turns Falco into a best-effort detection mode that could be vulnerable to some attacks. I would prefer to remove all enter events to reduce complexity, rather than keeping a sort of simple consumer just for enter events :exploding_head:. This step will halve our kernel events, which is already a great result.
  3. With event throughputs of 20 million/s, the previous point is not enough: we would get down to 10 million/s, but Falco cannot handle that either, so we need a sort of hash table in the drivers to filter exit events. My idea would be to expose an API in sinsp that allows different filters (on the comm, on the exepath, on the cmdline, ...). These filters are evaluated in userspace when we read the event from `next()`: if we have a match, we add the pid of this process to the hash table used by the drivers, so the following events are excluded kernel side. Of course, we need to evaluate how many filters we can process, because this could be quite heavy. Moreover, I would avoid filtering clone/execve/proc_exit events: we have already seen that these don't cause perf overhead, and we need them to keep a reliable process tree inside sinsp.

This is just an idea, but maybe it could work.
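
The third point can be sketched as a small Python simulation. This is only an illustration under stated assumptions: the class `PidFilter`, its method names, and the `SyscallEvent` fields are hypothetical, a plain Python set stands in for the hash table the drivers would consult, and no real sinsp or driver API is used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

# Event types that are never filtered, so the process tree stays reliable.
ALWAYS_KEEP = {"clone", "execve", "proc_exit"}


@dataclass
class SyscallEvent:
    pid: int
    type: str
    comm: str


@dataclass
class PidFilter:
    # Userspace predicates, e.g. matching on comm, exepath, or cmdline.
    predicates: List[Callable[[SyscallEvent], bool]]
    # Stand-in for the hash table shared with the drivers.
    excluded_pids: Set[int] = field(default_factory=set)

    def driver_should_drop(self, ev: SyscallEvent) -> bool:
        """Simulated kernel-side check: drop events from excluded pids."""
        return ev.pid in self.excluded_pids and ev.type not in ALWAYS_KEEP

    def userspace_inspect(self, ev: SyscallEvent) -> None:
        """Simulated userspace check: on a match, exclude the pid from now on."""
        if any(pred(ev) for pred in self.predicates):
            self.excluded_pids.add(ev.pid)

    def process(self, stream: List[SyscallEvent]) -> List[SyscallEvent]:
        """Return the events that reach userspace."""
        delivered = []
        for ev in stream:
            if self.driver_should_drop(ev):
                continue
            delivered.append(ev)
            self.userspace_inspect(ev)
        return delivered


flt = PidFilter(predicates=[lambda ev: ev.comm == "noisy-proxy"])
stream = [
    SyscallEvent(10, "openat", "noisy-proxy"),   # delivered, then pid 10 is excluded
    SyscallEvent(10, "connect", "noisy-proxy"),  # dropped by the simulated driver
    SyscallEvent(10, "execve", "noisy-proxy"),   # kept: process-tree event
    SyscallEvent(11, "openat", "sshd"),          # unaffected pid
]
delivered = flt.process(stream)
```

Note that the first event from a matching process always reaches userspace, since the pid is only added to the table after a predicate matches. The cost of evaluating the predicates is exactly the overhead the comment above says would need to be measured.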

incertum commented 11 months ago

> Moreover, I would avoid filtering clone/execve/proc_exit events, we have already seen these don't cause perf overhead and we need them to keep a reliable process tree inside sinsp.

Big +1, those aren't an issue.

cccsss01 commented 10 months ago

I'm in support of this.

leogr commented 10 months ago

I'm in favor of investigating this front :+1:

poiana commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum commented 7 months ago

/remove-lifecycle stale

Andreagit97 commented 4 months ago

/remove-lifecycle stale

leogr commented 1 month ago

/remove-lifecycle stale

Andreagit97 commented 1 month ago

See https://github.com/falcosecurity/libs/pull/2068