aquasecurity / tracee

Linux Runtime Security and Forensics using eBPF
https://aquasecurity.github.io/tracee/latest
Apache License 2.0

Drop events/rules with low priority on high load #1343

Open yanivagman opened 2 years ago

yanivagman commented 2 years ago

When system load is high, we might be required to drop some events/rules. Currently we have neither a mechanism to prioritize events/rules nor a mechanism to reduce the load consumed by tracee-ebpf and tracee-rules. To improve system performance under high load, the following can be implemented:

  1. Add priority field for each event
  2. Add priority field for each rule
  3. In tracee-ebpf, update a bpf map (use already existing chosen_events map?) with events to drop that are of low priority when required
  4. On rules engine in tracee-rules, don't send events to rules with low priority when required
  5. Implement load monitoring in tracee-ebpf
  6. Implement load monitoring in tracee-rules
  7. Expose an API to set events/rules priority
  8. Expose an API to provide statistics of tracee-ebpf and tracee-rules dropped events/rules
itaysk commented 2 years ago

On events pipeline in tracee-ebpf, drop events with low priority when required

That might be too late. If we're defining the desired solution here, I think we want to drop it in eBPF (should_trace).

On rules engine in tracee-rules, don't send events to rules with low priority when required

isn't this redundant if we're dropping the events in tracee-ebpf?

Implement load monitoring in tracee-ebpf
Implement load monitoring in tracee-rules
Expose an API to provide statistics of tracee-ebpf and tracee-rules dropped events/rules

related to #887

Expose an API to set events/rules priority

related to #636

yanivagman commented 2 years ago

On events pipeline in tracee-ebpf, drop events with low priority when required

That might be too late. If we're defining the desired solution here, I think we want to drop it in eBPF (should_trace).

Yes, I was just about to update this issue with the following suggestion: set bpf map with the events to drop and drop in bpf code.

On rules engine in tracee-rules, don't send events to rules with low priority when required

isn't this redundant if we're dropping the events in tracee-ebpf?

No. There might be rules that use events with high importance, for example execve, while the rule itself is not that important.

Implement load monitoring in tracee-ebpf
Implement load monitoring in tracee-rules
Expose an API to provide statistics of tracee-ebpf and tracee-rules dropped events/rules

related to #887

Expose an API to set events/rules priority

related to #636
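The distinction between event priority and rule priority can be illustrated with a minimal sketch. It assumes a priority scale where 1 is most important and 5 is least; the `Rule` and `Engine` types and the `Dispatch` method are hypothetical and are not part of tracee-rules. A high-importance event such as execve still reaches the engine, but only rules at or above the current threshold receive it, and skipped deliveries are counted so they could be reported as statistics.

```go
package main

// Priority is a hypothetical scale: 1 is most important, 5 is least.
type Priority int

// Rule is a hypothetical stand-in for a tracee-rules rule.
type Rule struct {
	Name     string
	Priority Priority
	Events   []string // names of the events this rule consumes
}

// Engine dispatches events to rules and skips low-priority rules under load.
type Engine struct {
	Rules     []Rule
	Threshold Priority       // rules with Priority > Threshold are skipped
	Skipped   map[string]int // per-rule count of skipped deliveries
}

// Dispatch returns the names of the rules that receive the event.
func (e *Engine) Dispatch(event string) []string {
	var delivered []string
	for _, r := range e.Rules {
		if !consumes(r, event) {
			continue
		}
		if r.Priority > e.Threshold {
			if e.Skipped == nil {
				e.Skipped = make(map[string]int)
			}
			e.Skipped[r.Name]++
			continue
		}
		delivered = append(delivered, r.Name)
	}
	return delivered
}

// consumes reports whether the rule subscribes to the named event.
func consumes(r Rule, event string) bool {
	for _, name := range r.Events {
		if name == event {
			return true
		}
	}
	return false
}
```

With two rules that both consume execve, one at priority 1 and one at priority 5, and a threshold of 3, only the first rule receives the event and the second is recorded in `Skipped`.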

mtcherni95 commented 2 years ago

Yes, I was just about to update this issue with the following suggestion: set bpf map with the events to drop and drop in bpf code.

So what you are suggesting here is to create a new bpf map, should_drop, and if an event is defined there then we won't call events_perf_submit?

yanivagman commented 2 years ago

Yes, I was just about to update this issue with the following suggestion: set bpf map with the events to drop and drop in bpf code.

So what you are suggesting here is to create a new bpf map, should_drop, and if an event is defined there then we won't call events_perf_submit?

Maybe that won't be necessary if we use the already existing chosen_events map.

mtcherni95 commented 2 years ago

Sounds good. I am just concerned about concurrency. I believe we should start implementing synchronization mechanisms for our maps, at least from user space.
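One possible shape for user-space synchronization is a small wrapper that guards every access with a lock. This is only a sketch: `EventSet` and its methods are hypothetical names, the Go map stands in for the real BPF map, and the lock only serializes user-space goroutines. It does not synchronize with BPF programs reading the kernel-side map.

```go
package main

import "sync"

// EventSet is a hypothetical user-space mirror of a BPF map keyed by event ID.
// The mutex only serializes user-space goroutines; it does not synchronize
// with BPF programs reading the kernel-side map.
type EventSet struct {
	mu      sync.RWMutex
	enabled map[uint32]bool
}

// NewEventSet returns an empty set.
func NewEventSet() *EventSet {
	return &EventSet{enabled: make(map[uint32]bool)}
}

// Set enables or disables one event ID. A real implementation would also
// write the change to the BPF map while still holding the lock.
func (s *EventSet) Set(id uint32, on bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.enabled[id] = on
}

// Enabled reports whether the event ID is currently enabled.
func (s *EventSet) Enabled(id uint32) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.enabled[id]
}
```

A read-write lock is used here on the assumption that lookups are far more frequent than updates; a plain `sync.Mutex` would work equally well if that does not hold.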

mtcherni95 commented 2 years ago

After talking it over with @yanivagman, a first approach would be to:

  1. Add a priority to each event (for example, a value from 1 to 5)
  2. Keep a state holding the current minimal priority (5 is the default and is treated as the minimal threshold)
  3. Expose an API, DropLoad, which goes over the events in chosen_events and removes those whose priority is greater than the current minimal priority, then sets the current minimal priority to the old value minus 1.

The DropLoad API can be used manually, and in the future we could add a self-healing mechanism that gets statistics from a monitoring engine (e.g. OpenTelemetry) and has tracee-ebpf call DropLoad automatically when the system is overwhelmed. Something similar can be done with tracee-rules.
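The three steps above can be sketched in user space as follows. Everything here is an assumption made for illustration: the `Tracer` type, its fields, and the constructor are hypothetical, a Go map stands in for the chosen_events BPF map, and the lower bound of 1 on the threshold is an addition that the proposal does not specify.

```go
package main

import "sort"

// defaultMinPriority is the starting threshold from the proposal.
const defaultMinPriority = 5

// Tracer is a hypothetical holder for the load-shedding state.
type Tracer struct {
	priorities   map[string]int  // event name -> priority, 1 (most important) to 5
	chosenEvents map[string]bool // user-space stand-in for the chosen_events map
	minPriority  int             // current minimal priority (threshold)
}

// NewTracer marks every event that has a priority as chosen.
func NewTracer(priorities map[string]int) *Tracer {
	chosen := make(map[string]bool, len(priorities))
	for name := range priorities {
		chosen[name] = true
	}
	return &Tracer{
		priorities:   priorities,
		chosenEvents: chosen,
		minPriority:  defaultMinPriority,
	}
}

// DropLoad removes every chosen event whose priority is greater than the
// current minimal priority, then lowers the minimal priority by one.
// It returns the sorted names of the events removed by this call.
func (t *Tracer) DropLoad() []string {
	var dropped []string
	for name := range t.chosenEvents {
		if t.priorities[name] > t.minPriority {
			delete(t.chosenEvents, name) // deleting during range is safe in Go
			dropped = append(dropped, name)
		}
	}
	// Assumption: never go below 1, so priority-1 events are never dropped.
	if t.minPriority > 1 {
		t.minPriority--
	}
	sort.Strings(dropped)
	return dropped
}
```

Read literally, the first call with the default threshold of 5 removes nothing when priorities range from 1 to 5, because no priority is greater than 5; it only lowers the threshold to 4. The second call is the first one that actually drops events (those with priority 5).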

WDYT? @itaysk

itaysk commented 2 years ago

SGTM, a couple of suggestions:

We need to be able to keep track of which events the user chose (which is what chosen events was originally meant to do) in addition to which events we actually trace (this may change due to implicit events or, now, overload). We should always be able to refer back to what the user originally asked for.

The API, IMO, should take a target threshold instead of a decrement. I'd suggest SetPriorityThreshold(int).
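Both suggestions can be combined in one small sketch. Only the name `SetPriorityThreshold(int)` comes from the suggestion above; the `Event` and `EventSelector` types, the other methods, and the 1-to-5 scale with a default of 5 are assumptions for illustration. The set of events the user requested is never modified, and the set actually traced is derived from it on demand, so raising the threshold again restores events that were shed earlier.

```go
package main

import "sort"

// Event is a hypothetical event definition with a priority (1 = most important).
type Event struct {
	Name     string
	Priority int
}

// EventSelector keeps what the user requested separate from what is traced.
type EventSelector struct {
	requested []Event // never modified by load shedding
	threshold int     // events with Priority > threshold are not traced
}

// NewEventSelector copies the requested events and starts at threshold 5.
func NewEventSelector(requested []Event) *EventSelector {
	return &EventSelector{
		requested: append([]Event(nil), requested...),
		threshold: 5,
	}
}

// SetPriorityThreshold sets the target threshold directly.
func (s *EventSelector) SetPriorityThreshold(threshold int) {
	s.threshold = threshold
}

// Requested returns the sorted names of all events the user asked for.
func (s *EventSelector) Requested() []string {
	return s.names(func(Event) bool { return true })
}

// Active returns the sorted names of the events traced at the current threshold.
func (s *EventSelector) Active() []string {
	return s.names(func(ev Event) bool { return ev.Priority <= s.threshold })
}

// names collects and sorts the names of the events accepted by keep.
func (s *EventSelector) names(keep func(Event) bool) []string {
	var out []string
	for _, ev := range s.requested {
		if keep(ev) {
			out = append(out, ev.Name)
		}
	}
	sort.Strings(out)
	return out
}
```

Taking a target value makes the call idempotent: calling `SetPriorityThreshold(3)` twice has the same effect as calling it once, which is not true of a decrement.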

yanivagman commented 2 years ago

We need to be able to keep track of which events the user chose (which is what chosen events was originally meant to do) in addition to which events we actually trace (this may change due to implicit events or, now, overload). We should always be able to refer back to what the user originally asked for.

This is true, but remember that we already keep track of which events the user chose in t.eventsToTrace in userspace. The chosen_events bpf map was indeed equal to this userspace map (for entries with value set to true), but its intention was to avoid sending irrelevant events to userspace. So the bpf code does not actually need to know which events were chosen by the user, only which events must be submitted to the perf buffer. We might therefore rename this bpf map to something like events_to_submit, which would make its purpose clear.

rafaeldtinoco commented 2 years ago

@NDStrahilevitz this is one issue you should keep track of (for the major 'filtering improvement' effort you're handling).