aquasecurity / tracee

Linux Runtime Security and Forensics using eBPF
https://aquasecurity.github.io/tracee/latest
Apache License 2.0

Verify behavior under high event loss #2721

Closed: rafaeldtinoco closed this issue 1 year ago

rafaeldtinoco commented 1 year ago

Description

Tracee takes too long to "recover" from a large amount of lost events. If tracee is running without losing events and there is, for example, a 1-minute peak of high event throughput that makes the lost-events count rise, tracee takes a very long time to get back to its previous processing rate.

Example:

$ sudo rm -rf /tmp/tracee/out/pcap && sudo ./dist/tracee --pprof --metrics --log debug --install-path /tmp/tracee --output option:parse-arguments --capabilities bypass=false --output none --trace event=net_packet_ipv4 

When tracing for net_packet_ipv4 (which might cause a very high throughput of events) in a 10Gbit test, tracee is capable of sustaining ~50k events/sec with no event loss:

[screenshot]

And USER CPU consumption sits at ~1.5 cores (out of 24):

[screenshot]

But as soon as a second 10Gbit test is started, lots of events start to be lost:

[screenshot]

And tracee's recovery behavior is odd (we should check whether it can be made to recover faster). Yellow is lost events and green is total events:

[graph]

And CPU consumption rises while events are being lost:

[screenshot]

Output of tracee -v:

Tracee version "v0.11.0-38-gf467b61e"

Output of uname -a:

Linux fujitsu 5.19.0-29-generic #30-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 4 12:14:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional details

This is a pre-release test and I wanted to document for further investigation.

rafaeldtinoco commented 1 year ago

@NDStrahilevitz FYI

NDStrahilevitz commented 1 year ago

Possibly naive idea: by tracking the event loss rate we could flush the event buffer once it hits some threshold. This should mitigate the event loss if the cause is processing overload (which might be the case here IMO).
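A minimal sketch of that idea in Go; everything here is hypothetical (the lostCount/flushPipeline callbacks, the threshold value, the package name), not tracee's actual API:

```go
// Package lossflush is a hypothetical sketch of the idea above: sample the
// lost-events counter periodically and flush the pipeline once the loss rate
// crosses a threshold. Names and values are placeholders, not tracee code.
package lossflush

import "time"

const lossRateThreshold = 10_000 // lost events per sample interval (arbitrary)

// WatchLoss polls lostCount once per interval and calls flushPipeline when
// the number of events lost since the last sample exceeds the threshold.
func WatchLoss(lostCount func() uint64, flushPipeline func(), interval time.Duration) {
	var lastSeen uint64
	for range time.Tick(interval) {
		current := lostCount()
		if current-lastSeen > lossRateThreshold {
			flushPipeline() // drop buffered work so consumption can catch up
		}
		lastSeen = current
	}
}
```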

rafaeldtinoco commented 1 year ago

> Possibly naive idea: by tracking the event loss rate we could flush the event buffer once it hits some threshold. This should mitigate the event loss if the cause is processing overload (which might be the case here IMO).

Yep, I thought of something similar: some kind of threshold for flushes.

Thinking about the page cache and the dirty-ratio algorithm in the kernel: it has two triggers, a percentage of the page cache being dirty (to start flushing in the background, and, when it gets too high, to even make tasks do the I/O themselves on behalf of the kernel) AND the time elapsed since the last flush (it flushes after X milliseconds).

We could use a similar approach: watermarks that make us stop receiving events from the kernel until we have drained the pipeline enough, AND a higher one at which we flush everything and start over. Or something along those lines, even if only as an initial approach to the issue.
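Rough sketch of those two watermarks, loosely mirroring the dirty-ratio behavior described above (the Pipeline interface and the ratio values are assumptions for illustration; nothing here exists in tracee today):

```go
// Package watermark sketches a two-watermark control loop, loosely modeled on
// the kernel's dirty-ratio behavior. The Pipeline interface and the watermark
// values are hypothetical, not tracee's actual code.
package watermark

import "time"

// Pipeline abstracts the userland event pipeline we want to protect.
type Pipeline interface {
	FillRatio() float64 // how full the pipeline is, 0.0 to 1.0
	PauseKernelReads()  // stop polling the kernel buffers for a while
	ResumeKernelReads() // resume normal polling
	FlushAll()          // drop everything buffered and start over
}

const (
	lowWatermark  = 0.75 // start throttling reads from the kernel
	highWatermark = 0.95 // give up and flush the whole pipeline
)

// Control polls the pipeline fill level and applies the watermark policy.
func Control(p Pipeline, interval time.Duration) {
	for range time.Tick(interval) {
		switch ratio := p.FillRatio(); {
		case ratio >= highWatermark:
			p.FlushAll()
		case ratio >= lowWatermark:
			p.PauseKernelReads()
		default:
			p.ResumeKernelReads()
		}
	}
}
```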

NDStrahilevitz commented 1 year ago

I'm thinking of some struct that receives our counter.Counter, observes it, and internally triggers behavior based on counter watermarks (we could actually use a generic Read[T numeric]() T interface instead of just the counter).

That would be a rather generic solution we could maybe reuse on other counters for different logic (tracking event rate, lost writes, logs, etc.). Perhaps we should even move the loss-notification logic there, instead of reporting on each loss.
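A possible shape for that observer using Go generics (the numeric constraint, Reader and Observer names are hypothetical; tracee's real counter type lives in its counter package, and the wiring below is only a sketch of the idea):

```go
// Package observe sketches the generic "watermark observer" idea: wrap
// anything we can read a numeric value from and trigger behavior when it
// crosses a watermark. All names here are hypothetical, not tracee's API.
package observe

import "time"

// numeric is the constraint hinted at by Read[T numeric]() T above.
type numeric interface {
	~int | ~int64 | ~uint32 | ~uint64 | ~float64
}

// Reader is anything we can sample a value from: a lost-events counter,
// an event-rate counter, a lost-writes counter, a log counter, etc.
type Reader[T numeric] interface {
	Read() T
}

// Observer samples a Reader periodically and fires a callback whenever the
// sampled value crosses the configured watermark.
type Observer[T numeric] struct {
	Source    Reader[T]
	Watermark T
	OnCross   func(value T) // e.g. flush the pipeline, emit a single log, ...
}

// Run blocks, sampling the source once per interval.
func (o *Observer[T]) Run(interval time.Duration) {
	for range time.Tick(interval) {
		if v := o.Source.Read(); v >= o.Watermark {
			o.OnCross(v)
		}
	}
}
```

An Observer over the lost-events counter with OnCross set to a pipeline flush would give the threshold behavior above; another instance over an event-rate counter could drive different logic, including a single consolidated loss notification instead of one per loss.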

AlonZivony commented 1 year ago

> And tracee's recovery behavior is odd (we should check whether it can be made to recover faster). Yellow is lost events and green is total events:

[graph]

When I look at this graph it makes a lot of sense to me. Why isn't this expected? We can see that the amount of generated events stays almost the same, but we lose many events. It seems we simply passed the peak rate Tracee can handle in this environment, which is probably around 50k events/sec. We can expect that while losing events, Tracee is still spending resources on processing the lost events in the kernel as well as on tracking them, so the reduction in passed events is logical. How would flushing events help in this case? From what I understand from the graph, the amount of events after the burst is still too high to avoid losing events, so flushing won't solve it here. If something in my logic is missing please correct me :)

rafaeldtinoco commented 1 year ago

> We can expect that while losing events, Tracee is still spending resources on processing the lost events in the kernel as well as on tracking them, so the reduction in passed events is logical.

Hmm, giving this a second thought, I believe you might be right. To check whether tracee was thrashing because of a full pipeline, I would have had to stop generating the same rate of events (cut it in half after the buffer is full) and THEN check whether the lost events decrease (tending to zero).

Yep, I believe that is correct. Closing the issue for now. Thanks @AlonZivony for the comment and for making us revisit this. Cheers!

NDStrahilevitz commented 1 year ago

> We can expect that while losing events, Tracee is still spending resources on processing the lost events in the kernel as well as on tracking them, so the reduction in passed events is logical.
>
> Hmm, giving this a second thought, I believe you might be right. To check whether tracee was thrashing because of a full pipeline, I would have had to stop generating the same rate of events (cut it in half after the buffer is full) and THEN check whether the lost events decrease (tending to zero).
>
> Yep, I believe that is correct. Closing the issue for now. Thanks @AlonZivony for the comment and for making us revisit this. Cheers!

But lost events are triggered not by the event rate but by the buffer filling up. The green graph ~~would be the derivative of the buffer filling~~ is our event consumption rate (we actually don't know what our generation rate is; we could add a BPF map to track it). The buffer starts at 0 and once it reaches X (which we set) we start dropping events, which hurts both lost-event processing and event consumption, lowering the maximum generation rate we can sustain. If we flush the buffer we reset it back to 0, so the same generation rate would lead us to the same scenario after the same amount of time, but we would not be losing events for a while, and if the rate does drop we would get back to sustaining it faster.
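To put a rough number on that "not losing events for a while" window (back-of-the-envelope only; $C$, $g$ and $c$ are just symbols for buffer capacity, generation rate and consumption rate, not measured values):

$$T_{\text{quiet}} = \frac{C}{g - c}, \qquad g > c$$

After a flush the buffer is empty again, so for roughly $T_{\text{quiet}}$ nothing is dropped; if the burst ends inside that window we never drop again, otherwise we are back in the same spot after $T_{\text{quiet}}$.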

rafaeldtinoco commented 1 year ago

> If we flush the buffer we reset it back to 0, so the same generation rate would lead us to the same scenario after the same amount of time, but we would not be losing events for a while, and if the rate does drop we would get back to sustaining it faster.

Wouldn't we just be changing something like:

[sketch: lost events staying roughly constant]

To something like:

[sketch: lost events dropping and rising again in a sawtooth of triangles]

I'm afraid we would have to integrate those functions to see which one processes more events as x approaches infinity. =oD
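Half-joking, but the comparison can be written down. If $p(t)$ is the processed-events rate under each policy (just symbols for illustration, not measured quantities), the thing to compare is

$$E(T) = \int_0^T p(t)\,dt$$

for large $T$: whether a flat plateau with constant loss or a flush-induced sawtooth delivers more processed events in the long run.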

AlonZivony commented 1 year ago

> We can expect that while losing events, Tracee is still spending resources on processing the lost events in the kernel as well as on tracking them, so the reduction in passed events is logical.
>
> Hmm, giving this a second thought, I believe you might be right. To check whether tracee was thrashing because of a full pipeline, I would have had to stop generating the same rate of events (cut it in half after the buffer is full) and THEN check whether the lost events decrease (tending to zero). Yep, I believe that is correct. Closing the issue for now. Thanks @AlonZivony for the comment and for making us revisit this. Cheers!
>
> But lost events are triggered not by the event rate but by the buffer filling up. The green graph ~~would be the derivative of the buffer filling~~ is our event consumption rate (we actually don't know what our generation rate is; we could add a BPF map to track it). The buffer starts at 0 and once it reaches X (which we set) we start dropping events, which hurts both lost-event processing and event consumption, lowering the maximum generation rate we can sustain. If we flush the buffer we reset it back to 0, so the same generation rate would lead us to the same scenario after the same amount of time, but we would not be losing events for a while, and if the rate does drop we would get back to sustaining it faster.

I am not sure this will help. The lost events probably add some resource consumption, but I believe it should be minor. More likely, more of the machine's resources were dedicated to the kernel generating more events, which reduced the maximum event consumption because fewer resources were free. You are right that if the lost events do increase the load by a major factor, it might help. In that case, @rafaeldtinoco, I think that if event generation stays faster than consumption, there shouldn't be a decrease in event consumption between the peaks. But this helps only if processing lost events really hurts performance to a major degree.

rafaeldtinoco commented 1 year ago

> In that case, @rafaeldtinoco, I think that if event generation stays faster than consumption, there shouldn't be a decrease in event consumption between the peaks. But this helps only if processing lost events really hurts performance to a major degree.

Agreed. That wasn't the case here: the machine had spare resources and the perf buffer was simply losing events due to lack of userland consumption (the scheduled task not draining it), for X reasons such as not being parallel and scaling more, our pipeline processing, etc.

The decrease from the peaks I mentioned (the triangles) would happen because we would temporarily flush the pipeline, so lost events would drop and then rise again if the producer kept the same workload.

But, like I said, I can't say that is better than keeping a constant rate of lost events and doing nothing... hard to tell.