falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

sinsp_suppress m_cache optimization is not working #1724

Closed albe19029 closed 6 months ago

albe19029 commented 7 months ago

Good day. I have found the following issue in the new sinsp_suppress class.

Starting from this commit (https://github.com/falcosecurity/libs/commit/1054418f33d027a1525ba0d8bc2a60be1ecd3afe), the suppression logic was moved from libscap to libsinsp.

After that, the cache logic was also moved in https://github.com/falcosecurity/libs/commit/6a9a0ec14d647ad3d73b65f9eb99a33fa1a14eb3

And regarding the old comments:

// The cache around the m_suppressed_tids hash table. Why?
// In benchmarks as of June 2023, scap_check_suppressed() with
// suppressed TIDs increases speed from 17.51s to 7.24s. Why?
// Cache lookup is a single cache line fetch with no hashing.
// Avoid dynamic allocation to avoid extra cache line fetch.
scap_tid_stid m_devid_tid_stid_cache[SCAP_CACHE_DEVID_MAX];

This cache is very important. But in the new C++ code, the only place I see cache_slot being set is here: cache_slot(devid) = 0;

Did you forget to add this code in the is_suppressed_tid method? As I understand the logic:

return m_suppressed_tids.find(tid) != m_suppressed_tids.end();

should be changed to:

bool result = m_suppressed_tids.find(tid) != m_suppressed_tids.end();
if (result) {
  cache_slot(devid) = tid;
}
return result;

Did you run a performance test to check that m_cache is working?

Andreagit97 commented 7 months ago

Hey @albe19029, thank you for reporting! @gnosek, could you take a look, please?

gnosek commented 7 months ago

Thanks for your report @albe19029! I'll take a closer look now, but please note that the comment referred to the old (C-based) hash table implementation which is generally atrocious :) I don't think I'll be able to repro the exact perf tests but I'll try to get some numbers either way.

gnosek commented 7 months ago

TL;DR: the cache in sinsp_suppress does nothing, but it's still many times faster than the scap suppress cache.

@albe19029, here are some results from my tests of the absolute worst case for tid suppression (10M suppressed tids, none of which match). This is probably also a bad case for the cache implementation since the events aren't coming through multiple devices, so we effectively only have a single cache slot.

hyperfine -L binary $(find -name sinsp-example-\* | tr \\n , | sed 's@,$@@') '{binary} -r -s ./scap_files/kexec_x86.scap > /dev/null'
Benchmark 1: ./libsinsp/examples/sinsp-example-suppress10m -r -s ./scap_files/kexec_x86.scap > /dev/null
  Time (mean ± σ):     792.2 ms ±  29.4 ms    [User: 546.5 ms, System: 245.5 ms]
  Range (min … max):   763.5 ms … 854.1 ms    10 runs

Benchmark 2: ./libsinsp/examples/sinsp-example-suppress10m-cache -r -s ./scap_files/kexec_x86.scap > /dev/null
  Time (mean ± σ):     776.1 ms ±  10.9 ms    [User: 537.8 ms, System: 237.8 ms]
  Range (min … max):   753.7 ms … 792.6 ms    10 runs

Benchmark 3: ./libsinsp/examples/sinsp-example-vanilla -r -s ./scap_files/kexec_x86.scap > /dev/null
  Time (mean ± σ):     296.9 ms ±   4.3 ms    [User: 261.0 ms, System: 35.7 ms]
  Range (min … max):   291.8 ms … 303.1 ms    10 runs

Benchmark 4: ./libsinsp/examples/sinsp-example-suppress10m-scap-cache -r -s ./scap_files/kexec_x86.scap > /dev/null
  Time (mean ± σ):      8.968 s ±  0.077 s    [User: 8.520 s, System: 0.447 s]
  Range (min … max):    8.864 s …  9.131 s    10 runs

Benchmark 5: ./libsinsp/examples/sinsp-example-suppress10m-scap-nocache -r -s ./scap_files/kexec_x86.scap > /dev/null
  Time (mean ± σ):      3.543 s ±  0.053 s    [User: 3.097 s, System: 0.446 s]
  Range (min … max):    3.461 s …  3.627 s    10 runs

Summary
  ./libsinsp/examples/sinsp-example-vanilla -r -s ./scap_files/kexec_x86.scap > /dev/null ran
    2.61 ± 0.05 times faster than ./libsinsp/examples/sinsp-example-suppress10m-cache -r -s ./scap_files/kexec_x86.scap > /dev/null
    2.67 ± 0.11 times faster than ./libsinsp/examples/sinsp-example-suppress10m -r -s ./scap_files/kexec_x86.scap > /dev/null
   11.93 ± 0.25 times faster than ./libsinsp/examples/sinsp-example-suppress10m-scap-nocache -r -s ./scap_files/kexec_x86.scap > /dev/null
   30.21 ± 0.51 times faster than ./libsinsp/examples/sinsp-example-suppress10m-scap-cache -r -s ./scap_files/kexec_x86.scap > /dev/null

The binaries under test are:

Patch for testing suppressed tid performance:

diff --git a/userspace/libsinsp/examples/test.cpp b/userspace/libsinsp/examples/test.cpp
index 69775ab37..7ef816e1c 100644
--- a/userspace/libsinsp/examples/test.cpp
+++ b/userspace/libsinsp/examples/test.cpp
@@ -376,8 +376,16 @@ int main(int argc, char** argv)

    open_engine(inspector, events_sc_codes);

+   std::cout << "-- Suppressing 10M tids" << std::endl;
+
+   for(uint64_t i = 0; i < 10'000'000; i++)
+   {
+       inspector.suppress_events_tid(i + 1000000ul);
+   }
+
    std::cout << "-- Start capture" << std::endl;

+
    inspector.start_capture();

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

In this particular run, the sinsp cache is minimally faster but this looks like measurement noise (I have also observed runs where the cache was minimally slower).

@Andreagit97: we can either commit @albe19029's patch (more or less, see below) or drop the cache support completely and simplify the code a little bit. Thoughts?

This is the patch that I applied:

diff --git a/userspace/libsinsp/sinsp_suppress.cpp b/userspace/libsinsp/sinsp_suppress.cpp
index e21d9c78b..e74b21234 100644
--- a/userspace/libsinsp/sinsp_suppress.cpp
+++ b/userspace/libsinsp/sinsp_suppress.cpp
@@ -149,7 +149,12 @@ bool libsinsp::sinsp_suppress::is_suppressed_tid(uint64_t tid, uint16_t devid) c
    {
        return true;
    }
-   return m_suppressed_tids.find(tid) != m_suppressed_tids.end();
+   bool result = m_suppressed_tids.find(tid) != m_suppressed_tids.end();
+   if(result)
+   {
+       m_cache[devid % CACHE_SIZE] = tid;
+   }
+   return result;
 }

 uint64_t& libsinsp::sinsp_suppress::cache_slot(uint16_t devid)
diff --git a/userspace/libsinsp/sinsp_suppress.h b/userspace/libsinsp/sinsp_suppress.h
index 6f1556add..ea66cf159 100644
--- a/userspace/libsinsp/sinsp_suppress.h
+++ b/userspace/libsinsp/sinsp_suppress.h
@@ -56,7 +56,7 @@ protected:
    uint64_t m_num_suppressed_events = 0;

    static constexpr size_t CACHE_SIZE = 1024;
-   uint64_t m_cache[CACHE_SIZE] {};
+   mutable uint64_t m_cache[CACHE_SIZE] {};
 };

 }
Andreagit97 commented 7 months ago

Thanks for the quick feedback! If there is almost no gain with the cache, I suggest dropping it to simplify the code...but let's see what we all think about this

albe19029 commented 7 months ago

1) Is it possible to also build sinsp-example-suppress10m-scap-cache and sinsp-example-suppress10m-scap-nocache without the test patch? Just to compare them at maximum speed when there are no tids to suppress. It is unclear to me what makes them so slow compared to the modern implementation: is it the hash table (std::unordered_set vs uthash), or other logic unrelated to this question?

2) The scap-nocache vs scap-cache results are very strange: the cached version is much slower. In the scap-cache version, is it possible to just comment out m_devid_tid_stid_cache and run again, to be sure we are working on the same code base? The result really looks very strange.

gnosek commented 7 months ago
  1. Yeah, I can give it a try later today. The difference comes from the hash table implementation: uthash requires an extra allocation for each entry and the associated pointer chasing.

  2. This is testing the worst case for caching (IIRC we do not cache negative entries, so we maintain the cache in addition to checking the full set anyway) but yeah, I'm surprised at how bad it fared too. Since I'll have to rebuild the binaries for (1) anyway, I'll try with some less brutal test case too (e.g. suppress a smaller set of tids in the actually used range). Note these benchmarks won't be directly comparable to the ones I have posted, since they'll skip processing the events, so the baseline overhead won't be there.

Can you post the patch to scap-cache you have in mind? Just commenting out m_devid_tid_stid_cache is effectively reverting the scap-cache patches. Do you expect an unrelated change in libscap made this difference? I can try with a build of scap-nocache + just the cache patches (i.e. just after adding them instead of just before removing them again)

cc @simonhf FYI

albe19029 commented 7 months ago

I just mean that the sinsp::next implementation has changed a lot, and it is hard to compare the master branch with the old one. But we are comparing the suppression logic, and for master I can now isolate its cost as: (sinsp-example-suppress10m-cache − sinsp-example-vanilla) and (sinsp-example-suppress10m − sinsp-example-vanilla).

I need the same suppression time from before the code was moved to sinsp. And it is better if the cached and non-cached versions are on the same code base (whether patched with m_devid_tid_stid_cache or with it commented out doesn't matter), just the same version, to be sure no other code-base changes contribute to the time difference.

albe19029 commented 7 months ago

As I understand it, the cache patch optimizes the following situation:

  1. A program is assigned to a CPU and runs for a period of time until the Linux scheduler assigns the next process to that CPU. For that period, every event from that CPU comes from the same tid, and the same holds for the other CPUs. So it makes no sense to search the hash table, as the tid will be the same.

  2. As I understand it, the kexec_x86.scap file behaves the same way: there will be runs of events from the same tid, and the checks will hit the cache, not the hash table.

Do I understand correctly?

gnosek commented 7 months ago
  1. Yes
  2. Yes, with an important difference: when doing a live capture, the events will come from different CPUs in parallel (more or less) but in a capture file, they're all read sequentially.

Consider a system with two CPUs and two syscall-heavy threads, one per CPU. In a live capture, we'll have a stream of events from tid 1 on cpu 1 (happily served by the tid cache) and a similar stream for tid 2 on cpu 2 (again, served by the tid cache).

In a scap file, these events will get interleaved and since we don't have multiple CPUs (real or abstract), we'll only have a single tid cache slot that will keep bouncing between the two tids (as I was typing this, I realized this might be wrong, since we do store the devid in the scap file, will recheck).

albe19029 commented 7 months ago

It would be good to recheck whether the savefile restores events with CPU information. Hopefully yes. But if not, maybe it is possible to save a single-threaded stream to the kexec_x86.scap file to be more sure?

albe19029 commented 7 months ago

As I can see from scap_savefile.c, the CPU is being set:

master // EVF_BLOCK_TYPE has 32 bits of flags // pdevid = (uint16_t *)handle->m_reader_evt_buf;

6.0.1 branch // EVF_BLOCK_TYPE has 32 bits of flags // pcpuid = (uint16_t *)handle->m_reader_evt_buf;

gnosek commented 6 months ago

@albe19029, @Andreagit97

TL;DR the sinsp cache still does nothing and the results seem reproducible

Okay, I ran some more tests. Here are the results on a linear graph (for all graphs, the X axis is the number of suppressed tids, the Y axis is time in seconds to process the sample capture file).

All the lines have commit ids in their labels and come in chronological order, oldest on top, though (spoiler alert) there isn't much variation between different commits with the same implementation, i.e. the other libs changes are mostly irrelevant for this benchmark.

I ran all the binaries while suppressing tids 1 to N (the X-axis value). Note that this eventually matches all threads in the capture (except for tid 0, which appears in context switch events, which -- as it happens -- make up the majority of the capture).

image

The results come in three groups:

Switching to a log-log graph, we see mostly the same story in more detail. I don't really have an explanation of why 6a9a0ec14 would be so much faster and then 5dc692e30 would slow down to exactly match 1054418f3. I'm willing to attribute this to some fluke (the stars -- or jump targets or whatever -- aligned just right for this binary).

We also see a dip in the runtime at the point where the suppressed tids actually cause us to skip events.

image

One last view, zooming in on the more realistic range. Until 10k suppressed threads, the performance of all approaches is mostly the same, but starts to diverge at 100k, where the uthash implementation starts to struggle and goes off the chart (literally, the Y axis is linear now and capped at 500ms) at about 1M tids:

image

albe19029 commented 6 months ago

It is hard for me to see details in the images. But as I understand it, we don't need the cache anymore?

gnosek commented 6 months ago

Yes, with or without the cache, the implementation in sinsp has identical performance, much better than the uthash-based scap one (though it's hard to see in sane scenarios where you don't suppress millions of tids).

I'll do one final comparison where I'll revert the (broken) sinsp cache patch on top of current master.

Andreagit97 commented 6 months ago

I'll do one final comparison where I'll revert the (broken) sinsp cache patch on top of current master.

Thank you @gnosek for the deep analysis! Yes, according to what you reported this seems the right move!

gnosek commented 6 months ago

A graph with the values themselves was pointless here, just three lines following each other ~exactly.

This is the graph of relative speedup/slowdown (test case time / falco/master time) for the two implementations: fixed-cache is your patch, no-cache is reverting the (broken) cache implementation. The results are close enough that I'd call them measurement noise. The fixed cache does have an advantage with (unrealistically) huge numbers of suppressed tids, but is mildly slower in the midrange (100-100k suppressed tids):

image

I'll submit a revert of the cache as a PR. @albe19029, do you want to be credited in any specific way? I only have your username here.

albe19029 commented 6 months ago

"Credited in any specific way" - what is this about? I got all the information that I needed, so I think my question is resolved. Thanks a lot for your work.