For (P3), I see the following feasible solutions:
As such, I think the way to go is to implement a POC of all three and build a common benchmark to evaluate which option best suits our use case.
Since we didn't have a reliable benchmark to stress the async extraction optimization, I worked on https://github.com/falcosecurity/plugin-sdk-go/pull/60, which just got merged. Now we have what we need to evaluate the performance of the three options above.
Given the above, I worked on POC branches that implement all three solutions above:
Then I ran the benchmarks/async benchmark on each branch with a different number of threads. My hardware setup was:
# sudo lshw -short
H/W path Device Class Description
====================================================
system VirtualBox
/0 bus VirtualBox
/0/0 memory 128KiB BIOS
/0/1 memory 16GiB System memory
/0/2 processor Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
...
CPU specs: Total Cores 6, Total Threads 12, Max Turbo Frequency 4.50 GHz, Processor Base Frequency 2.60 GHz, Cache 12 MB. The VM was assigned 8 cores.
The benchmark consisted of running this command, with go version go1.17.5 linux/amd64:
benchmarks/async/build/bench -n 1000000 -p <nconsumers> [-a]
Observations:
- htop showed 100% CPU usage on every CPU involved (as expected)
- in P3-S2, the single async worker synchronizes in round-robin with all the locks allowed by our cgo.Handle implementation. This means that we had a fixed number of 32 locks, which justifies the higher baseline value even with 1 thread. This can definitely be optimized, however I don't expect the thread-scalability curve to change by much due to the natural technical constraints of the solution.

I find it quite hard to choose which of the three solutions is the best one, as each of them has tough cost/benefit properties.
As such, before choosing I would like to experiment with a fourth solution that is a hybrid of (P3-S2) and (P3-S3): N consumers, N shared locks, and M async workers, with no correlation between N and M. The effectiveness of such a solution would of course rely on the right choice of M, which should ideally always be < N and should not reach the number of cores available to the Go runtime. In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2. I will work on a POC for this too and post the results here.
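To make the idea more concrete, here is a minimal sketch of the P3-S4 model, with hypothetical names and a simplified handshake protocol (this is not the SDK's actual code): N per-consumer slots, statically partitioned across M busy-looping workers, with M independent from N.

```go
package main

import (
	"runtime"
	"sync/atomic"
)

// Possible states of a per-consumer slot.
const (
	stateIdle int32 = iota // no request pending
	stateReq               // consumer has published a request
	stateDone              // worker has completed the request
)

// slot holds the request of a single consumer. In the real SDK this would be
// the shared state used by the async extraction protocol; here it just
// carries a closure for illustration.
type slot struct {
	state int32
	work  func()
}

// startWorkers spawns m async workers. Slots are statically partitioned by
// index (worker w serves slots w, w+m, w+2m, ...), so each slot has a single
// owner and a plain atomic load/store handshake is enough on the worker side.
func startWorkers(slots []*slot, m int) {
	for w := 0; w < m; w++ {
		go func(w int) {
			for {
				for i := w; i < len(slots); i += m {
					s := slots[i]
					if atomic.LoadInt32(&s.state) == stateReq {
						s.work() // run the pending extraction request
						atomic.StoreInt32(&s.state, stateDone)
					}
				}
			}
		}(w)
	}
}

// submit publishes work on a slot and spin-waits for its completion,
// mimicking what a consumer thread would do in plugin_extract_fields.
func submit(s *slot, work func()) {
	s.work = work
	atomic.StoreInt32(&s.state, stateReq)
	for atomic.LoadInt32(&s.state) != stateDone {
		runtime.Gosched()
	}
	atomic.StoreInt32(&s.state, stateIdle)
}

func main() {
	const nConsumers = 8
	slots := make([]*slot, nConsumers)
	for i := range slots {
		slots[i] = &slot{}
	}
	m := runtime.NumCPU() / 2 // the suspected ideal default discussed above
	if m < 1 {
		m = 1
	}
	startWorkers(slots, m)

	done := make(chan struct{})
	submit(slots[0], func() { close(done) })
	<-done
}
```

Since each slot is owned by exactly one worker, no coordination is needed between workers, and M becomes a pure tuning knob.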
> In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2
If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.
Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? With arm64 support, I feel like we have to consider that now. Certain workloads might make (P3-S3)-style solutions (M async consumers) generate too much scheduler overhead for constrained arm64 users.
I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?
@dwindsor, all good points.
> If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.
I think we could make this configurable at runtime, maybe. I'm not sure what the best way to customize this would be. Maybe making M depend on GOMAXPROCS instead of runtime.NumCPU() would be a good way.
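For illustration, a minimal sketch of that idea (purely hypothetical, not an SDK API): derive M from the parallelism actually granted to the Go scheduler rather than from the raw CPU count.

```go
package main

import (
	"fmt"
	"runtime"
)

// asyncWorkers returns a tentative value for M based on the parallelism
// available to the Go runtime. GOMAXPROCS(0) only reads the current
// setting without changing it.
func asyncWorkers() int {
	m := runtime.GOMAXPROCS(0) / 2
	if m < 1 {
		m = 1
	}
	return m
}

func main() {
	fmt.Println("async workers:", asyncWorkers())
}
```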
> Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? ...
That's a good question. One solution would be to stick to only one or a few async workers in that case. At the same time, we could consider disabling the async optimization by default when we detect constrained resources. We sort of do that already (async is enabled only if we have 2+ CPUs), so making this check more intelligent might be something to dig into.
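As a rough sketch of what a "more intelligent" default could look like (the function name and structure are hypothetical, not the SDK's actual code):

```go
package main

import (
	"fmt"
	"runtime"
)

// shouldEnableAsync mirrors the kind of check discussed above: keep the
// async extraction optimization off on very small machines.
func shouldEnableAsync() bool {
	// Current behavior: async only makes sense with at least 2 CPUs.
	if runtime.NumCPU() < 2 {
		return false
	}
	// A smarter check (e.g. accounting for cgroup CPU quotas or small
	// arm64 boards) could be plugged in here.
	return true
}

func main() {
	fmt.Println("async extraction enabled by default:", shouldEnableAsync())
}
```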
> I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?
I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources, since that was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases in which 5+ event sources might be active at the same time. That would also impact CPU time even with the async optimization out of the equation.
Ok, did some homework and re-ran the benchmark documented in https://github.com/falcosecurity/plugin-sdk-go/issues/62#issuecomment-1184431757, with the same hardware setup.
I prepared two novel implementations:
- P3-S2-optimized: in the previous POC, the single async worker looped over all the handle slots supported by the SDK (cgo.MaxHandle). This caused the 1 worker to synchronize in round-robin with 32 locks, whereas the benchmarks used only up to 8 consumers. Now, the optimized version keeps track of how many plugins have been initialized, and the worker loops only on the locks actually used at runtime. This caused a huge boost in performance (a minimal sketch of the idea follows below).
- P3-S4: the hybrid solution described above, with N consumers, N shared locks, and M async workers.

The benchmark has been run with these two new implementations, with P3-S4 using an arbitrary value of M = 3. The chart no longer plots P3-S1 and P3-S2, which diverged by an order of magnitude and just caused visual noise:
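As a side note on the P3-S2-optimized item in the list above, here is a minimal, self-contained illustration of the bounded round-robin idea (the names are hypothetical; the real SDK tracks this through its cgo.Handle-based handle management):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const maxHandles = 32 // mirrors the 32-handle cap mentioned above

// handlesInUse tracks how many handle slots have actually been handed out,
// so the worker only spins over [0, handlesInUse) instead of all 32 slots.
var handlesInUse int32

func registerPlugin() int {
	return int(atomic.AddInt32(&handlesInUse, 1)) - 1
}

// workerPass performs one round-robin pass over the slots in use and returns
// how many slots it visited (the real worker would check each slot for a
// pending extraction request here).
func workerPass() int {
	n := int(atomic.LoadInt32(&handlesInUse))
	if n > maxHandles {
		n = maxHandles
	}
	visited := 0
	for i := 0; i < n; i++ {
		visited++
	}
	return visited
}

func main() {
	for i := 0; i < 3; i++ { // e.g. 3 initialized plugins
		_ = registerPlugin()
	}
	fmt.Println("slots visited per pass:", workerPass()) // 3, not 32
}
```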
Now, it is clearly visible that the regular C -> Go calls are no match for the async optimization in any of its forms. Interestingly, P3-S2-optimized performs quite well and gets really close to P3-S3 in both performance and scaling curve. Even better, P3-S4 is super close to P3-S3 (excluding some noise), but only using 3 workers. Again, results over 6 threads are not very meaningful here, because we lose the assumption of 1 thread per physical core.
Considering the above, I got intrigued in understanding how the value of M affects the performance of the P3-S4 approach. So, I ran the benchmark with P3-S4 at values of M varying from 1 to N, and here's the result:
Interestingly, values of M > 1 seem to perform quite similarly! Also, the hypothesis that P3-S2-optimized and P3-S3 represent the lower and upper bound of this solution is confirmed.
I think P3-S4 is the most general and flexible solution here and it's probably what we are looking for. As @dwindsor pointed out, the question from now on becomes how the value of M should be determined at runtime, and how to adapt this solution depending on the underlying hardware capacity.
> Interestingly, values of M > 1 seem to perform quite similarly!
Hmm, that is interesting... I'd think that cache issues (workloads switching cores, issuing IPIs, etc) would happen here! Very cool.
P3-S4, with a default value of M=1, seems to be a good call to me. If it turns out that manipulating M on ultra-beefy systems can increase performance, we can always do it. If not, we can just leave M=1.
PS: thanks for doing this research! 🙏
> I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources since it was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases for which 5+ event sources might be active at the same time. That would also impact CPU time even with the async optimization out of the equation.
Yeah, we've been wondering ourselves what the likelihood is of failing to attach to a Falco driver because there are already 5 active consumers. IIUC, not much chance.. 5 consumers is quite a lot it seems!
Figuring out how to make more granular limits based on hardware configuration seems like it could be messy, so I think I agree with you that hard-coding a limit makes sense!
Motivation
For point (B3) of https://github.com/falcosecurity/falco/issues/2074, we will need the Go SDK to be aware of, and resilient to, concurrent access to some of its symbols. This issue tracks and documents the thought process and the developments needed to achieve this.
The Problem
The assumptions of https://github.com/falcosecurity/falco/issues/2074 imply that a given application could run multiple sinsp inspectors in parallel, each in its own thread. In this model, a given plugin registered and initialized in an inspector can't be shared across multiple inspectors. However, the same plugin dynamic library is shared as a singleton across all the inspectors of the application. This leads to the conclusion that the Plugin SDK Go must be able to support multiple consumers that:
- each own their own ss_plugin_t state
- can invoke functions that do not depend on the ss_plugin_t state (such as plugin_get_name()) from multiple concurrent threads with no restriction
- can invoke functions that depend on the ss_plugin_t state from multiple concurrent threads, as long as each thread uses a different ss_plugin_t
In the Go SDK, this maps to the following critical points:
- plugin_init, plugin_destroy, plugin_open, and plugin_close all rely on our optimized implementation of cgo.Handle, which is very performant for our use case but not thread-safe
- our cgo.Handle implementation caps the max number of handles to be used at the same time to 32
- plugin_extract_fields currently relies on a single-consumer-single-worker execution model when the async extraction optimization is enabled (e.g. extract.SetAsync(true)), which makes it not thread-safe even if different threads use different ss_plugin_t values

Solutions
plugin_next_batch and plugin_extract_fields workflows. For now, this requires no intervention on the Go SDK.