falcosecurity / plugin-sdk-go

Falco plugins SDK for Go
Apache License 2.0

[tracking] supporting concurrent consumers #62

Closed jasondellaluce closed 2 years ago

jasondellaluce commented 2 years ago

Motivation

For point (B3) of https://github.com/falcosecurity/falco/issues/2074, we will need the Go SDK to be aware of, and resilient to, concurrent access to some of its symbols. This issue tracks and documents the thought process and the development work needed to achieve this.

The Problem

The assumptions of https://github.com/falcosecurity/falco/issues/2074 imply that a given application could run multiple sinsp inspectors in parallel, each in its own thread. In this model, a given plugin registered and initialized in an inspector can't be shared across multiple inspectors. However, the same plugin dynamic library is shared as a singleton across all the inspectors of the application. This leads to the conclusion that the Go plugin SDK must be able to support multiple consumers that:

In the Go SDK, this maps to the following critical points:

Solutions

jasondellaluce commented 2 years ago

For (P3), I see the following feasible solutions:

As such, I think the way to go is to implement a POC of all three of these and build a common benchmark to evaluate which option best suits our use case.

jasondellaluce commented 2 years ago

Since we didn't have a reliable benchmark to stress the async extraction optimization, I worked on https://github.com/falcosecurity/plugin-sdk-go/pull/60, which has just been merged. Now we have what we need to evaluate the performance of the three options above.

jasondellaluce commented 2 years ago

Given the above, I worked on POC branches that implement all three of the solutions above:

I then ran the benchmarks/async benchmark on each branch with different numbers of threads. My hardware setup was:

# sudo lshw -short
H/W path      Device      Class          Description
====================================================
                          system         VirtualBox
/0                        bus            VirtualBox
/0/0                      memory         128KiB BIOS
/0/1                      memory         16GiB System memory
/0/2                      processor      Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
...

CPU specs: 6 total cores, 12 total threads, max turbo frequency 4.50 GHz, processor base frequency 2.60 GHz, 12 MB cache. The VM was assigned 8 cores.

The benchmark consisted of running this command, with go version go1.17.5 linux/amd64:

benchmarks/async/build/bench -n 1000000 -p <nconsumers> [-a]
[Benchmark results chart: Screenshot 2022-07-14 at 15 03 54]

Observations:

jasondellaluce commented 2 years ago

I find it quite hard to choose which of the three solutions is the best one, as each of them has tough cost/benefit trade-offs.

As such, before choosing I would like to experiment with a fourth solution that is a hybrid of (P3-S2) and (P3-S3): N consumers, N shared locks, and M async workers, with no correlation between N and M. The effectiveness of such a solution would of course rely on the right choice of M, which should ideally always be < N and should not reach the number of cores available to the Go runtime. In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2. I will work on a POC for this too and post the results here.
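To make the idea concrete, here is a minimal, hypothetical sketch of the N-consumers/M-workers decoupling. The names and the channel-based dispatch are illustrative only; they are not the SDK's actual atomic/lock-based protocol across the C/Go boundary.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// request is a placeholder for an extraction request issued by one consumer.
type request struct {
	consumerID int
	fieldID    int
	resCh      chan string // result channel back to the requesting consumer
}

// startWorkers spawns M async workers that serve requests from any consumer.
func startWorkers(m int, reqs <-chan request) {
	for i := 0; i < m; i++ {
		go func() {
			for r := range reqs {
				// Placeholder for the actual field extraction work.
				r.resCh <- fmt.Sprintf("consumer %d, field %d", r.consumerID, r.fieldID)
			}
		}()
	}
}

func main() {
	const nConsumers = 8
	m := runtime.NumCPU() / 2 // the M heuristic discussed above
	if m < 1 {
		m = 1
	}

	reqs := make(chan request, nConsumers)
	startWorkers(m, reqs)

	var wg sync.WaitGroup
	for c := 0; c < nConsumers; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			resCh := make(chan string, 1)
			reqs <- request{consumerID: c, fieldID: 1, resCh: resCh}
			fmt.Println(<-resCh)
		}(c)
	}
	wg.Wait()
	close(reqs)
}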

dwindsor commented 2 years ago

In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2

If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.
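For illustration, one hypothetical way such runtime throttling could look in Go is a pool whose worker count M can be resized while consumers keep submitting work. The names and the channel-based structure below are assumptions, not existing SDK API.

package asyncpool

import "sync"

// Pool is a hypothetical async-worker pool whose size M can change at runtime.
type Pool struct {
	mu    sync.Mutex
	jobs  chan func()
	stops []chan struct{} // one stop channel per running worker
}

func New(m int) *Pool {
	if m < 1 {
		m = 1
	}
	p := &Pool{jobs: make(chan func(), 64)}
	p.Resize(m)
	return p
}

// Resize grows or shrinks the number of async workers.
func (p *Pool) Resize(m int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.stops) < m { // grow: start new workers
		stop := make(chan struct{})
		p.stops = append(p.stops, stop)
		go func() {
			for {
				select {
				case <-stop:
					return
				case job := <-p.jobs:
					job()
				}
			}
		}()
	}
	for len(p.stops) > m { // shrink: stop the most recently started workers
		last := len(p.stops) - 1
		close(p.stops[last])
		p.stops = p.stops[:last]
	}
}

// Submit enqueues a job for any available worker.
func (p *Pool) Submit(job func()) { p.jobs <- job }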

dwindsor commented 2 years ago

Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? With arm64 support, I feel like we have to consider that now. Certain workloads might make (P3-S3)-style solutions (M async consumers) generate too much scheduler overhead for constrained arm64 users.

I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?

jasondellaluce commented 2 years ago

@dwindsor, all good points.

If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.

I think we could maybe make this configurable at runtime. I'm not sure what the best way to customize it would be; perhaps making M depend on GOMAXPROCS instead of runtime.NumCPU() would be a good approach.
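As a sketch of that idea (the function name and the divide-by-two heuristic are assumptions, not current SDK behavior):

package config

import "runtime"

// defaultAsyncWorkers derives M from GOMAXPROCS rather than runtime.NumCPU(),
// so that a lower GOMAXPROCS (e.g. set for a CPU-limited container) also
// lowers the number of async workers. Hypothetical helper, not SDK API.
func defaultAsyncWorkers() int {
	m := runtime.GOMAXPROCS(0) / 2 // GOMAXPROCS(0) only reads the current value
	if m < 1 {
		m = 1
	}
	return m
}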

Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? ...

That's a good question. One option would be to stick to only one or a few async workers in that case. At the same time, we could consider disabling the async optimization by default when we detect constrained resources. We sort of do that already (async is enabled only if we have 2+ CPUs), so making this check more intelligent might be something to dig into.
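For reference, the kind of check being discussed is roughly the following; the function name is hypothetical and the real SDK check may live elsewhere and be more involved.

package heuristics

import "runtime"

// asyncAvailable mirrors the "2+ CPUs" idea mentioned above: keep the async
// extraction optimization off by default on single-CPU / constrained hosts.
// A smarter check could also look at cgroup CPU quotas or GOMAXPROCS.
func asyncAvailable() bool {
	return runtime.NumCPU() > 1
}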

I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?

I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources, since that was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases in which 5+ event sources might be active at the same time. That would also impact CPU time, even with the async optimization out of the equation.

jasondellaluce commented 2 years ago

Ok, I did some homework and re-ran the benchmark documented in https://github.com/falcosecurity/plugin-sdk-go/issues/62#issuecomment-1184431757, with the same hardware setup.

I prepared two novel implementations:

The benchmark was run with these two new implementations, with P3-S4 using an arbitrary value of M = 3. The chart no longer plots P3-S1 and P3-S2, which diverged by an order of magnitude and just added visual noise:

[Benchmark results chart: Screenshot 2022-07-15 at 19 14 05]

Now it is clearly visible that regular C -> Go calls are no match for the async optimization in any of its forms. Interestingly, P3-S2-optimized performs quite well and gets really close to P3-S3 in both raw performance and scaling curve. Even better, P3-S4 comes very close to P3-S3 (excluding some noise) while using only 3 workers. Again, results over 6 threads are not very meaningful here, because we lose the assumption of 1 thread per physical core.

Considering the above, I was intrigued to understand how the value of M affects the performance of the P3-S4 approach. So I ran the benchmark with P3-S4 at values of M varying from 1 to N, and here's the result:

[Benchmark results chart, P3-S4 with varying M: Screenshot 2022-07-15 at 19 14 19]

Interestingly, values of M > 1 seem to perform quite similarly! Also, the hypothesis that P3-S2-optimized and P3-S3 represent the lower and upper bounds of this solution is confirmed.

I think P3-S4 is the most general and flexible solution here and it's probably what we are looking for. As @dwindsor pointed out, the question from now on becomes how the value of M should be determined at runtime, and how to adapt this solution depending on the underlying hardware capacity.

dwindsor commented 2 years ago

Interestingly, values of M > 1 seem to perform quite similarly!

Hmm, that is interesting... I'd have thought that cache issues (workloads switching cores, issuing IPIs, etc.) would show up here! Very cool.

P3-S4, with a default value of M=1, seems to be a good call to me. If it turns out that manipulating M on ultra-beefy systems can increase performance, we can always do it. If not, we can just leave M=1.

PS: thanks for doing this research! 🙏

dwindsor commented 2 years ago

I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources, since that was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases in which 5+ event sources might be active at the same time. That would also impact CPU time, even with the async optimization out of the equation.

Yeah, we've been wondering ourselves what the likelihood is of failing to attach to a Falco driver because there are already 5 active consumers. IIUC, not much of a chance: 5 consumers is quite a lot, it seems!

Figuring out how to make more granular limits based on hardware configuration seems like it could be messy, so I think I agree with you that hard-coding a limit makes sense!