falcosecurity / plugin-sdk-go

Falco plugins SDK for Go
Apache License 2.0

[tracking] supporting concurrent consumers #62

Closed jasondellaluce closed 2 years ago

jasondellaluce commented 2 years ago

Motivation

For point (B3) of https://github.com/falcosecurity/falco/issues/2074, we will need the Go SDK to be aware of, and resilient to, concurrent access to some of its symbols. This issue tracks and documents the thought process and the development work needed to achieve this.

The Problem

The assumptions of https://github.com/falcosecurity/falco/issues/2074 imply that a given application could run multiple sinsp inspectors in parallel, each in its own thread. In this model, a given plugin registered and initialized in an inspector can't be shared across multiple inspectors. However, the same plugin dynamic library is shared as a singleton across all the inspectors of the application. This leads to the conclusion that the Go plugin SDK must be able to support multiple consumers that:

In the Go SDK, this maps to the following critical points:

Solutions

jasondellaluce commented 2 years ago

For (P3), I see the following feasible solutions:

As such, I think the way to go is to implement a POC of all three of these and build a common benchmark to evaluate which option best suits our use case.

jasondellaluce commented 2 years ago

Since we didn't have a reliable benchmark to stress the async extraction optimization, I worked on https://github.com/falcosecurity/plugin-sdk-go/pull/60, which has just been merged. Now we have what we need to evaluate the performance of the three options above.

jasondellaluce commented 2 years ago

Given the above, I worked on POC branches that implement all three of the solutions above:

I then ran the benchmarks/async benchmark on each branch with different numbers of threads. My hardware setup was:

# sudo lshw -short
H/W path      Device      Class          Description
====================================================
                          system         VirtualBox
/0                        bus            VirtualBox
/0/0                      memory         128KiB BIOS
/0/1                      memory         16GiB System memory
/0/2                      processor      Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
...

CPU specs: 6 total cores, 12 total threads, max turbo frequency 4.50 GHz, processor base frequency 2.60 GHz, 12 MB cache. The VM was assigned 8 cores.

The benchmark consisted of running this command, with go version go1.17.5 linux/amd64:

benchmarks/async/build/bench -n 1000000 -p <nconsumers> [-a]
[Benchmark results chart: Screenshot 2022-07-14 at 15 03 54]

Observations:

jasondellaluce commented 2 years ago

I find it quite hard to choose which of the three solutions is the best one, as each of them has tough cost/benefit trade-offs.

As such, before choosing I would like to experiment with a fourth solution that is a hybrid of (P3-S2) and (P3-S3): N consumers, N shared locks, and M async workers, with no correlation between N and M. The effectiveness of such a solution would of course rely on the right choice of M, which should ideally always be < N and should not reach the number of cores available to the Go runtime. In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2. I will work on a POC for this too and post the results here.
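To make the idea concrete, here is a minimal, hypothetical sketch of the N-consumers/M-workers decoupling. The names and the channel-based dispatch are illustrative only; they are not the SDK's actual atomic/lock-based protocol across the C/Go boundary.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// request is a placeholder for an extraction request issued by one consumer.
type request struct {
	consumerID int
	fieldID    int
	resCh      chan string // result channel back to the requesting consumer
}

// startWorkers spawns M async workers that serve requests from any consumer.
func startWorkers(m int, reqs <-chan request) {
	for i := 0; i < m; i++ {
		go func() {
			for r := range reqs {
				// Placeholder for the actual field extraction work.
				r.resCh <- fmt.Sprintf("consumer %d, field %d", r.consumerID, r.fieldID)
			}
		}()
	}
}

func main() {
	const nConsumers = 8
	m := runtime.NumCPU() / 2 // the M heuristic discussed above
	if m < 1 {
		m = 1
	}

	reqs := make(chan request, nConsumers)
	startWorkers(m, reqs)

	var wg sync.WaitGroup
	for c := 0; c < nConsumers; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			resCh := make(chan string, 1)
			reqs <- request{consumerID: c, fieldID: 1, resCh: resCh}
			fmt.Println(<-resCh)
		}(c)
	}
	wg.Wait()
	close(reqs)
}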

dwindsor commented 2 years ago

In fact, I suspect an ideal value for this would be runtime.NumCPU() / 2

If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.
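For illustration, one hypothetical way such runtime throttling could look in Go is a pool whose worker count M can be resized while consumers keep submitting work. The names and the channel-based structure below are assumptions, not existing SDK API.

package asyncpool

import "sync"

// Pool is a hypothetical async-worker pool whose size M can change at runtime.
type Pool struct {
	mu    sync.Mutex
	jobs  chan func()
	stops []chan struct{} // one stop channel per running worker
}

func New(m int) *Pool {
	if m < 1 {
		m = 1
	}
	p := &Pool{jobs: make(chan func(), 64)}
	p.Resize(m)
	return p
}

// Resize grows or shrinks the number of async workers.
func (p *Pool) Resize(m int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.stops) < m { // grow: start new workers
		stop := make(chan struct{})
		p.stops = append(p.stops, stop)
		go func() {
			for {
				select {
				case <-stop:
					return
				case job := <-p.jobs:
					job()
				}
			}
		}()
	}
	for len(p.stops) > m { // shrink: stop the most recently started workers
		last := len(p.stops) - 1
		close(p.stops[last])
		p.stops = p.stops[:last]
	}
}

// Submit enqueues a job for any available worker.
func (p *Pool) Submit(job func()) { p.jobs <- job }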

dwindsor commented 2 years ago

Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? With arm64 support, I feel like we have to consider that now. Certain workloads might make (P3-S3)-style solutions (M async consumers) generate too much scheduler overhead for constrained arm64 users.

I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?

jasondellaluce commented 2 years ago

@dwindsor, all good points.

If this is indeed the solution that lands, it feels like it might be a candidate for runtime throttling, e.g. if performance problems are noticed one could change M by changing the capacity of the thread pool supplying the async workers.

I think we could maybe make this configurable at runtime. I'm not sure what the best way to customize it would be; perhaps making M depend on GOMAXPROCS instead of runtime.NumCPU() would be a good approach.
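As a sketch of that idea (the function name and the divide-by-two heuristic are assumptions, not current SDK behavior):

package config

import "runtime"

// defaultAsyncWorkers derives M from GOMAXPROCS rather than runtime.NumCPU(),
// so that a lower GOMAXPROCS (e.g. set for a CPU-limited container) also
// lowers the number of async workers. Hypothetical helper, not SDK API.
func defaultAsyncWorkers() int {
	m := runtime.GOMAXPROCS(0) / 2 // GOMAXPROCS(0) only reads the current value
	if m < 1 {
		m = 1
	}
	return m
}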

Also, for any of the approaches mentioned, what's the general approach to supporting these on resource-constrained systems? ...

That's a good question. One option would be to stick to only one or a few async workers in that case. At the same time, we could consider disabling the async optimization by default when we detect constrained resources. We sort of do that already (async is enabled only if we have 2+ CPUs), so making this check more intelligent might be something to dig into.
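For reference, the kind of check being discussed is roughly the following; the function name is hypothetical and the real SDK check may live elsewhere and be more involved.

package heuristics

import "runtime"

// asyncAvailable mirrors the "2+ CPUs" idea mentioned above: keep the async
// extraction optimization off by default on single-CPU / constrained hosts.
// A smarter check could also look at cgroup CPU quotas or GOMAXPROCS.
func asyncAvailable() bool {
	return runtime.NumCPU() > 1
}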

I know there's a max limit of 5 consumers of the driver, but should we also think about adding a hard limit on the # of event sources?

I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources, since that was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases in which 5+ event sources might be active at the same time. That would also impact CPU time, even with the async optimization out of the equation.

jasondellaluce commented 2 years ago

Ok, I did some homework and re-ran the benchmark documented in https://github.com/falcosecurity/plugin-sdk-go/issues/62#issuecomment-1184431757, with the same hardware setup.

I prepared two novel implementations:

The benchmark was run with these two new implementations, with P3-S4 using an arbitrary value of M = 3. The chart no longer plots P3-S1 and P3-S2, which diverged by an order of magnitude and just added visual noise:

[Benchmark results chart: Screenshot 2022-07-15 at 19 14 05]

Now it is clearly visible that regular C -> Go calls are no match for the async optimization in any of its forms. Interestingly, P3-S2-optimized performs quite well and gets really close to P3-S3 in both raw performance and scaling curve. Even better, P3-S4 comes very close to P3-S3 (excluding some noise) while using only 3 workers. Again, results over 6 threads are not very meaningful here, because we lose the assumption of 1 thread per physical core.

Considering the above, I was intrigued to understand how the value of M affects the performance of the P3-S4 approach. So I ran the benchmark with P3-S4 at values of M varying from 1 to N, and here's the result:

[Benchmark results chart, P3-S4 with varying M: Screenshot 2022-07-15 at 19 14 19]

Interestingly, values of M > 1 seem to perform quite similarly! Also, the hypothesis that P3-S2-optimized and P3-S3 represent the lower and upper bounds of this solution is confirmed.

I think P3-S4 is the most general and flexible solution here and it's probably what we are looking for. As @dwindsor pointed out, the question from now on becomes how the value of M should be determined at runtime, and how to adapt this solution depending on the underlying hardware capacity.

dwindsor commented 2 years ago

Interestingly, values of M > 1 seem to perform quite similarly!

Hmm, that is interesting... I'd have thought that cache issues (workloads switching cores, issuing IPIs, etc.) would show up here! Very cool.

P3-S4, with a default value of M=1, seems to be a good call to me. If it turns out that manipulating M on ultra-beefy systems can increase performance, we can always do it. If not, we can just leave M=1.

PS: thanks for doing this research! 🙏

dwindsor commented 2 years ago

I think this would make sense, especially for the first release of the Falco multi-evt-source feature. We know for sure that folks expect 2 event sources, since that was supported in the past (k8saudit and syscall), but we have not yet explored realistic use cases in which 5+ event sources might be active at the same time. That would also impact CPU time, even with the async optimization out of the equation.

Yeah, we've been wondering ourselves what the likelihood is of failing to attach to a Falco driver because there are already 5 active consumers. IIUC, not much of a chance: 5 consumers is quite a lot, it seems!

Figuring out how to make more granular limits based on hardware configuration seems like it could be messy, so I think I agree with you that hard-coding a limit makes sense!