[DECISION/VOTE]: Clarify acceptability of a BPF probe-centric approach for custom driver management

falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources

https://falcosecurity.github.io/libs/

Apache License 2.0

231 stars 164 forks source link

[DECISION/VOTE]: Clarify acceptability of a BPF probe-centric approach for custom driver management #1527

Closed incertum closed 4 months ago

incertum commented 11 months ago

@erthalion comment: https://github.com/falcosecurity/libs/issues/1376#issuecomment-1822397681

The way I see it, we still need to agree on one important point before moving forward: I think even a "less workaround" solution will imply having more responsibility on the bpf probe (e.g. about negotiation which progs to attach where), because otherwise chances are low that the implementation will be flexible enough. My takeaway from the discussion is that folks are strongly against that, and to not waste anyone's time we need to clarify if such approach (well thought and thoroughly developed) is not going to be frowned upon.

@Stringy comment: https://github.com/falcosecurity/libs/issues/1376#issuecomment-1822777817

agree with @erthalion that a proper solution will most likely involve shifting some ownership of a custom driver's internals back to the "user" (provided the schema is followed).

@falcosecurity/libs-maintainers, please share your stance on this proposal. +1 indicates your support for enhancing custom driver management for libs adopters. The possible use case for Falco is not yet relevant.

Remember that libs serves as the foundation not only for Falco but also for other tools built upon it. Libs only adopters are crucial to the success of The Falco Project. Supporting and assisting libs only adopters in overcoming challenges that ultimately align with Falco's best interests benefits the entire community.

incertum commented 11 months ago

+1 I'm in favor, as I wasn't particularly convinced by hiding driver management away during the last major SC refactor. Flexibility will allow for faster innovation. While I envision a highly "intelligent" driver in the future, I also believe that end users should have some degree of control. For example, the base_syscalls option is functioning exceptionally well and enables adopters to tailor Falco to their specific use cases and environments.

incertum commented 11 months ago

@Molter73 since you are out I assume +1 for you as well since you opened the proposal in the first place.

gnosek commented 11 months ago

Hmm but what is it that we're actually discussing here?

Switching from sys_enter/sys_exit to per-syscall tracepoints?
Adding some infrastructure so that the above can be done in a fork without maintaining a patch to the loader?
Adding the same infrastructure to piggyback on the existing bpf probe to load unrelated stuff?

For (3), does the unrelated stuff involve scap events in any way (existing schema or extensions) or does it simply use the libs as a convenient loader?

... or something else entirely?

erthalion commented 11 months ago

Hmm but what is it that we're actually discussing here?

Switching from sys_enter/sys_exit to per-syscall tracepoints?

Adding some infrastructure so that the above can be done in a fork without maintaining a patch to the loader?

Adding the same infrastructure to piggyback on the existing bpf probe to load unrelated stuff?

It's close the second point, except the discussion is not about adding anything particular, but rather defining an overall mindset for the future development. A BPF probe-centric approach was possible until recently, when it was tightened up to allow userspace control more precisely what the probe is doing.

In the PR referenced here, we're planning to make probe management more flexible, so it's important to clarify if the proposed change will not be rejected in favour of having more tight control. As soon as this "mindset" question is resolved, everything else is on us.

gnosek commented 11 months ago

My personal opinion is that if the loader changes are needed only for a forked version of libs, they should live in the fork if possible. Please note that this does not mean I'm against eventually upstreaming your changes, but I'm simply not clear on what we accomplish by merging just the loader bits. You still have a whole fork to maintain, right? Are the loader bits more problematic for whatever reason? Do you want to use upstream falco (libs) with a custom probe?

erthalion commented 11 months ago

You still have a whole fork to maintain, right?

Nope, we were actually half way through dropping the fork in favour of the upstream, when we found out it's not possible due to changes under our feet.

Are the loader bits more problematic for whatever reason? Do you want to use upstream falco (libs) with a custom probe?

Yes, the loaded (for both bpf & modern_probe) is very opinionated about how the probe should work. At the same time indeed it's the only piece of puzzle that prevents anyone, including us, from using a custom probe.

incertum commented 11 months ago

My personal opinion is that if the loader changes are needed only for a forked version of libs, they should live in the fork if possible. Please note that this does not mean I'm against eventually upstreaming your changes, but I'm simply not clear on what we accomplish by merging just the loader bits.

@gnosek I actually understand this completely. Another way to look at it is that The Falco Project officially supports and benefits from adopters who build tools using libs. From my perspective, this makes such requests first-class citizens as well. And there will be a commitment to iterate and polish everything. With this in mind, does the justification seem improved from your perspective?

gnosek commented 11 months ago

Nope, we were actually half way through dropping the fork in favour of the upstream, when we found out it's not possible due to changes under our feet.

Hmm not sure I understand how that's possible. You do need a custom probe (that's the entire issue) so you do have to maintain it somewhere, right? I'm genuinely trying to understand.

Yes, the loaded (for both bpf & modern_probe) is very opinionated about how the probe should work. At the same time indeed it's the only piece of puzzle that prevents anyone, including us, from using a custom probe.

I feel that it has to be opinionated, or at the very least, the opinion has to live somewhere (we don't want to yolo attach all tracepoints we find). With a custom probe, the default loader has no idea what to do with the extra sections and I have a feeling they'd always be a second class citizen (the default loader wouldn't know when to attach a tracepoint).

The Falco Project officially supports and benefits from adopters who build tools using libs. From my perspective, this makes such requests first-class citizens as well. And there will be a commitment to iterate and polish everything. With this in mind, does the justification seem improved from your perspective?

Yup, this is a very good point, though I still feel that these custom probes should either get upstreamed eventually or just keep on living in their fork if they're not headed upstream. Still, I can imagine a better loader would reduce the initial friction of playing with a new probe so it might be worth a try.

Looking through https://github.com/falcosecurity/libs/pull/1375, it seems like the vast majority of the patch is making the BPF program array dynamically-sized. Given the extent of changes required, I think I'd rather introduce a separate bpf_attached_prog_list for custom programs (btw, I have opinions about the name :P).

For v1, we can unconditionally load all custom programs in load_single_prog and append their info to the new array (like in your patch), for v2 I'd rather see unused programs unloaded (or ideally not even loaded in the first place, avoiding potential verifier issues) but that's a general issue, not with your patch.

Again for v1, enforce_sc_set can unconditionally attach/detach the programs based on handle->capturing, but I'd much rather see a callback there (defaulting to the handle->capturing logic so e.g. upstream falco doesn't need to care). In an ideal world, we'd have a way to control the custom tracepoints with some sort of callback (e.g. via ->configure) but I understand nobody needs this right now so let's ignore it ;)

In scap_bpf_close, we detach and clean up all the custom programs from the extra table.

(I had other, way more complex ideas for this but I believe this is a decent balance between giving custom probes the flexibility and minimizing impact on the built in tracepoints).

erthalion commented 11 months ago

Nope, we were actually half way through dropping the fork in favour of the upstream, when we found out it's not possible due to changes under our feet.

Hmm not sure I understand how that's possible. You do need a custom probe (that's the entire issue) so you do have to maintain it somewhere, right? I'm genuinely trying to understand.

No worries, it's fine :) In our case the custom bpf probe lives outside the Falco tree, and we combine everything together in a drop-in manner during the build time. @Stringy can give more details about the process.

incertum commented 11 months ago

Quick note: @gnosek I like the progression you described here https://github.com/falcosecurity/libs/issues/1527#issuecomment-1845012162 very much and appreciate the technical details as well 😉 I also think that down the road (assuming the custom probe you maintain @erthalion is successful) it would be nice to consider upstreaming the probe.

So is https://github.com/falcosecurity/libs/issues/1527#issuecomment-1845012162 an ok, we have a path forward?

leogr commented 11 months ago

I really appreciated the ongoing discussion and recognized some potential benefits, though I don't have strong opinions yet.

I think clarity on the proposal's long-term implications, especially regarding maintenance costs and overall project direction, has been overlooked a bit, IMO. After having gone through all the comments, I realized I have more questions than answers :sweat_smile: I've tried to collect them and report them below to help reach clarity.

Scope:

Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project?
- (I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)
How will this affect prioritization in the Falco project?
Are there alternatives under consideration?

Maintenance Costs:

Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?
Most importantly, is anyone willing to commit to maintaining this?
Also, what is the expected impact of this proposal on existing components (ie the verifier with the legacy probe is already a pain) and ongoing efforts like code cleanups, refactoring, API revisions, etc.?
If the proposal's long-term goals include upstreaming alternative probes, how do we plan to maintain them?

Drivers function parity:

Will this proposal aim to discontinue the current function parity between our drivers?
Eventually, how do we plan to manage potential disparities between different drivers?

Event schema compatibility:

Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)
Or do we expect this to drive significant event schema changes in the future? (if so, how will Falco users be impacted?)

Other ideas/proposals (eg LSM):

How does this proposal relate to ongoing discussions and ideas around using LSM hooks / kprobes?
Can we envision a common route or strategy to address these types of technical challenges in a cohesive manner?

I really believe focusing on defining the proposal in detail before proceeding to a vote is crucial. In particular, I think goals, non-goals, and full extent must be specified before deciding. :pray:

erthalion commented 10 months ago

Thanks for the discussion, folks. I'll try to follow up step-by-step.

So is #1527 (comment) an ok, we have a path forward?

@gnosek @incertum Just to make sure I got it right:

For v1, we can unconditionally load all custom programs in load_single_prog and append their info to the new array (like in your patch), for v2 I'd rather see unused programs unloaded (or ideally not even loaded in the first place, avoiding potential verifier issues) but that's a general issue, not with your patch.

Again for v1, enforce_sc_set can unconditionally attach/detach the programs based on handle->capturing, but I'd much rather see a callback there (defaulting to the handle->capturing logic so e.g. upstream falco doesn't need to care). In an ideal world, we'd have a way to control the custom tracepoints with some sort of callback (e.g. via ->configure) but I understand nobody needs this right now so let's ignore it ;)

Does it mean the high level plan here is essentially what we are proposing, namely make the loader flexible enough to load custom programs from the probe (based on certain conditions)?

I think clarity on the proposal's long-term implications, especially regarding maintenance costs and overall project direction, has been overlooked a bit, IMO. After having gone through all the comments, I realized I have more questions than answers 😅 I've tried to collect them and report them below to help reach clarity.

@leogr I think it's important to distinguish those parts of the proposal that are about "decision/vote" on the high level approach (whether to give the bpf probe more responsibility), and implementation consequences. Having this in mind:

Scope:

Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project? (I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)

I don't see any implications for the whole Falco project.

How will this affect prioritization in the Falco project?

Don't think there would be any special impact, aside of regular PR review process.

Are there alternatives under consideration?

As far as I see, the only alternative so far is do not have this feature at all.

Maintenance Costs:

Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?

This one depends on the implementation. The way we see it, the only maintenance overhead would be to test the fact that new way of loading programs from the probe works.

Most importantly, is anyone willing to commit to maintaining this?

Depending on the implementation the scope could be different, but if everything goes fine there will be me, @Molter73 and @Stringy.

Also, what is the expected impact of this proposal on existing components (ie the verifier with the legacy probe is already a pain) and ongoing efforts like code cleanups, refactoring, API revisions, etc.?

Again, heavily implementation dependent, but we don't see any significant impact on other parts of the project.

If the proposal's long-term goals include upstreaming alternative probes, how do we plan to maintain them?

It's a very long-term vision, probably there could be different "flavours" of the probe. The maintenance in this case would be more runs of the testing pipeline, one for each flavour.

Drivers function parity:

Will this proposal aim to discontinue the current function parity between our drivers? Eventually, how do we plan to manage potential disparities between different drivers?

What kind of disparities do you have in mind here?

Event schema compatibility:

Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)

Yep.

Or do we expect this to drive significant event schema changes in the future? (if so, how will Falco users be impacted?)

No, unless it fits into the future community development roadmap.

Other ideas/proposals (eg LSM):

How does this proposal relate to ongoing discussions and ideas around using LSM hooks / kprobes?

This proposal is orthogonal to those discussions, but could enhance any relevant implementation.

Can we envision a common route or strategy to address these types of technical challenges in a cohesive manner?

Well, that's the whole topic of the proposal -- the best strategy we see is to put more responsibility on the bpf probe itself, introducing a flexible interface to negotiate expectations what to load and, potentially in the future, which data is going to be produced.

leogr commented 10 months ago

Thank you @erthalion ! A few more comments point-by-point :point_down:

Scope: Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project? (I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)

I don't see any implications for the whole Falco project.

I'm unsure about this, and I would like the proposal to be clear in this regard. My goal just is to ensure everyone is on the same page.

For instance, the implementation of this proposal might require some modification to the Driver API. The driver API is versioned, and the versioning allows to check the compatibility between a given version of Falco and a given version of a driver. Thanks to that, a given release of Falco is compatible with more driver releases. This is just an example (probably not the better one). Also, impacting the whole project in a minor way wouldn't likely be an issue, just the proposal should clarify that so we are aligned.

How will this affect prioritization in the Falco project?

Don't think there would be any special impact, aside of regular PR review process.

I actually meant if this would be an item for the Falco project roadmap or not. If not, this should be kind of best effort, which is ok, but will have very low priority compared to Falco items (for example, the two months before the release we are usually at full capacity on Falco items).

Maintenance Costs: Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?

This one depends on the implementation. The way we see it, the only maintenance overhead would be to test the fact that new way of loading programs from the probe works.

It would be desirable to define a precise strategy and understand if this may affect the actual testing process since driver testing has been a huge effort historically (especially because of the verifier and testing across different versions of kernels, clang, etc..). Likely, the proposal need to define the implementation strategy, otherwise it would be hard to understand implications.

Most importantly, is anyone willing to commit to maintaining this?

Depending on the implementation the scope could be different, but if everything goes fine there will be me, @Molter73 and @Stringy.

Good to know! :+1:

Drivers function parity: Will this proposal aim to discontinue the current function parity between our drivers? Eventually, how do we plan to manage potential disparities between different drivers?

What kind of disparities do you have in mind here?

No one atm. Just wondering. Generally speaking, if a user-facing feature can be only implemented and not in the kmod, this will create a disparity.

Event schema compatibility: Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)

Yep.

Ok, this is crystal clear now :+1:

PS I will need to find a bit of time to share some thoughts about possible LSM ideas. The goal is just to understand if we can join efforts or not.

incertum commented 10 months ago

driver testing has been a huge effort historically

This proposal for now is about scap changes only to load a custom probe. Therefore, I don't see this impacting our drivers testing.

parity between our drivers

Since this is for libs only and not Falco for now, I also don't see a problem here. In fact we already have driver disparities in libs today. For instance we support some archs only for some drivers. Also I don't recall any such discussions for expanding archs support, so I am still a bit puzzled why we now shift criteria.

Does it mean the high level plan here is essentially what we are proposing, namely make the loader flexible enough to load custom programs from the probe (based on certain conditions)?

re the question for me @erthalion: I don't see why this would pose a problem or big impact for libs. Rather the opposite, it will allow for more custom libs adoption. So yes "make the loader flexible enough to load custom programs from the probe" 👍

@leogr would propose to keep this discussion more focused on supporting "loading a custom probe" and having "an agreement that we will allow the bpf loader to be flexible to do that". I think it will make libs much more flexible and drive libs adoption which is excellent for the overall project health. In addition, three engineers who already commit to maintaining it is a privilege in open source.

Everything else I would defer to a formal proposal "overall driver and event filtering modernization" where we design the future of all things LSM hooks, kernel side filtering, event schemas and event handling etc. If no one else wants to lead starting such a proposal I can do it.

poiana commented 6 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 5 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana commented 4 months ago

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community. /close

poiana commented 4 months ago

@poiana: Closing this issue.

In response to [this](https://github.com/falcosecurity/libs/issues/1527#issuecomment-2198243361): >Rotten issues close after 30d of inactivity. > >Reopen the issue with `/reopen`. > >Mark the issue as fresh with `/remove-lifecycle rotten`. > >Provide feedback via https://github.com/falcosecurity/community. >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.