falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

[Tracking] CI Integration for Driver Kernel Version Testing #531

Closed · incertum closed this 1 year ago

incertum commented 2 years ago

Kicking off a discussion as result of https://github.com/falcosecurity/libs/pull/524

CI Integration / "Fun" for tool developer -> sanity for everyone :)

Success Criteria:

Requirements:

Steps:

Discussion: @FedeDP, @jasondellaluce, @Molter73, @loresuso

Given it is such a daunting task, why not instead start with adding a compiler version grid for a few kernels only to the CI? Perhaps that is already enough for more complete sanity checks than are currently being done in the CI? Let's not forget that 80/20 wins the day and incremental improvements are what matter most.
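
For illustration only, such a grid could be as simple as the loop below; the compiler names, the KERNELDIR path and the cmake/make targets are assumptions for the sketch, not a tested recipe:

```bash
#!/usr/bin/env bash
# Sketch: build the kmod once per compiler against a single set of
# pre-extracted kernel headers. Paths, compiler names and targets are placeholders.
set -euo pipefail

KERNELDIR=${KERNELDIR:-/usr/src/kernels/5.15.0}   # pre-extracted headers for one kernel
for CC in gcc-9 gcc-10 gcc-11 gcc-12; do
  builddir="build-kmod-${CC}"
  cmake -S . -B "${builddir}" -DBUILD_DRIVER=ON > /dev/null
  if make -C "${builddir}" driver CC="${CC}" KERNELDIR="${KERNELDIR}" > "${builddir}.log" 2>&1; then
    echo "OK   ${CC}"
  else
    echo "FAIL ${CC}"
  fi
done
```

A nightly job could widen the grid while PR jobs keep just one or two compilers, which is exactly the 80/20 angle above.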

Molter73 commented 2 years ago

Hi there! I believe validation on the drivers is a super valuable thing and we should definitely have it as part of our CI.

The 2 biggest challenges here will be:

  • Building every possible combination of supported kernel, compiler and driver.
  • Actually validating each driver.

From my experience, building the drivers will always take a huge amount of time and resources. I believe we could instead narrow it down to a few representative kernels for each distro, build and test just those and hope all others still work because changes between them are small enough. It should still help catch compilation and validation errors without taking hours (or even days) for every change added to the repo.

I saw you mentioned #506 on your PR, that is still in early stages but my next 2 steps for it once they are merged are:

  • Run the tests on GHA (which should be relatively straightforward, I've already done it in my PoC).
  • Get some VMs on some cloud provider, build the drivers for those when they come online and run the tests (I haven't discussed this officially yet though, so we'll see if it makes sense from a bureaucratic/budget point of view).

Obviously the e2e test goes through the process of validating the eBPF probe and it captures a few syscalls, but it is by no means an exhaustive test going through every feature of the libs, so if we can get something that tests the drivers more thoroughly like @Andreagit97 is doing for the modern probe, this has a big +1 from me.

incertum commented 2 years ago

> Hi there! I believe validation on the drivers is a super valuable thing and we should definitely have it as part of our CI.

> The 2 biggest challenges here will be:
>
>   • Building every possible combination of supported kernel, compiler and driver.
>   • Actually validating each driver.
>
> From my experience, building the drivers will always take a huge amount of time and resources. I believe we could instead narrow it down to a few representative kernels for each distro, build and test just those and hope all others still work because changes between them are small enough. It should still help catch compilation and validation errors without taking hours (or even days) for every change added to the repo.

Let's do this πŸš€

Should we just use some kernels from the existing https://github.com/falcosecurity/kernel-crawler? Still do some sampling (it can be a daily menu to take advantage of caching) in order to keep a bit of a chaos-monkey component?
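
To make the "daily menu" idea concrete, something along these lines could work; the kernels.json file name and the jq paths are guesses at the kernel-crawler output layout, so treat it as a sketch:

```bash
# Pick a deterministic daily sample of crawled kernel releases (JSON layout is assumed).
seed=$(date +%Y%m%d)   # same selection for the whole day, so header/driver caches stay warm
jq -r '.[][] | .kernelrelease' kernels.json \
  | sort -u \
  | shuf --random-source=<(yes "${seed}") \
  | head -n 10
```

Rotating the seed daily keeps the chaos-monkey flavor while still letting caches be reused across runs on the same day.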

I'm unsure about CI constraints. It would be nice to re-use existing components as much as possible.

I see a kind cluster; it's for sure in AWS, which gives options ... what shall we use as the guest OS for tests? Regular VMs, KubeVirt VMs?

Would we boot into the target kernel at start-up, or do we want to keep re-booting a few times into different kernels? In the case of re-booting, maybe add kernel image links to https://github.com/falcosecurity/kernel-crawler in addition to the headers links?
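
If we go the re-booting route, one common way to hop between already-installed kernels on Debian/Ubuntu-style guests is grub-reboot; the menu entry naming is distro dependent, so this is only illustrative:

```bash
# Boot into a specific installed kernel for the next reboot only (entry name is an example).
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-91-generic"
sudo reboot
```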

How about making such an approach (compile the driver for an array of compilers, with an option to pass KERNELDIR so headers are not extracted multiple times) a first-class citizen in https://github.com/falcosecurity/driverkit (this would provide better continuity)? It would however add a bit of a twist to the existing approach.
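
For reference, today a single driverkit build looks roughly like the command below (flag names from memory, double-check against the driverkit docs); the compiler-grid and KERNELDIR ideas above would be additions on top of this:

```bash
# Current-style driverkit build for one kernel (all values are placeholders).
driverkit docker \
  --kernelrelease 5.15.0-1019-aws \
  --target ubuntu-aws \
  --output-module falco.ko \
  --output-probe probe.o
```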

> I saw you mentioned #506 on your PR, that is still in early stages but my next 2 steps for it once they are merged are:

>   • Run the tests on GHA (which should be relatively straightforward, I've already done it in my PoC).

Nice!

>   • Get some VMs on some cloud provider, build the drivers for those when they come online and run the tests (I haven't discussed this officially yet though, so we'll see if it makes sense from a bureaucratic/budget point of view).

I feel like this would be the most stable option; we only need super small VMs.

> Obviously the e2e test goes through the process of validating the eBPF probe and it captures a few syscalls, but it is by no means an exhaustive test going through every feature of the libs, so if we can get something that tests the drivers more thoroughly like @Andreagit97 is doing for the modern probe, this has a big +1 from me.

Agreed. Just as the new awesome sinsp e2e test shouldn't need to worry much about kernel version or compiler, this should be another, separate test, plus additional unit tests and such. I'll gladly take them, can't have enough tests :)

incertum commented 2 years ago

Who is the primary owner of CI? What is the typical approach to drive such a big effort to completion? Tagging @LucaGuerra

incertum commented 2 years ago

[FUTURE] - Add stress tests to CI (periodic)

leogr commented 2 years ago

cc @LucaGuerra @FedeDP

LucaGuerra commented 2 years ago

Great discussion! I'd like to add some thoughts on this since, as some of you know, I'm happy to help improve Falco testing in all directions.

I essentially agree with @Molter73, the most practical way forward I can think of right now would be creating a VM pool with small VMs from cloud providers. At every PR we can upload & build the driver(s) for each OS installed in the VMs, load them, run some simple e2e tests, and reboot the VM afterwards.

The way to do it with, for example, GitHub Actions, is by getting self-hosted runners and labeling them properly. In my opinion, the coolest parts are:

The downsides are:

Re. the OSs to test I believe we could pick the supported versions from the most popular distros such as Ubuntu(-AWS), AmazonLinux2, CentOS, Debian, Google COS... Perhaps we can select 10 representative distros, with at least one ARM, and install our workers there.
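
To sketch the runner side of this, registering one of those VMs as a labeled self-hosted runner is just the standard GitHub Actions runner setup; the URL, token and labels below are examples only:

```bash
# On the test VM, after unpacking the GitHub Actions runner package:
./config.sh --url https://github.com/falcosecurity/libs \
  --token "${RUNNER_REG_TOKEN}" \
  --labels "self-hosted,amazonlinux2,x86_64,drivers" \
  --unattended
./run.sh   # jobs can then target it with runs-on: [self-hosted, amazonlinux2, x86_64]
```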

The open point here is about funding/hosting the VMs themselves. I don't have much insight about this unfortunately. Of course this approach is feasible only if we can solve this :D , otherwise we'll have to consider other approaches.

incertum commented 1 year ago

@FedeDP and @Molter73 see updates in https://github.com/falcosecurity/libs/pull/524: compiling about 64 drivers was possible in about 1 minute (using pre-downloaded and pre-extracted kernel headers).

I would say the design @LucaGuerra proposed seems pretty awesome :rocket:. Assuming ready-to-use VMs in addition to pre-extracted kernel headers, a CI workflow of at most 10-15 minutes should be achievable.

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

alacuku commented 1 year ago

/remove-lifecycle rotten

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Andreagit97 commented 1 year ago

/remove-lifecycle stale

incertum commented 1 year ago

@maxgio92 @LucaGuerra as we are finalizing the proposal, we can kick off the discussion around implementation details.

I went ahead and asked some Red Teamers I know who are masters at setting up all sorts of shell boxes etc. 🙃 Please read the suggestion below as one solution we could explore:

... maybe something like how Algo takes a Digital Ocean API key and performs the installation/management through Ansible might be a strategy (https://github.com/trailofbits/algo). The repo also has support for other providers like GCE, AWS, etc so it could be an example of how to automate setups/processing/teardown generically.

Looks like there's an official Github Action for interacting with the Digital Ocean API too: https://github.com/digitalocean/action-doctl.

So perhaps it might look like this (a rough CLI sketch follows the list):

  • Github Action starts
  • A new DO droplet is spun up via the DO API; SSH keys are fetched with the doctl command.
  • An Ansible workflow uploads files/installs deps/runs the tests/reports results
  • The DO droplet is torn down and deleted
  • Github Action ends
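
A rough CLI translation of those steps (droplet name, image, size and playbook name are placeholders, not an agreed design):

```bash
# Spin up a droplet, run the driver e2e tests over SSH via Ansible, then tear it down.
ID=$(doctl compute droplet create "libs-ci-${GITHUB_RUN_ID}" \
       --image ubuntu-22-04-x64 --size s-1vcpu-1gb --region nyc1 \
       --ssh-keys "${DO_SSH_KEY_FINGERPRINT}" \
       --wait --format ID --no-header)
IP=$(doctl compute droplet get "${ID}" --format PublicIPv4 --no-header)
ansible-playbook -i "${IP}," -u root drivers-e2e.yml   # upload sources, build, load, test, report
doctl compute droplet delete --force "${ID}"
```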


@maxgio92 would you want to outline your idea here as well? We probably should try at least 2 solutions as things are gonna be very finicky and we need some flexibility.

maxgio92 commented 1 year ago

Hi all, thank you for this very interesting discussion.

I'm going to summarize all points and discussed ways to go - correct me if I missed something :)

I love the idea of spinning up ephemeral VMs without involving cloud providers. It gives us flexibility and the ability to cover a large range of targets. Regarding architecture support, we could have a hybrid approach with one self-hosted GH runner and a related VM per architecture that supports nested virtualization (or bare metal), in order to boot different target VMs on them. Moreover, I'd like to highlight that in general the ephemerality would also help with sustainability :)

At the same time, I think we could start simple by leveraging managed services: instantiate VMs, jump into shells to load drivers, and run the e2e tests - e.g. as @incertum suggested, similarly to https://github.com/trailofbits/algo with Ansible. Then we can work to reach the point above (as for #524).

I added a bit regarding a desired goal: the supported compiler versions in Driverkit are limited right now, as they depend on the specific static builders. Furthermore, we don't have the GCC versions used to build the target kernels in the crawled kernel data. It would be optimal to crawl that and let Driverkit install the target GCC/clang version at runtime - as discussed with @FedeDP. I'm not sure yet how feasible it is and whether the effort is worth it, considering the already intensive pre-built distribution through the DBG.
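
For what it's worth, the builder compiler can often be recovered from the target kernel itself, which might be enough for the crawler without new infrastructure; both probes below are illustrative and availability varies by distro and kernel version:

```bash
# The running kernel reports the compiler it was built with:
cat /proc/version     # "... (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0) ..."
# Kernels >= ~5.8 also record it in the config shipped with the headers/boot files:
grep CONFIG_CC_VERSION_TEXT "/boot/config-$(uname -r)"
```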

Sorry to introduce more doubts :-D I hope this could push the discussion forward and outline action items soon :-)

incertum commented 1 year ago

@maxgio92 ❤️ a few comments

> KVM + libvirt (also e.g. through Vagrant)
> cloud provider
> Kubernetes with KubeVirt

I don't know about your experience, but based on my work on the localhost Vagrant + VBox PR, libvirt was a dumpster fire and just wouldn't work reliably, so I abandoned libvirt and went with VBox.

KubeVirt: yes, it works and I have used it, but it's also not considered too stable; still, I think it could work for us.

Personally I'm still a fan of just stable VMs in cloud providers; DigitalOcean would seem super easy to use. We definitely need AWS EC2 as well, etc. Whether the VMs keep running or not is probably not too important at the beginning; we can just leave them on 24/7 and re-kick them after kmod tests.

And I'm also a fan of giving the ephemeral VMs a try, the way you described.

Everything else you described 👍 and scap-open will do at the beginning; let's not overcomplicate things.
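
As a minimal scap-open smoke test idea (paths and flags are from memory and may differ across libs versions, so treat it as a sketch):

```bash
# Build-tree layout and scap-open flags are assumptions; adjust to the actual build.
sudo insmod build/driver/scap.ko
sudo ./build/libscap/examples/01-open/scap-open --kmod --num_events 1000
sudo rmmod scap
sudo ./build/libscap/examples/01-open/scap-open --bpf build/driver/bpf/probe.o --num_events 1000
```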

Great callout regarding the desired compiler version per kernelrelease, @FedeDP 😎 maybe we need a slight schema update after all (https://github.com/falcosecurity/kernel-crawler/issues/36#issue-1331274416), really just the recommended_compiler_version ... would this be in the realm of possibility? Other ideas?


More importantly: yes, what are going to be the concrete next steps?

@maxgio92 would you want to explore the ephemeral VMs? Who would want to take the lead for the cloud providers? Ideally someone who is already a primary test-infra person? More users can be added to help with testing, but one person should create and manage official accounts and all that fun stuff ...

incertum commented 1 year ago

Closing in favor of the new and more concrete tracking issue https://github.com/falcosecurity/libs/issues/1191 :tada: