falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

[Tracking] CI Integration for Driver Kernel Version Testing #531

Closed · incertum closed this 1 year ago

incertum commented 2 years ago

Kicking off a discussion as result of https://github.com/falcosecurity/libs/pull/524

CI Integration / "Fun" for tool developer -> sanity for everyone :)

Success Criteria:

Requirements:

Steps:

Discussion: @FedeDP, @jasondellaluce, @Molter73, @loresuso

Given it is such a daunting task, why not instead start with adding a compiler version grid for a few kernels only to the CI? Perhaps that is already enough for more complete sanity checks than are currently being done in the CI? Let's not forget that 80/20 wins the day and incremental improvements are what matter most.
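
For illustration only, such a grid could be as simple as the loop below; the compiler names, the KERNELDIR path and the cmake/make targets are assumptions for the sketch, not a tested recipe:

```bash
#!/usr/bin/env bash
# Sketch: build the kmod once per compiler against a single set of
# pre-extracted kernel headers. Paths, compiler names and targets are placeholders.
set -euo pipefail

KERNELDIR=${KERNELDIR:-/usr/src/kernels/5.15.0}   # pre-extracted headers for one kernel
for CC in gcc-9 gcc-10 gcc-11 gcc-12; do
  builddir="build-kmod-${CC}"
  cmake -S . -B "${builddir}" -DBUILD_DRIVER=ON > /dev/null
  if make -C "${builddir}" driver CC="${CC}" KERNELDIR="${KERNELDIR}" > "${builddir}.log" 2>&1; then
    echo "OK   ${CC}"
  else
    echo "FAIL ${CC}"
  fi
done
```

A nightly job could widen the grid while PR jobs keep just one or two compilers, which is exactly the 80/20 angle above.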

Molter73 commented 2 years ago

Hi there! I believe validation on the drivers is a super valuable thing and we should definitely have it as part of our CI.

The 2 biggest challenges here will be:

  • Building every possible combination of supported kernel, compiler and driver.
  • Actually validating each driver.

From my experience, building the drivers will always take a huge amount of time and resources. I believe we could instead narrow it down to a few representative kernels for each distro, build and test just those and hope all others still work because changes between them are small enough. It should still help catch compilation and validation errors without taking hours (or even days) for every change added to the repo.

I saw you mentioned #506 on your PR, that is still in early stages but my next 2 steps for it once they are merged are:

  • Run the tests on GHA (which should be relatively straightforward, I've already done it in my PoC).
  • Get some VMs on some cloud provider, build the drivers for those when they come online and run the tests (I haven't discussed this officially yet though, so we'll see if it makes sense from a bureaucratic/budget point of view).

Obviously the e2e test goes through the process of validating the eBPF probe and it captures a few syscalls, but it is by no means an exhaustive test going through every feature of the libs, so if we can get something that tests the drivers more thoroughly like @Andreagit97 is doing for the modern probe, this has a big +1 from me.

incertum commented 2 years ago

> Hi there! I believe validation on the drivers is a super valuable thing and we should definitely have it as part of our CI.

> The 2 biggest challenges here will be:
>
>   • Building every possible combination of supported kernel, compiler and driver.
>   • Actually validating each driver.
>
> From my experience, building the drivers will always take a huge amount of time and resources. I believe we could instead narrow it down to a few representative kernels for each distro, build and test just those and hope all others still work because changes between them are small enough. It should still help catch compilation and validation errors without taking hours (or even days) for every change added to the repo.

Let's do this πŸš€

Should we just use some kernels from the existing https://github.com/falcosecurity/kernel-crawler? Still do some sampling (it can be a daily menu to take advantage of caching) in order to keep a bit of a chaos-monkey component?
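
To make the "daily menu" idea concrete, something along these lines could work; the kernels.json file name and the jq paths are guesses at the kernel-crawler output layout, so treat it as a sketch:

```bash
# Pick a deterministic daily sample of crawled kernel releases (JSON layout is assumed).
seed=$(date +%Y%m%d)   # same selection for the whole day, so header/driver caches stay warm
jq -r '.[][] | .kernelrelease' kernels.json \
  | sort -u \
  | shuf --random-source=<(yes "${seed}") \
  | head -n 10
```

Rotating the seed daily keeps the chaos-monkey flavor while still letting caches be reused across runs on the same day.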

I'm unsure about CI constraints. It would be nice to re-use existing components as much as possible.

I see a kind cluster; it's for sure in AWS, which gives options ... what shall we use as the guest OS for tests? Regular VMs, KubeVirt VMs?

Would we boot into the target kernel at start-up, or do we want to keep re-booting a few times into different kernels? In the case of re-booting, maybe add kernel image links to https://github.com/falcosecurity/kernel-crawler in addition to the headers links?
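
If we go the re-booting route, one common way to hop between already-installed kernels on Debian/Ubuntu-style guests is grub-reboot; the menu entry naming is distro dependent, so this is only illustrative:

```bash
# Boot into a specific installed kernel for the next reboot only (entry name is an example).
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-91-generic"
sudo reboot
```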

How about making such an approach (compile the driver for an array of compilers, with an option to pass KERNELDIR so headers are not extracted multiple times) a first-class citizen in https://github.com/falcosecurity/driverkit (this would provide better continuity)? It would however add a bit of a twist to the existing approach.
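
For reference, today a single driverkit build looks roughly like the command below (flag names from memory, double-check against the driverkit docs); the compiler-grid and KERNELDIR ideas above would be additions on top of this:

```bash
# Current-style driverkit build for one kernel (all values are placeholders).
driverkit docker \
  --kernelrelease 5.15.0-1019-aws \
  --target ubuntu-aws \
  --output-module falco.ko \
  --output-probe probe.o
```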

> I saw you mentioned #506 on your PR, that is still in early stages but my next 2 steps for it once they are merged are:

>   • Run the tests on GHA (which should be relatively straightforward, I've already done it in my PoC).

Nice!

>   • Get some VMs on some cloud provider, build the drivers for those when they come online and run the tests (I haven't discussed this officially yet though, so we'll see if it makes sense from a bureaucratic/budget point of view).

I feel like this would be the most stable option; we only need super small VMs.

> Obviously the e2e test goes through the process of validating the eBPF probe and it captures a few syscalls, but it is by no means an exhaustive test going through every feature of the libs, so if we can get something that tests the drivers more thoroughly like @Andreagit97 is doing for the modern probe, this has a big +1 from me.

Agreed. Just as the new awesome sinsp e2e test shouldn't need to worry much about kernel version or compiler, this should be another, separate test, plus additional unit tests and such. I'll gladly take them, can't have enough tests :)

incertum commented 2 years ago

Who is the primary owner of CI? What is the typical approach to drive such a big effort to completion? Tagging @LucaGuerra

incertum commented 2 years ago

[FUTURE] - Add stress tests to CI (periodic)

leogr commented 2 years ago

cc @LucaGuerra @FedeDP

LucaGuerra commented 2 years ago

Great discussion! I'd like to add some thoughts on this since, as some of you know, I'm happy to help improve Falco testing in all directions.

I essentially agree with @Molter73, the most practical way forward I can think of right now would be creating a VM pool with small VMs from cloud providers. At every PR we can upload & build the driver(s) for each OS installed in the VMs, load them, run some simple e2e tests, and reboot the VM afterwards.

The way to do it with, for example, GitHub Actions, is by getting self-hosted runners and labeling them properly. In my opinion, the coolest parts are:

The downsides are:

Re. the OSs to test I believe we could pick the supported versions from the most popular distros such as Ubuntu(-AWS), AmazonLinux2, CentOS, Debian, Google COS... Perhaps we can select 10 representative distros, with at least one ARM, and install our workers there.
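
To sketch the runner side of this, registering one of those VMs as a labeled self-hosted runner is just the standard GitHub Actions runner setup; the URL, token and labels below are examples only:

```bash
# On the test VM, after unpacking the GitHub Actions runner package:
./config.sh --url https://github.com/falcosecurity/libs \
  --token "${RUNNER_REG_TOKEN}" \
  --labels "self-hosted,amazonlinux2,x86_64,drivers" \
  --unattended
./run.sh   # jobs can then target it with runs-on: [self-hosted, amazonlinux2, x86_64]
```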

The open point here is about funding/hosting the VMs themselves. I don't have much insight about this unfortunately. Of course this approach is feasible only if we can solve this :D , otherwise we'll have to consider other approaches.

incertum commented 1 year ago

@FedeDP and @Molter73 see updates in https://github.com/falcosecurity/libs/pull/524: compiling about 64 drivers was possible in about 1 minute (using pre-downloaded and pre-extracted kernel headers).

I would say the design @LucaGuerra proposed seems pretty awesome :rocket:. Assuming ready-to-use VMs in addition to pre-extracted kernel headers, a CI workflow of at most 10-15 minutes should be achievable.

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

alacuku commented 1 year ago

/remove-lifecycle rotten

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Andreagit97 commented 1 year ago

/remove-lifecycle stale

incertum commented 1 year ago

@maxgio92 @LucaGuerra as we are finalizing the proposal, we can kick off the discussion around implementation details.

I went ahead and asked some Red Teamers I know who are masters at setting up all sorts of shell boxes etc. 🙃 Please read the suggestion below as one solution we could explore:

... maybe something like how Algo takes a Digital Ocean API key and performs the installation/management through Ansible might be a strategy (https://github.com/trailofbits/algo). The repo also has support for other providers like GCE, AWS, etc so it could be an example of how to automate setups/processing/teardown generically.

Looks like there's an official Github Action for interacting with the Digital Ocean API too: https://github.com/digitalocean/action-doctl.

So perhaps it might look like this (a rough CLI sketch follows the list):

  • Github Action starts
  • A new DO droplet is spun up via the DO API; SSH keys are fetched with the doctl command.
  • An Ansible workflow uploads files/installs deps/runs the tests/reports results
  • The DO droplet is torn down and deleted
  • Github Action ends
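
A rough CLI translation of those steps (droplet name, image, size and playbook name are placeholders, not an agreed design):

```bash
# Spin up a droplet, run the driver e2e tests over SSH via Ansible, then tear it down.
ID=$(doctl compute droplet create "libs-ci-${GITHUB_RUN_ID}" \
       --image ubuntu-22-04-x64 --size s-1vcpu-1gb --region nyc1 \
       --ssh-keys "${DO_SSH_KEY_FINGERPRINT}" \
       --wait --format ID --no-header)
IP=$(doctl compute droplet get "${ID}" --format PublicIPv4 --no-header)
ansible-playbook -i "${IP}," -u root drivers-e2e.yml   # upload sources, build, load, test, report
doctl compute droplet delete --force "${ID}"
```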


@maxgio92 would you want to outline your idea here as well? We probably should try at least 2 solutions as things are gonna be very finicky and we need some flexibility.

maxgio92 commented 1 year ago

Hi all, thank you for this very interesting discussion.

I'm going to summarize all points and discussed ways to go - correct me if I missed something :)

I love the idea of spinning up ephemeral VMs without involving cloud providers. It gives us flexibility and the ability to cover a large range of targets. Regarding architecture support, we could have a hybrid approach with one self-hosted GH runner and a related VM per architecture that supports nested virtualization (or bare metal), in order to boot different target VMs on them. Moreover, I'd like to highlight that in general the ephemerality would also help with sustainability :)

At the same time, I think we could start simple by leveraging managed services: instantiate VMs, jump into shells to load drivers, and run the e2e tests - e.g. as @incertum suggested, similarly to https://github.com/trailofbits/algo with Ansible. Then we can work to reach the point above (as for #524).

I added a bit regarding a desired goal: the supported compiler versions in Driverkit are limited right now, as they depend on the specific static builders. Furthermore, we don't have the GCC versions used to build the target kernels in the crawled kernel data. It would be optimal to crawl that and let Driverkit install the target GCC/clang version at runtime - as discussed with @FedeDP. I'm not sure yet how feasible it is and whether the effort is worth it, considering the already intensive pre-built distribution through the DBG.
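
For what it's worth, the builder compiler can often be recovered from the target kernel itself, which might be enough for the crawler without new infrastructure; both probes below are illustrative and availability varies by distro and kernel version:

```bash
# The running kernel reports the compiler it was built with:
cat /proc/version     # "... (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0) ..."
# Kernels >= ~5.8 also record it in the config shipped with the headers/boot files:
grep CONFIG_CC_VERSION_TEXT "/boot/config-$(uname -r)"
```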

Sorry to introduce more doubts :-D I hope this could push the discussion forward and outline action items soon :-)

incertum commented 1 year ago

@maxgio92 ❤️ a few comments

> KVM + libvirt (also e.g. through Vagrant)
> cloud provider
> Kubernetes with KubeVirt

I don't know about your experience, but based on my work on the localhost Vagrant + VBox PR, libvirt was a dumpster fire and just wouldn't work reliably, so I abandoned libvirt and went with VBox.

KubeVirt: yes, it works and I have used it, but it's also not considered too stable; still, I think it could work for us.

Personally I'm still a fan of just stable VMs in cloud providers; DigitalOcean would seem super easy to use. We definitely need AWS EC2 as well, etc. Whether the VMs keep running or not is probably not too important at the beginning; we can just leave them on 24/7 and re-kick them after kmod tests.

And I'm also a fan of giving the ephemeral VMs a try, the way you described.

Everything else you described 👍 and scap-open will do at the beginning; let's not overcomplicate things.
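
As a minimal scap-open smoke test idea (paths and flags are from memory and may differ across libs versions, so treat it as a sketch):

```bash
# Build-tree layout and scap-open flags are assumptions; adjust to the actual build.
sudo insmod build/driver/scap.ko
sudo ./build/libscap/examples/01-open/scap-open --kmod --num_events 1000
sudo rmmod scap
sudo ./build/libscap/examples/01-open/scap-open --bpf build/driver/bpf/probe.o --num_events 1000
```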

Great callout regarding the desired compiler version per kernelrelease, @FedeDP 😎 maybe we need a slight schema update after all (https://github.com/falcosecurity/kernel-crawler/issues/36#issue-1331274416), really just the recommended_compiler_version ... would this be in the realm of possibility? Other ideas?


More importantly: yes, what are going to be the concrete next steps?

@maxgio92 would you want to explore the ephemeral VMs? Who would want to take the lead for the cloud providers? Ideally someone who is already a primary test-infra person? More users can be added to help with testing, but one person should create and manage official accounts and all that fun stuff ...

incertum commented 1 year ago

Closing in favor of the new and more concrete tracking issue https://github.com/falcosecurity/libs/issues/1191 :tada: