falcosecurity / cncf-green-review-testing

Falco configurations intended for testing with the CNCF Green Reviews Working Group
Apache License 2.0

[Action] Decide framework for benchmark tests #16

Open · nikimanoledaki opened this issue 4 months ago

nikimanoledaki commented 4 months ago

Motivation

As part of the CNCF Green Reviews WG's milestone for KubeCon+CloudNativeCon Europe '24, our main goal is to create the first benchmark test for Falco.

Feature

Proposal 1: Self-hosted runners with Actions Runner Controller (ARC)

Self-hosted GitHub Actions runners could help us achieve this, specifically via the Actions Runner Controller (ARC). We could add a self-hosted runner in this repo (falcosecurity/cncf-green-review-testing) so that the Falco maintainers have ownership of the benchmark tests.

The benchmark tests can then be run in the cluster where Kepler and Prometheus are running and collecting energy metrics, along with other metrics for the SCI.

Stretch: The workflow could be triggered when there are new releases of Falco. GitHub Actions workflows can be triggered by the build pipeline through a workflow_dispatch event.
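A minimal, hypothetical sketch of what such a workflow could look like; the runner label, input name, and script path are placeholders, not agreed configuration:

```yaml
# Hypothetical benchmark workflow: triggered manually or by the Falco build
# pipeline via workflow_dispatch, and executed on an ARC-managed self-hosted
# runner. All names below are illustrative only.
name: falco-benchmark
on:
  workflow_dispatch:
    inputs:
      falco_version:
        description: "Falco release to benchmark"
        required: true
jobs:
  benchmark:
    runs-on: arc-runner-set   # assumed ARC runner scale set label
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmark steps
        run: ./scripts/benchmark.sh "${{ inputs.falco_version }}"   # hypothetical script
```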

Alternatives

Proposal 2: bash script that runs as a Kubernetes CronJob

We could create and maintain bash scripts that run the steps. We could run these as Kubernetes Jobs.
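A minimal sketch of this option, assuming the script is shipped in a ConfigMap and run in the falco namespace; the image, schedule, and names are placeholders:

```yaml
# Hypothetical CronJob wrapping the benchmark bash script. The tooling image,
# schedule, and ConfigMap name are assumptions for illustration.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: falco-benchmark
  namespace: falco
spec:
  schedule: "0 */6 * * *"   # every six hours, for illustration only
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: benchmark
              image: bitnami/kubectl:latest   # assumed image with kubectl and bash
              command: ["/bin/bash", "/scripts/benchmark.sh"]
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                name: benchmark-scripts   # hypothetical ConfigMap holding the script
                defaultMode: 0755
```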

Additional context

Suggested steps

  1. Validate that Falco is deployed and running in the falco namespace on the isolated worker node.
  2. Validate that the microservice workload is deployed in the falco namespace.
  3. Benchmark test: for example, reach a given kernel event rate by sending requests to one of the microservice demo's endpoints for the given duration (e.g. 15 min).
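A rough sketch of these suggested steps as workflow job steps; the namespace follows the issue, while the Falco DaemonSet name, the demo Service URL, and the load tool (`hey`) are assumptions:

```yaml
# Illustrative only - each step mirrors one of the suggested steps above.
steps:
  - name: Validate Falco is running in the falco namespace
    run: |
      kubectl -n falco rollout status daemonset/falco --timeout=120s
      kubectl -n falco get pods -o wide   # check the pods landed on the isolated node
  - name: Validate the microservice workload is deployed
    run: kubectl -n falco get deployments
  - name: Drive load against one demo endpoint for 15 minutes
    run: |
      # placeholder load generator and endpoint; the actual target kernel event
      # rate and tooling are still to be decided by the WG
      hey -z 15m http://microservice-demo.falco.svc.cluster.local:8080/
```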

Benchmark Test Acceptance Criteria

incertum commented 4 months ago

Excellent outline @nikimanoledaki - I'm favoring the self-hosted GitHub Actions runners option. @maxgio92 is our infra management expert.

nikimanoledaki commented 4 months ago

Wonderful - we'll start on this immediately.

In the meantime, we will need your help with the following requirements:

nikimanoledaki commented 4 months ago

Also - @incertum, are you currently using the microservice demo that is deployed on the cluster for these stress tests, or planning to use it? Or can we remove it from the cluster for now? We can just comment it out so that Flux stops reconciling it. Please let me know :)

maxgio92 commented 4 months ago

Hi @nikimanoledaki, thank you for the detailed proposal. I also prefer the ARC way, and I like the idea of relating a green-review benchmark to a specific Falco release.

I'd propose guaranteeing quality of service for the benchmark jobs and for ARC. For the benchmark I'd provision a dedicated node pool if the cluster is shared with the energy monitoring services. For ARC I don't think a dedicated node pool is needed - the system pool used by the energy monitoring services should be fine - but we could set guaranteed QoS at the pod level.
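For context, guaranteed QoS at the pod level just means every container sets resource requests equal to its limits; a minimal sketch, in which the node label, image, and resource values are placeholders:

```yaml
# Illustrative runner pod with guaranteed QoS: requests == limits for all containers.
apiVersion: v1
kind: Pod
metadata:
  name: arc-runner
spec:
  nodeSelector:
    node-pool: system   # hypothetical label for the system node pool
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest   # assumed runner image
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
```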

WDYT?

nikimanoledaki commented 4 months ago

Hi @maxgio92! 👋

@rossf7 has been working on provisioning an isolated worker node for the falco namespace + components, which is nearly complete:

Please let us know if you have suggestions on any further isolation that could help with the benchmark tests :)

I'm not 100% sure if it would be best for the ARC runner Pod to run on the system node or the Falco-only node. I don't think it should run in the test environment - running everything ARC-related on one of the system nodes would be better. WDYT? 🤔

rossf7 commented 4 months ago

Hi @maxgio92, yes, as Niki says, separating the components onto dedicated nodes is nearly complete. We just need to add a node selector to our Flux pods.

For the benchmark I'd provision a dedicated node pool, if the cluster is shared with the energy monitoring services.

Yes, we will provision dedicated nodes for Falco using the labels defined in https://github.com/falcosecurity/cncf-green-review-testing/issues/2; this is done via our tofu automation.
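For illustration, pinning a benchmark component to the dedicated Falco nodes would look roughly like this; the label key/value is a placeholder for whatever is defined in issue #2, and the toleration is only needed if those nodes are tainted:

```yaml
# Hypothetical pod template fragment for scheduling onto the dedicated nodes.
spec:
  template:
    spec:
      nodeSelector:
        cncf-project: falco   # placeholder - use the labels from issue #2
      tolerations:
        - key: cncf-project   # only if the dedicated nodes carry a matching taint
          operator: Exists
          effect: NoSchedule
```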

I'm not 100% sure if it would be best for the ARC runner Pod to run on the system node or the Falco-only node. I don't think it should run in the test environment - running everything ARC-related on one of the system nodes would be better. WDYT?

@nikimanoledaki I think it would be better to run the ARC pods on our system node, to keep the nodes we're collecting measurements on as isolated as possible.

If we get short on resources we could move some of our internal components to the control plane node.

maxgio92 commented 4 months ago

Thanks @rossf7 and @nikimanoledaki! I agree on scheduling ARC on system nodes.

incertum commented 4 months ago

Also - @incertum are you currently using the microservice demo that is currently deployed on the cluster for these stress test or planning to use it? Or can we remove it from the cluster for now? We can just comment it out so that Flux stops reconciling it. Please let me know :)

We are not using it yet, but yes please keep it deployed. Much appreciated!

raymundovr commented 4 months ago

Hi @incertum 👋

We are not using it yet, but yes please keep it deployed. Much appreciated!

During our last discussion, we were not sure about the goal of this microservices deployment; we also noticed that there's a stress test Deployment shipped ([1] and [2]). To enrich our discussions, could you please explain a bit how these two components interact and play together with Falco, and what the plans for them are? Thanks!

incertum commented 4 months ago

@raymundovr

During our last discussion, we were not sure on the goal of this microservices deployment, it was also noticed that there's a stress test Deployment shipped [1] and [2].

We previously discussed that for a v1 we will use the following synthetic workloads:

explain a bit how these two components interact / play together with Falco and what are the plans for them? Thanks!

Hi, we added a lot of new documentation to our website (https://falco.org/) explaining what Falco does and how it works, if you are interested in more details. Falco is a Linux kernel security monitoring tool, passively hooking into syscall tracepoints. The more syscalls happen on a server, the more work Falco has to do (simplified). Notably, Falco does not interact with the synthetic workloads; rather, we use them to increase the frequency of syscalls, thereby making our testbed resemble real-life production environments where a diverse set of applications runs 24/7.

What additional questions do you have for us?

nikimanoledaki commented 4 months ago

A few questions provided by @roobre :)

raymundovr commented 4 months ago

Thank you @incertum for the clarifications. It is really helpful! @nikimanoledaki on the second point, I think it's what @incertum said:

[...]

  • stress-ng to add some static 24/7 baseline syscalls activity from our side, because Falco uses no CPU when nothing really runs on a server. [...] The more syscalls happen on a server the more work Falco has to do (simplified). Notably, Falco does not interact with synthetic workloads, rather, we use them to increase the frequency of syscalls, thereby making our testbed resemble real-life production environments where a diverse set of applications runs 24/7.
nikimanoledaki commented 4 months ago

Thanks @raymundovr & @incertum. Rewording my questions for clarity:

  1. Does any type of syscall trigger Falco, and does the type of stressor matter?

For example, we discussed specific syscalls from I/O or networking in the past. However, we're doing stress-ng --matrix 1:

--matrix N start N workers that perform various matrix operations on floating point values. Testing on 64 bit x86 hardware shows that this provides a good mix of memory, cache and floating point operations and is an excellent way to make a CPU run hot. By default, this will exercise all the matrix stress methods one by one. One can specify a specific matrix stress method with the --matrix-method option.

This uses stress-ng to stress the CPU through mathematical operations, as opposed to I/O read/writes or networking-related syscalls.

A different way to do this would be with --class, where we can track the class of stressor:

specify the class of stressors to run. Stressors are classified into one or more of the following classes: cpu, cpu-cache, device, io, interrupt, filesystem, memory, network, os, pipe, scheduler and VM.

I'm trying to understand if we want to log the type of stressor as a variable. Does it matter? Or does it not matter as long as the target kernel event rate is reached?

  2. Would stress-ng, the microservice demo, and redis be used as separate workloads or together?

This is just for me to understand how we're setting up the benchmark tests but I fully trust @incertum and team with owning the test scenarios etc. Thank you! :)
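To make the --class question above concrete, here is an illustrative fragment of how the stress-ng container args could switch from the CPU-bound matrix stressor to a syscall-heavier class; the image, chosen class, and durations are assumptions, not the current deployment:

```yaml
# Illustrative only: selecting stressors by class instead of --matrix.
containers:
  - name: stress-ng
    image: ghcr.io/colinianking/stress-ng:latest   # assumed image
    args:
      # current approach (mostly CPU/FPU work, relatively few syscalls):
      #   --matrix 1
      # alternative: run all stressors of one class, one instance each;
      # with --sequential, each stressor in the class runs for the timeout
      - "--class"
      - "io"
      - "--sequential"
      - "1"
      - "--timeout"
      - "15m"
```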

AntonioDiTuri commented 4 months ago

Hi, I'm trying to sum up here a very interesting discussion we had around the proposal for the benchmark test in the working group's public Slack channel. Thanks @leonardpahlke for suggesting public runners, @nikimanoledaki for steering the discussion, and all the others participating: @rossf7, @dipankardas011.

This is the 3rd proposal: Modular GitHub Action workflow (public runners)

Here you can find an overview drawn by @leonardpahlke:

(Diagram: 2024-02-14 TAG ENV WG Green Reviews Structure Draft)

Workflow:

News:

Having multiple pipelines is more complex, and we need to rely on others more (which is a big deal if we plan to support more projects in the future: less scalable, more operations). Security-wise it's not good either (we move away from a single point of auth -> transitive dependency).

The location and version of a reusable workflow file to run as a job. Use one of the following syntaxes: {owner}/{repo}/.github/workflows/{filename}@{ref} for reusable workflows in public and private repositories. ./.github/workflows/{filename} for reusable workflows in the same repository.
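As a hedged sketch, the green-reviews pipeline could then call a benchmark workflow kept in this repo using that reusable-workflow syntax; the workflow file name, input, and ref below are placeholders:

```yaml
# Hypothetical caller job using the reusable-workflow syntax quoted above.
jobs:
  falco-benchmark:
    uses: falcosecurity/cncf-green-review-testing/.github/workflows/benchmark.yml@main
    with:
      falco_version: "0.37.0"   # example value; the input must be defined by the callee
```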

This approach emphasizes sustainability, collaboration, and operational simplicity, which are crucial for the ongoing success and scalability of the green-reviews-tooling initiative.

incertum commented 4 months ago

@nikimanoledaki I have responded here https://github.com/falcosecurity/cncf-green-review-testing/discussions/13#discussioncomment-8551883 to your feedback re the synthetic workloads composition, thank you!

incertum commented 4 months ago

@AntonioDiTuri 🚀 thank you very much for taking the time and posting an update here https://github.com/falcosecurity/cncf-green-review-testing/issues/16#issuecomment-1957398220. Amazing, we are looking forward to receiving clearer templates or instructions. As a heads-up, we need to be mindful of @maxgio92's availability as well, not just mine, since Max is our infra expert and we will need his help 🙃.

Some initial feedback:

poiana commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 2 weeks ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

incertum commented 2 weeks ago

/remove-lifecycle rotten

/remove-lifecycle stale