[Action/Falco] Investigate benchmark test data collection

nikimanoledaki commented 9 months ago

The first benchmark test for Falco will be a baseline test.

We can start with a script that runs as a Cron Job in Kubernetes. In the future, we can automate this using self-hosted GitHub Action runners (see Proposal 1: Actions Runner Controller (ARC)).

Pre-requisite

[ ] Add a Kubernetes Cron Job manifest in clusters/projects/falco/cron.yaml

Benchmark Steps

[ ] 1. For the given duration (e.g. 15 min): sleep OR create some kernel stress
[ ] 2. Write the Prometheus metrics in the output of the job for each of the metrics (e.g. Kepler's kepler_container_joules_total ) for the given duration (e.g. 15min).

Acceptance Criteria

Provide a script with reproducible steps
Have a set duration for the workload, for example, 15 minutes
Track and export the metrics that are the output of this test in a consistent manner

raymundovr commented 9 months ago

Hi,

I'd love to work on this one, however, I'd like to get some help to clarify the reqs before jumping into it.

I notice in the parent issue that another script has been implemented for an "infra-component" and that this issue could work in a similar fashion, correct? If so, could you point me out to this? ~~2. For suggested step 2, what do you mean with point a) "do not deploy the microservice workload"?~~ Update: found it within the document.

If someone else has it more clear and wants to pair on this, I'm also open to it :)

nikimanoledaki commented 9 months ago

@raymundovr thank you for volunteering to contribute to this! 🥳

I notice in the parent issue that another script has been implemented for an "infra-component" and that this issue could work in a similar fashion, correct? If so, could you point me out to this?

I think @AntonioDiTuri was referring to how we deployed some infrastructure-level components with Flux. Essentially, we can add a Kubernetes manifest in a directory watched by Flux and Flux will apply it in the cluster, like this ConfigMap: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml

As part of this issue, we would need to add a manifest for a Kubernetes CronJob in the clusters/base/ directory. The CronJob itself would contain the steps listed in the description.

Before jumping into it, I agree that we should refine the requirements a bit more. Shared this with Flaco maintainer @incertum and waiting for her feedback: https://github.com/falcosecurity/cncf-green-review-testing/discussions/13

nikimanoledaki commented 9 months ago

@raymundovr hi! 👋 I created an issue to bootstrap ARC (https://github.com/cncf-tags/green-reviews-tooling/issues/58) but there are quite a few requirements and we have no guarantee that ARC will be up and running in time for us to go straight to the self-hosted runners solution by KubeCon.

I propose that we get a head start with the bash script + CronJob as a temporary solution. That way we're not blocked by the authorization with PAT keys, secrets, etc etc. And then when ARC is ready, we can port the individual steps to a GitHub Actions workflow (it's relatively easy to split/convert bash scripts into GA workflow steps). What matters the most is that we implement the steps themselves one way or another so that we can gather sample metrics!

To deploy the CronJob, we can add a CronJob manifest in clusters/projects/falco for Flux to deploy it. Flux can deploy raw manifest, similarly to how it's deploying this ConfigMap that contains the Kepler dashboard: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml

Then we can the steps one by one :)

What do you think? :)

nikimanoledaki commented 9 months ago

I believe the most challenging part will be the last step, which is to find a way to gather + store the metrics per test. Pushgateway is one option, which @rossf7 suggested (Slack context), but really we will understand more once we get to that step!

raymundovr commented 9 months ago

Hi @nikimanoledaki Thank you for elaborating further.

Unfortunately I don't have a full overview of the cluster resources and its accessibility to give a more informed opinion on when ARC could be ready, therefore for me it's ok to start with the cron job, as suggested :) ~~Will also comment out the microservices.yaml, as discussed and identified last time.~~ Update: Waiting on Falco team to explain a bit further the purpose of the microservices in conjuction with the stress test, see here.

rossf7 commented 9 months ago

Hi @raymundovr, makes total sense to start with the cron job and the steps in the job and the github action should be very similar.

Just a heads up that I've started work on adding the self hosted runner in https://github.com/cncf-tags/green-reviews-tooling/pull/63 I'll let you know once its deployed and tested.

raymundovr commented 9 months ago

Hi @rossf7 @nikimanoledaki

I've been researching and playing with kepler at work during the last couple of days and have gained an understanding on the architecture and the options that could be considered to obtain, and set, an appropiate base line and test scenarios measurements. I'd like to share this with you and discuss further the possible next steps :)

First, I'd like to mention that I think that the idea of creating any task / job to gather kepler metrics to push them into Prometheus might be redundant as kepler itself deploys with kepler-exporter, which is then scraped by Prometheus, then it is possible to observe these metrics via Grafana. The deployment that you've already launched already shows this, see here.

With this in consideration, I'd like to discuss the possibility to take the corresponding measurements from Prometheus (using PromQL / API queries) and present them in a consistent way, for example:

Take the measurements from Prometheus on a node running Falco as its sole Deployment (i.e., nothing else is running). This came to my mind after reading the comment from the Falco team on the stress test and its interaction with Falco.
Launch a CronJob to sleep for a certain time and take the measurements.
Launch the stress test, for the same amount of time as the sleep, and take measurements.
Launch the stress test + demo microservice deployment and take measurements.

The measurements could be taken by a task running on a separate node, prompting Prometheus, as mentioned before, for the corresponding Pod or Namespace where Falco is running. Then it will output a table-like format which can be used as a way to calculate, for example, max, min, average and median.

We can then decide where to store this output, or perhaps make it available as a service? We'll still need a way to trigger this measurements task, though.

What do you think?

rossf7 commented 9 months ago

Hi @raymundovr, thanks for sharing this.

Take the measurements from Prometheus on a node running Falco as its sole Deployment

Yes, you can query Prometheus for the Kepler metrics and we should have just Falco running on its node.

We've set node labels for this so you can add the selectors cncf-project=wg-green-reviews and cncf-label-sub=internal to the CronJob and the stress test pods?

For microservices-demo it doesn't look like the helm chart lets you set a node selector :( they do also support kustomize and it might let us patch the deployments?

https://github.com/GoogleCloudPlatform/microservices-demo/tree/main/helm-chart

Then it will output a table-like format which can be used as a way to calculate, for example, max, min, average and median.

Perfect, logging the results will let us validate the test steps and the measurements.

We can then decide where to store this output, or perhaps make it available as a service?

Yes, we still need to work on this part of the design. We could write the results to S3 for example but lets tackle this as a later step.

cc @nikimanoledaki @AntonioDiTuri

raymundovr commented 9 months ago

Hi,

Thank you @rossf7 for the suggestion. I'm not sure that the labels are exported as selectors into Prometheus. I'm currently under the impression that querying by namespace would be a quick and viable way.

I have started to put something togeher, please check https://github.com/raymundovr/sustainability-metrics/blob/main/main.go

In order to continue I'll probably need access to Prometheus, is there any chance to get that? I don't mind setting up any kind of tunnel if necessary.

What do you think of this approach?

cc @nikimanoledaki

rossf7 commented 9 months ago

Hi @raymundovr, as discussed this morning you have readonly access to the cluster now.

Your code looks good! When you're ready we can move it to this repo. I think we could have the go module in the root and your code in cmd/main.go. WDYT?

For the github action we can use setup-go to install go and then run the binary. I don't think we need a container image yet. The action will also need a kubeconfig so we could use port forwarding for Prometheus or make the Prometheus API public?

I'm not sure that the labels are exported as selectors into Prometheus. I'm currently under the impression that querying by namespace would be a quick and viable way.

Yes, you're right we can't use the k8s labels. I think using the container_namespace prometheus label as you're doing is good.

There is an another factor which is the Falco team would like 3 deployments of Falco on different nodes as Falco has multiple drivers. See https://github.com/falcosecurity/cncf-green-review-testing/issues/2

So far just a single node with the modern-ebpf driver is provisioned. We could use the instance prometheus label which has the node name. I don't really like that approach but I can't see a better option right now.

AntonioDiTuri commented 9 months ago

Thanks @raymundovr and @rossf7 for moving this forward! I took a look at the code and it looks like a good starting point.

The metrics that @raymundovr selected for the moment are:

Id: "kepler_dram",
Query: (`sum by (pod_name, container_namespace)(irate(kepler_container_dram_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))

Id: "kepler_package",
Query: (`sum by (pod_name, container_namespace (irate(kepler_container_package_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`, *projectNamespace),
Id: "cpu_utilization_node",
Query: (`instance:node_cpu_utilisation:rate5m{job="node-exporter", instance="%s", cluster=""} != 0`, *node),

@nikimanoledaki since you have some experience with the kepler metrics, do you think this is enough?

For a first implementation I guess it is fine to print the metrics in the output, then we can refine it later :)

raymundovr commented 9 months ago

Thank you @rossf7 and @AntonioDiTuri for taking a look and providing feedback. Indeed, what I'm outlning a bit here is:

A time based test with a series of queries at a given interval.
A way to define queries that might be interesting to observe for a test. Currently all are static, but nothing impedes that in the future this changes.
A parameterized script, where important things can come as arguments. Can include desired queries later on.

What do you think @nikimanoledaki ?

raymundovr commented 9 months ago

Update: cleaned things a bit and added kepler_container_joules_total metric.

cncf-tags / green-reviews-tooling