Open nikimanoledaki opened 9 months ago
Hi,
I'd love to work on this one. However, I'd like some help clarifying the requirements before jumping into it.
If someone else has a clearer picture and wants to pair on this, I'm also open to it :)
@raymundovr thank you for volunteering to contribute to this! 🥳
I notice in the parent issue that another script has been implemented for an "infra-component" and that this issue could work in a similar fashion, correct? If so, could you point me to it?
I think @AntonioDiTuri was referring to how we deployed some infrastructure-level components with Flux. Essentially, we can add a Kubernetes manifest in a directory watched by Flux and Flux will apply it in the cluster, like this ConfigMap: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml
As part of this issue, we would need to add a manifest for a Kubernetes CronJob in the clusters/base/ directory. The CronJob itself would contain the steps listed in the description.
Before jumping into it, I agree that we should refine the requirements a bit more. I shared this with Falco maintainer @incertum and am waiting for her feedback: https://github.com/falcosecurity/cncf-green-review-testing/discussions/13
@raymundovr hi! 👋 I created an issue to bootstrap ARC (https://github.com/cncf-tags/green-reviews-tooling/issues/58) but there are quite a few requirements and we have no guarantee that ARC will be up and running in time for us to go straight to the self-hosted runners solution by KubeCon.
I propose that we get a head start with the bash script + CronJob as a temporary solution. That way we're not blocked by the authorization with PAT keys, secrets, etc etc. And then when ARC is ready, we can port the individual steps to a GitHub Actions workflow (it's relatively easy to split/convert bash scripts into GA workflow steps). What matters the most is that we implement the steps themselves one way or another so that we can gather sample metrics!
To deploy the CronJob, we can add a CronJob manifest in clusters/projects/falco for Flux to deploy it. Flux can deploy raw manifests, similarly to how it's deploying this ConfigMap that contains the Kepler dashboard: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml
Then we can implement the steps one by one :)
What do you think? :)
I believe the most challenging part will be the last step, which is to find a way to gather + store the metrics per test. Pushgateway is one option, which @rossf7 suggested (Slack context), but really we will understand more once we get to that step!
Hi @nikimanoledaki Thank you for elaborating further.
Unfortunately I don't have a full overview of the cluster resources and its accessibility to give a more informed opinion on when ARC could be ready, therefore for me it's ok to start with the cron job, as suggested :)
Will also comment out the microservices.yaml, as discussed and identified last time.
Update: Waiting on Falco team to explain a bit further the purpose of the microservices in conjunction with the stress test, see here.
Hi @raymundovr, it makes total sense to start with the cron job. The steps in the job and in the GitHub Action should be very similar.
Just a heads up that I've started work on adding the self-hosted runner in https://github.com/cncf-tags/green-reviews-tooling/pull/63. I'll let you know once it's deployed and tested.
Hi @rossf7 @nikimanoledaki
I've been researching and playing with kepler at work during the last couple of days and have gained an understanding of the architecture and the options that could be considered to obtain, and set, appropriate baseline and test scenario measurements.
I'd like to share this with you and discuss further the possible next steps :)
First, I'd like to mention that I think that the idea of creating any task / job to gather kepler metrics to push them into Prometheus might be redundant, as kepler itself deploys with kepler-exporter, which is then scraped by Prometheus, so it is already possible to observe these metrics via Grafana. The deployment that you've already launched shows this, see here.
With this in consideration, I'd like to discuss the possibility to take the corresponding measurements from Prometheus (using PromQL / API queries) and present them in a consistent way, for example:
sleep for a certain time and take the measurements.
The measurements could be taken by a task running on a separate node, querying Prometheus, as mentioned before, for the corresponding Pod or Namespace where Falco is running. Then it will output a table-like format which can be used as a way to calculate, for example, max, min, average and median.
We can then decide where to store this output, or perhaps make it available as a service? We'll still need a way to trigger this measurements task, though.
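As a rough illustration of the "max, min, average and median" step, here is a minimal Go sketch. The `Summary` type and `Summarize` function are hypothetical names, and the sample values in `main` are made up; in practice the samples would come from Prometheus queries.

```go
package main

import (
	"fmt"
	"sort"
)

// Summary holds basic statistics for one metric over a test run.
// (Hypothetical type, for illustration only.)
type Summary struct {
	Min, Max, Mean, Median float64
}

// Summarize computes min, max, mean and median of the samples.
// It returns false when there are no samples to summarize.
func Summarize(samples []float64) (Summary, bool) {
	if len(samples) == 0 {
		return Summary{}, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}

	mid := len(sorted) / 2
	median := sorted[mid]
	if len(sorted)%2 == 0 {
		median = (sorted[mid-1] + sorted[mid]) / 2
	}

	return Summary{
		Min:    sorted[0],
		Max:    sorted[len(sorted)-1],
		Mean:   sum / float64(len(sorted)),
		Median: median,
	}, true
}

func main() {
	// Made-up sample values standing in for measurements from Prometheus.
	samples := []float64{2.5, 1.0, 4.0, 3.5}
	s, ok := Summarize(samples)
	if !ok {
		fmt.Println("no samples")
		return
	}
	fmt.Printf("%-8s %-8s %-8s %-8s\n", "min", "max", "mean", "median")
	fmt.Printf("%-8.2f %-8.2f %-8.2f %-8.2f\n", s.Min, s.Max, s.Mean, s.Median)
}
```

Printing a fixed-width table like this keeps the output easy to read in job logs while we decide where the results should be stored.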
What do you think?
Hi @raymundovr, thanks for sharing this.
Take the measurements from Prometheus on a node running Falco as its sole Deployment
Yes, you can query Prometheus for the Kepler metrics and we should have just Falco running on its node.
We've set node labels for this so you can add the selectors cncf-project=wg-green-reviews and cncf-label-sub=internal to the CronJob and the stress test pods?
For microservices-demo, it doesn't look like the Helm chart lets you set a node selector :( They do also support kustomize, which might let us patch the deployments?
https://github.com/GoogleCloudPlatform/microservices-demo/tree/main/helm-chart
Then it will output a table-like format which can be used as a way to calculate, for example, max, min, average and median.
Perfect, logging the results will let us validate the test steps and the measurements.
We can then decide where to store this output, or perhaps make it available as a service?
Yes, we still need to work on this part of the design. We could write the results to S3, for example, but let's tackle this as a later step.
cc @nikimanoledaki @AntonioDiTuri
Hi,
Thank you @rossf7 for the suggestion. I'm not sure that the labels are exported as selectors into Prometheus. I'm currently under the impression that querying by namespace would be a quick and viable way.
I have started to put something together, please check https://github.com/raymundovr/sustainability-metrics/blob/main/main.go
In order to continue I'll probably need access to Prometheus, is there any chance to get that? I don't mind setting up any kind of tunnel if necessary.
What do you think of this approach?
cc @nikimanoledaki
Hi @raymundovr, as discussed this morning you have readonly access to the cluster now.
Your code looks good! When you're ready we can move it to this repo. I think we could have the go module in the root and your code in cmd/main.go. WDYT?
For the github action we can use setup-go to install go and then run the binary. I don't think we need a container image yet. The action will also need a kubeconfig so we could use port forwarding for Prometheus or make the Prometheus API public?
I'm not sure that the labels are exported as selectors into Prometheus. I'm currently under the impression that querying by namespace would be a quick and viable way.
Yes, you're right, we can't use the k8s labels. I think using the container_namespace prometheus label as you're doing is good.
There is another factor: the Falco team would like 3 deployments of Falco on different nodes, as Falco has multiple drivers. See https://github.com/falcosecurity/cncf-green-review-testing/issues/2
So far just a single node with the modern-ebpf driver is provisioned. We could use the instance prometheus label, which has the node name. I don't really like that approach but I can't see a better option right now.
Thanks @raymundovr and @rossf7 for moving this forward! I took a look at the code and it looks like a good starting point.
The metrics that @raymundovr selected for the moment are:
Id: "kepler_dram",
Query: fmt.Sprintf(`sum by (pod_name, container_namespace)(irate(kepler_container_dram_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`, *projectNamespace),
Id: "kepler_package",
Query: fmt.Sprintf(`sum by (pod_name, container_namespace)(irate(kepler_container_package_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`, *projectNamespace),
Id: "cpu_utilization_node",
Query: fmt.Sprintf(`instance:node_cpu_utilisation:rate5m{job="node-exporter", instance="%s", cluster=""} != 0`, *node),
@nikimanoledaki since you have some experience with the kepler metrics, do you think this is enough?
For a first implementation I guess it is fine to print the metrics in the output, then we can refine it later :)
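For the "print the metrics in the output" step, a minimal sketch of decoding a Prometheus instant-query response and printing one line per pod could look like this. The `Sample` type and `ParseVector` function are hypothetical, and the JSON body in `main` is a made-up example that follows the documented response shape for a vector result.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse mirrors the parts of the Prometheus instant-query
// response that are needed for a vector result.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// Sample is one measurement for a pod. (Hypothetical type.)
type Sample struct {
	Pod       string
	Namespace string
	Value     float64
}

// ParseVector decodes a vector response body into samples.
func ParseVector(body []byte) ([]Sample, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	if resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected result type %q", resp.Data.ResultType)
	}
	samples := make([]Sample, 0, len(resp.Data.Result))
	for _, r := range resp.Data.Result {
		raw, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("sample value is not a string")
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		samples = append(samples, Sample{
			Pod:       r.Metric["pod_name"],
			Namespace: r.Metric["container_namespace"],
			Value:     v,
		})
	}
	return samples, nil
}

func main() {
	// Made-up response body, not real measurements.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"pod_name":"falco-abc","container_namespace":"falco"},"value":[1700000000,"0.42"]}]}}`)
	samples, err := ParseVector(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range samples {
		fmt.Printf("%s/%s\t%.2f\n", s.Namespace, s.Pod, s.Value)
	}
}
```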
Thank you @rossf7 and @AntonioDiTuri for taking a look and providing feedback. Indeed, what I'm outlining a bit here is:
What do you think @nikimanoledaki ?
Update: cleaned things up a bit and added the kepler_container_joules_total metric.
The first benchmark test for Falco will be a baseline test.
We can start with a script that runs as a Cron Job in Kubernetes. In the future, we can automate this using self-hosted GitHub Action runners (see Proposal 1: Actions Runner Controller (ARC)).
Pre-requisite
clusters/projects/falco/cron.yaml
Benchmark Steps
Acceptance Criteria