[Tracking] Collect & visualise sustainability-related metrics

nikimanoledaki commented 10 months ago

This issue aims to investigate the sustainability-related metrics that could be implemented as part of our reference architecture.

The WG has so far identified the following use cases that each require a slightly different set of metrics:

SRE Metrics

Metrics used by CNCF project maintainers to make improvements at the application level. For example, as mentioned by @incertum in the issue linked before: Falco's own internal metrics (CPU, memory, and counters), traditional SRE metrics (CPU/mem usage), and energy metrics.

More information about this can be found in the Metrics section of the Green Reviews design document.

[ ] CPU usage
- Typically measured as a percentage of one CPU, it can be compared with the number of available CPUs on the host. Falco's hot path is single-threaded, so it should not be able to exceed the capacity of one full CPU.
[ ] Memory RSS
- Resident Set Size is the portion of memory held in RAM by a process.
[ ] Memory VSZ
- Virtual Memory Size is the total virtual address space allocated to a process, including memory resident in RAM, memory swapped out, and memory that is reserved but not yet used.
[ ] container_memory_working_set_bytes in Kubernetes settings
- This is almost equivalent to the cgroups container memory_used metric natively exposed in Falco metrics.
[ ] Traffic rate
- packets/second
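
As a rough illustration of the definitions above, the sketch below derives RSS and VSZ from the text of a Linux `/proc/<pid>/status` file and expresses CPU time as a percentage of one CPU. The function names are hypothetical and this is not how Falco computes its own metrics; it only assumes the standard `VmRSS` and `VmSize` lines reported in kB.

```python
import re


def parse_memory_kib(status_text):
    """Return (rss_kib, vsz_kib) from the text of /proc/<pid>/status.

    Missing fields come back as None (for example, kernel threads have no VmRSS).
    """
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"^(VmRSS|VmSize):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields.get("VmRSS"), fields.get("VmSize")


def cpu_percent_of_one_core(cpu_seconds_start, cpu_seconds_end, wall_seconds):
    """CPU time consumed over an interval, as a percentage of one CPU."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * (cpu_seconds_end - cpu_seconds_start) / wall_seconds


def share_of_host(cpu_percent, host_cpus):
    """Express a per-core percentage as a percentage of total host capacity."""
    if host_cpus <= 0:
        raise ValueError("host_cpus must be positive")
    return cpu_percent / host_cpus
```

A single-threaded hot path should keep `cpu_percent_of_one_core` at or below 100, which is why dividing by the number of host CPUs gives a useful second view of the same figure.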

Sustainability Metrics

SCI score: https://github.com/cncf-tags/green-reviews-tooling/issues/33
Impact framework

Other emerging indices that can be used to assess an application's sustainability footprint may also be considered in the future.
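
For reference, the Green Software Foundation defines the SCI score as `SCI = ((E * I) + M) per R`, where `E` is energy consumed, `I` is the carbon intensity of that energy, `M` is embodied emissions, and `R` is the functional unit. A minimal sketch of that arithmetic follows; the function name, parameter names, and units are assumptions chosen for illustration and are not taken from the linked issue.

```python
def sci_score(energy_kwh, intensity_g_per_kwh, embodied_g, functional_units):
    """Software Carbon Intensity: ((E * I) + M) per R.

    energy_kwh          -- E, energy consumed by the software (kWh)
    intensity_g_per_kwh -- I, carbon intensity of the energy (gCO2e/kWh)
    embodied_g          -- M, embodied emissions attributed to the software (gCO2e)
    functional_units    -- R, e.g. number of requests or events processed
    """
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units
```

The result is in gCO2e per functional unit, so the choice of `R` (per event, per request, per hour of runtime) determines what the score can be compared against.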

Benchmark-Specific Metrics

Metrics needed to set up the benchmark tests for each CNCF project.

[x] https://github.com/falcosecurity/cncf-green-review-testing/issues/11

These metrics are often inter-related. For example, data about energy consumption can be used in each of these scenarios.

This issue can be used to track ideas and discussions about which metrics the Green Reviews pipeline should collect. That being said, prioritisation is key so that the WG stays on track with the milestones the group set in the Roadmap.

nikimanoledaki commented 9 months ago

Looking at SRE Metrics, @incertum, do you already have a Grafana dashboard for these metrics? We would need to either create Prometheus queries or access them through the Falco internal metrics.
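
To make the Prometheus option more concrete, here is a minimal sketch of the kind of queries this could involve. It assumes the cluster exposes the usual cAdvisor metrics `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes`; the namespace, Pod name pattern, Prometheus address, and helper names are hypothetical.

```python
from urllib.parse import urlencode


def cpu_query(namespace, pod_regex, window="5m"):
    """PromQL for CPU cores used, averaged over `window`, summed across matching Pods."""
    selector = f'namespace="{namespace}",pod=~"{pod_regex}"'
    return f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[{window}]))"


def memory_query(namespace, pod_regex):
    """PromQL for working-set memory in bytes, summed across matching Pods."""
    selector = f'namespace="{namespace}",pod=~"{pod_regex}"'
    return f"sum(container_memory_working_set_bytes{{{selector}}})"


def instant_query_url(base_url, promql):
    """URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query)."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"
```

The same PromQL strings could be pasted into a Grafana panel, so the queries would only need to be written once for both the dashboard and any scripted collection.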

incertum commented 9 months ago

@nikimanoledaki Falco does not yet have a Prometheus exporter. We may have one in Falco 0.38 in May, but I need to check with the other maintainers. Meanwhile, Falco metrics are available as internal Falco rules whose output can be piped to log-rotated files (JSONL formatted).

I propose making the CNCF SRE Metrics independent of Falco and of Falco's own metrics: report the CPU and memory usage of project binaries through your preferred framework, and create your preferred Grafana dashboards. WDYT?

nikimanoledaki commented 9 months ago

I wonder if there are any useful metrics in the default metrics of Kubernetes, for example:

- from the native components:
  - https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/
  - https://github.com/kubernetes/kubernetes/blob/master/test/instrumentation/testdata/stable-metrics-list.yaml
- from kube-state-metrics (ksm), where we are more likely to find something: https://github.com/kubernetes/kube-state-metrics/tree/main

It would be nice to somehow surface the internal Falco metrics that way, but I'm not sure if that would be possible since those would be logs, not metrics.

What is the filesystem location where the internal Falco metrics are exported? These metrics are at the Pod level, correct?

Which Falco Metrics would you find useful or relevant for either 1) performance monitoring or 2) setting up the benchmark tests?

Looking at this, I imagine "kernel.evt_rate" is one that we would definitely need for the benchmark tests.
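
On the logs-versus-metrics question, one possible approach is to parse the JSONL files and pull out numeric fields as a series. The sketch below is only an assumption about the record layout: it looks for the field either at the top level of each JSON object or under an `output_fields` key, and the function name is hypothetical.

```python
import json


def extract_series(jsonl_lines, field):
    """Collect numeric values of `field` from JSONL records, in file order.

    Assumed layout: the field sits either at the top level of each record
    or inside a nested "output_fields" object.
    """
    values = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines, e.g. written while the file rotates
        if not isinstance(record, dict):
            continue
        fields = record.get("output_fields", record)
        if isinstance(fields, dict) and isinstance(fields.get(field), (int, float)):
            values.append(fields[field])
    return values
```

For example, `extract_series(open(path), "kernel.evt_rate")` would return the event-rate samples from one log file, which could then be averaged per benchmark run or pushed to a metrics backend.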

AntonioDiTuri commented 9 months ago

I created two deep-dive tickets on the steps to collect the metrics and visualize them. I made a distinction between the Kepler and Kubernetes-related metrics, which follow a more standard approach, and the Falco metrics, which need more thought on the process. I hope that is clear; please let me know.
