[TRACKING] `metrics` framework future refactors / cleanups

incertum commented 1 month ago

Motivation

Tracking pending cleanups, refactors or additional features for the Falco internal metrics framework https://falco.org/docs/metrics/

incertum commented 1 month ago

CC @FedeDP @sgaist @leogr

incertum commented 1 month ago

[ ] make libs_metrics_collector static https://github.com/falcosecurity/falco/pull/3192#discussion_r1599855210
[ ] sanitize_metric_name, which follows the OpenMetrics standard, was introduced in Falco. Perhaps it should also be pushed down to the libs metrics collector.
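
For readers unfamiliar with the naming constraint, here is a minimal sketch of what such a sanitizer could look like. It is not Falco's actual sanitize_metric_name implementation; the function body, the decision to collapse repeated underscores, and the leading-underscore rule are all assumptions made for illustration. The only fact relied on is that Prometheus/OpenMetrics metric names must match `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```python
import re

# Characters outside this set are not allowed in a metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")
_REPEATED_UNDERSCORES = re.compile(r"_+")


def sanitize_metric_name(name: str) -> str:
    """Hypothetical sanitizer; illustrates the idea, not Falco's code."""
    # Replace every disallowed character (dots, dashes, spaces, ...) with "_".
    sanitized = _INVALID_CHARS.sub("_", name)
    # Assumption: collapse runs of underscores to keep names readable.
    sanitized = _REPEATED_UNDERSCORES.sub("_", sanitized)
    # A metric name must not be empty or start with a digit.
    if not sanitized or sanitized[0].isdigit():
        sanitized = "_" + sanitized
    return sanitized
```

With this sketch, a dotted name such as `scap.n_evts` becomes `scap_n_evts`. Whether libs should apply the same rules is exactly the open question in the item above.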

Great discussions are happening in https://github.com/falcosecurity/falco/pull/3140. A few follow-up items:

[ ] Falco's wrapper metric num_evts is still missing from the Prometheus output since it requires larger code refactors.
[ ] Falco can run with a single event source (either syscalls or one plugin source) or with multiple event sources. Initially, the goal was to have metrics work with either syscalls or a plugin source. However, challenges arise when dealing with two or more event sources. For example, should the metrics display the number of events for the syscalls source or the plugin source? What should be the value of evt.source? Since we can only provide one view of the metrics at a time, instead of adding nested fields or implementing another solution, Falco metrics should focus on the syscalls or the primary plugin source only. When running Falco with syscalls and one plugin, the new plugin metrics API should be used to retrieve plugin metrics in addition to the syscalls metrics.
- [ ] see https://github.com/falcosecurity/libs/pull/1828
- [ ] multi-platform metrics support https://github.com/falcosecurity/libs/pull/1870; in addition, https://github.com/falcosecurity/falco/issues/2821 still needs fixing
[ ] libs now includes a new metrics collector class that consolidates metrics across the libs codebase. Falco needs a similar solution. In https://github.com/falcosecurity/falco/pull/3140, @sgaist referred to it as a "proper Falco metrics model," especially since we now have more output channels for metrics (e.g., Prometheus, web server, output rule, output file). The goal is to simplify the codebase and reduce code duplication (e.g. see code duplication and fragmentation in falco_metrics.cpp, stats_writer.cpp, stats_manager.cpp).
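
To make the "proper Falco metrics model" idea more concrete, here is a rough sketch of the shape it could take: one snapshot is collected once and every output channel only renders it. Everything here is hypothetical (the `Metric` type, `collect_snapshot`, the renderer names, the metric names and the placeholder values) and does not mirror the current code in falco_metrics.cpp, stats_writer.cpp or stats_manager.cpp.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    value: int


def collect_snapshot():
    # Placeholder values; a real collector would read live counters once
    # and hand the same snapshot to every output channel.
    return [
        Metric("falco_num_evts_total", 1200),
        Metric("falco_memory_rss_bytes", 52428800),
    ]


def render_prometheus(snapshot):
    # One "name value" line per metric, as in the Prometheus text format.
    return "".join(f"{m.name} {m.value}\n" for m in snapshot)


def render_json(snapshot):
    # Flat key/value object, e.g. for the output rule or the output file.
    return json.dumps({m.name: m.value for m in snapshot}, sort_keys=True)


# Each channel is only a renderer; no channel collects metrics on its own.
RENDERERS = {
    "prometheus": render_prometheus,
    "output_rule": render_json,
    "output_file": render_json,
}


def emit_all(snapshot):
    return {channel: render(snapshot) for channel, render in RENDERERS.items()}
```

The point of the design is that adding a new output channel means adding one renderer, without touching how metrics are gathered.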

leogr commented 1 month ago

The metrics framework should target the primary event source only, as the metrics snapshots can realistically only expose one current view, especially for Prometheus. Plugin metrics should instead be supported via the new plugin metrics support; see new(plugin_api): add plugin metrics support libs#1828

[ ] A consolidated and proper Falco metrics model is needed given that we now have even more outputs channels for the metrics (e.g. Prometheus)

Hey @incertum could you elaborate more on these two points?

incertum commented 1 month ago

@leogr I rewrote the text https://github.com/falcosecurity/falco/issues/3194#issuecomment-2111009270, is it clearer now? Happy to add more details.

leogr commented 1 month ago

Much clearer now, thank you!

Just one thought:

Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong? I believe that in the long run, all data sources should be first-class citizens, and it shouldn't be technically impossible to accommodate this.

incertum commented 1 month ago

Much clearer now, thank you!

Just one thought:

Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong? I believe that in the long run, all data sources should be first-class citizens, and it shouldn't be technically impossible to accommodate this.

We can emit multiple rule outputs or lines into the output file (I would not do it though), but for Prometheus there is just one endpoint to scrape at a time ... IMO there should be separate, plugin-specific metrics handling, something that was started in libs. Most metrics are syscalls source specific or generic (e.g. CPU and memory usage or rules counters) anyway. Right now, the number of events is the only metric I can think of that would be useful per plugin / source when you have multiple sources.

incertum commented 1 month ago

CC @sboschman (metrics for Falco w/ plugin only)

sboschman commented 1 month ago

From an operational point of view I like to have the falco metrics easily integrated with our metrics platform. So, I would like to thank everyone involved with exposing the falco metrics in a Prometheus compatible way.

I am not familiar with the falco code at all, so consider the following comments more as an outside view of things, not in any way directly mapping to any part of the code.

Falco metrics:

1. General metrics; unrelated to syscall or any plugin
- falco version
- start_timestamp
- ...
2. Process resource utilization metrics; unrelated to syscall or any plugin, preferably with standard naming that conforms to the default C library for Prometheus
- num_cpus
- cpu_seconds_total
- memory_bytes_total
- memory_used_bytes
- ...
3. Falco rule engine metrics; the syscall/plugin event source can be a labeled dimension of the time series
- events_processed_total{event_source="syscall"} or events_processed_total{event_source="k8saudit"}
- rule_matches_total{event_source="syscall"} or rule_matches_total{event_source="k8saudit"}
- ...
4. syscall specific metrics
- ...
5. custom plugin metrics; provided with the plugin API / SDK, implemented by the plugin
- cloudtrail_xxx
- github_repositories
- gcpaudit_pubsub_errors_total
- okta_xxx
- ...
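
To illustrate the "event source as a labeled dimension" idea for the rule engine metrics, here is a small sketch of a counter keyed by one label. The `LabeledCounter` class is hypothetical and not part of Falco or of any Prometheus client library; only the metric and label names are taken from the list above.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical counter with a single label dimension."""

    def __init__(self, name, label_name):
        self.name = name
        self.label_name = label_name
        self._values = defaultdict(int)

    def inc(self, label_value, amount=1):
        # One independent time series per label value.
        self._values[label_value] += amount

    def expose(self):
        # One Prometheus text-format line per label value, sorted for
        # stable output.
        return [
            f'{self.name}{{{self.label_name}="{value}"}} {count}'
            for value, count in sorted(self._values.items())
        ]


events = LabeledCounter("events_processed_total", "event_source")
events.inc("syscall", 3)
events.inc("k8saudit")
lines = events.expose()
```

Both sources end up in the same scrape output as separate series of one metric, so a single endpoint does not force a single view.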

Notes:

Process metrics (2) are not Falco specific, any application/process should be able to provide these metrics in a standard way. If you are familiar with these standard metrics, you can easily apply your existing knowledge to any application/process. E.g. for Golang and Java we even have default dashboards for this base set of metrics.
Falco rule engine metrics (3) are dimensioned by event source. An overall total can easily be calculated by the metrics platform, e.g. with the PromQL query `sum without(event_source) (events_processed_total{})`, and does not have to be explicitly exposed by Falco.
I realise syscall is the original falco event source, and that the plugin framework and support for other event sources were implemented later. As of 0.35 plugins can also output syscall events, so to me 'falco drivers' and 'plugins' are just a way to provide event input to the falco rule engine and syscall is just one of the event sources. Hence the metrics being split into items 3, 4 and 5.
Different plugins can provide the same event source, e.g. the k8saudit, k8saudit_eks and k8saudit_gke plugin all provide the k8s_audit event source. So (5) are plugin specific metrics, not event source specific metrics.
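
As a worked example of the aggregation mentioned in the note on (3), the sketch below reproduces in plain Python what `sum without(event_source)` does on the metrics platform side: drop one label and add up the series that become identical. The `sum_without` helper and the sample values are made up for illustration.

```python
from collections import defaultdict


def sum_without(samples, dropped_label):
    """Sum samples that are equal once `dropped_label` is removed."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        # Key on the metric name plus every label except the dropped one.
        remaining = tuple(
            sorted((k, v) for k, v in labels.items() if k != dropped_label)
        )
        totals[(name, remaining)] += value
    return dict(totals)


# Made-up samples: (metric name, labels, value).
samples = [
    ("events_processed_total", {"event_source": "syscall"}, 1200),
    ("events_processed_total", {"event_source": "k8saudit"}, 300),
]

overall = sum_without(samples, "event_source")
```

Here `overall` holds a single series for events_processed_total with no labels left and the value 1500.0, which is why Falco itself would not need to expose the total.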

incertum commented 1 month ago

A few more thoughts:

@mrgian is working on exposing (5), see the PRs linked above in https://github.com/falcosecurity/falco/issues/3194#issuecomment-2111009270
As we explained earlier, most of the current hiccups are because of a very complicated code refactor in the scap module that broke many of the metrics you listed under (2) when running Falco with a plugin only. @mrgian is also working on that, but it's a bit of a larger refactor. Meanwhile, CPU and memory usage can be consumed externally (see https://github.com/falcosecurity/cncf-green-review-testing/discussions/14#discussioncomment-8610132), but as said we are working on fixing the Falco native support for that as well.
