falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

[TRACKING] `metrics` framework future refactors / cleanups #3194

Open incertum opened 1 month ago

incertum commented 1 month ago

Motivation

Tracking pending cleanups, refactors or additional features for the Falco internal metrics framework https://falco.org/docs/metrics/

incertum commented 1 month ago

CC @FedeDP @sgaist @leogr

incertum commented 1 month ago

Great discussions happening in https://github.com/falcosecurity/falco/pull/3140. A few follow-up items:

leogr commented 1 month ago
  • The metrics framework should target the primary event source only, as the metrics snapshots can realistically only expose one current view, especially for Prometheus. Plugin metrics should instead be supported via the new plugin metrics support; see new(plugin_api): add plugin metrics support libs#1828
  • [ ] A consolidated and proper Falco metrics model is needed, given that we now have even more output channels for the metrics (e.g. Prometheus)

Hey @incertum could you elaborate more on these two points?

incertum commented 1 month ago

@leogr I rewrote the text in https://github.com/falcosecurity/falco/issues/3194#issuecomment-2111009270. Is it clearer now? Happy to add more details.

leogr commented 1 month ago

Much clearer now, thank you!

Just one thought:

  • Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong? I believe that in the long run, all data sources should be first-class citizens, and it shouldn't be technically impossible to accommodate this.

incertum commented 1 month ago

We can emit multiple rule outputs or lines into the output file (I would not do it, though), but for Prometheus there is just one endpoint to scrape at a time... IMO there should be separate, plugin-specific metrics handling, something that was started in libs. Most metrics are either specific to the syscalls source or generic (e.g. CPU and memory usage or rule counters) anyway. Right now, the number of events is the only metric I can think of that would be useful per plugin / source when you have multiple sources.
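
To make the "one view" idea concrete, here is a minimal sketch of a single snapshot that carries generic metrics once and the event count once per source. All names here (`build_snapshot`, the `n_evts.<source>` key pattern, the sample values) are hypothetical and only for illustration; this is not how Falco builds its snapshots.

```python
import json


def build_snapshot(generic, events_per_source):
    """Merge generic metrics and per-source event counts into one flat mapping.

    Hypothetical layout: generic metrics keep their own key, while each
    source contributes exactly one extra key for its event count.
    """
    snapshot = dict(generic)
    for source, count in sorted(events_per_source.items()):
        snapshot[f"n_evts.{source}"] = count
    return snapshot


# Made-up sample values, purely illustrative.
snapshot = build_snapshot(
    {"cpu_usage_perc": 1.5, "memory_rss_bytes": 52428800},
    {"syscall": 120000, "k8saudit": 340},
)
print(json.dumps(snapshot, sort_keys=True))
```

With this shape, adding a second source only grows the snapshot by one key, and the generic metrics stay unduplicated.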

incertum commented 1 month ago

CC @sboschman (metrics for Falco w/ plugin only)

sboschman commented 1 month ago

From an operational point of view, I would like to have the Falco metrics easily integrated with our metrics platform. So, I would like to thank everyone involved in exposing the Falco metrics in a Prometheus-compatible way.

I am not familiar with the Falco code at all, so consider the following comments an outside view of things that does not map directly to any part of the code.

Falco metrics:

  1. General metrics; unrelated to syscall or any plugin
    • falco version
    • start_timestamp
    • ...
  2. Process resource utilization metrics; unrelated to syscall or any plugin, preferably with standard naming that conforms to the default C library for Prometheus
    • num_cpus
    • cpu_seconds_total
    • memory_bytes_total
    • memory_used_bytes
    • ...
  3. Falco rule engine metrics; the syscall/plugin event source can be a labeled dimension of the time series
    • events_processed_total{event_source="syscall"} or events_processed_total{event_source="k8saudit"}
    • rule_matches_total{event_source="syscall"} or rule_matches_total{event_source="k8saudit"}
    • ...
  4. syscall specific metrics
    • ...
  5. custom plugin metrics; provided with the plugin API / SDK, implemented by the plugin
    • cloudtrail_xxx
    • github_repositories
    • gcpaudit_pubsub_errors_total
    • okta_xxx
    • ...
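
As a rough illustration of category 3 in the list above, the sketch below renders a counter with an `event_source` label in the Prometheus text exposition format. The `Counter` class and the sample counts are hypothetical and written only for this example; this is not Falco's implementation, and a real deployment would more likely rely on an existing Prometheus client library.

```python
from collections import defaultdict


def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """Minimal labeled counter (illustrative only, not Falco's code)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(int)

    def inc(self, amount=1, **labels):
        # Each distinct label set is its own time series.
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in key)
            suffix = "{" + labels + "}" if labels else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


events = Counter("events_processed_total", "Events processed by the rule engine.")
events.inc(3, event_source="syscall")
events.inc(event_source="k8saudit")
print(events.render(), end="")
```

Because the source is a label rather than part of the metric name, a single scrape endpoint can expose every source side by side, and queries can aggregate across sources or filter on one.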

Notes:

incertum commented 1 month ago

A few more thoughts: