falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0
227 stars 162 forks source link

[TRACKING] Create a more coherent stats model for libs and consumer #1463

Closed incertum closed 4 months ago

incertum commented 11 months ago

As stated in the most recent libs stats refactor/extension PR #1433: https://github.com/falcosecurity/libs/pull/1433#issue-1955107915, I am creating this issue to track future refactors that, instead of adding more metrics that we urgently need, aim to create a new, more coherent stats model (software interface/API) across scap, lisinsp, and consumers such as Falco.

CC @jasondellaluce

incertum commented 10 months ago

@Andreagit97 answering here re https://github.com/falcosecurity/libs/pull/1433#issuecomment-1807694658

If the final goal is to expose only a Prometheus endpoint using something like https://github.com/jupp0r/prometheus-cpp, are we doing the right thing by exposing a vector of struct scap_stats_v2? At this point we could directly create Prometheus structs in sinsp and expose them to the consumers, so they can add them to the registry and expose them.

The Falco client shall continue to support (1) a simple file sink and (2) the internal rule (JSON formatted) and (3) a prometheus exporter was requested as additional option, but it's not yet on the concrete roadmap, maybe Falco 0.38?

As infrastructures / environments are very complex and vary among adopters I believe each of these option is valid and needed. Personally I have experienced that Prometheus only will not solve all of your problems when looking at monitoring very large and diverse environments holistically and the "bring your own metrics" over the internal Falco rule can shine in terms of simplicity and ease of integration into the existing alert output sink that you have to setup anyways if you use Falco (especially non Kubernetes environments).

That being said if the Prometheus struct will have advantages why not? Furthermore, I think the current scap stats v2 schema is not bad and we would be able to convert to Prometheus easily.

@Andreagit97 since you brought this up I was actually already thinking about How To for Falco. Let me attempt to share some more thoughts. On the one hand we have already decided that we want the Prometheus exporter, on the other hand we have also experienced firsthand that such items are a significant LOE and may be only needed or relevant for some time until the next trend arises (see for example the old k8s client or mesos etc) -- we would not know beforehand.

And maybe instead of a full client we can create a solution that pushes some JSON we already have up (I have to first explore options and learn what's possible).

In summary, I think we should discuss our output sinks in general before rushing into decisions as we have to maintain them as well (more maintenance overhead than our current internal metrics btw) and all we want is some custom metrics to check if Falco goes wonky, it's not our tools primary mission to emit metrics. Maybe @leogr you would have additional thoughts?

Let's also checkout some related projects:

OpenTelemetry should be on our radar as well and lastly for example the jaeger project changed approaches frequently wrt to the output sinks.

incertum commented 10 months ago

There was an additional comment by @Andreagit97 https://github.com/falcosecurity/libs/pull/1433#pullrequestreview-1717437206

First the pain of scap <-> sinsp separation that doesn't really apply to the stats / metrics anyways becomes very evident. Perhaps significant renaming can help for sure.

And in addition yes that get... method will go away in the future in favor of a better approach, likely a lean stats CPP class that integrates with scap more elegantly. First however we need to agree on a design and everyone is so busy right now. That's actually why we agreed on this intermediary patch in the Oct 2023 core maintainers meeting as we will benefit from finally having these metrics in prod around the libsinsp state rather sooner than later.

Therefore, to reiterate the patch in https://github.com/falcosecurity/libs/pull/1433 is an intermediary step only that basically extends the buffer that was initially only created for the sinsp resource utilization metrics.

Andreagit97 commented 10 months ago

Uhm got it, they are all valid points, thank you for pointing them out!

In general, I think that we need to involve more the community in these choices, the ones listed above are just my opinions but maybe they are completely wrong... one idea could be to post somewhere a question like "How would you like to collect Falco metrics? A prometheus exporter? a custom rule? or whatever" WDYT?

And in addition yes that get... method will go away in the future in favor of a better approach, likely a lean stats CPP class that integrates with scap more elegantly. First, however, we need to agree on a design and everyone is so busy right now. That's actually why we agreed on this intermediary patch in the Oct 2023 core maintainers meeting as we will benefit from finally having these metrics in prod around the libsinsp state rather sooner than later.

I'm in favor of an intermediary patch to expose new metrics what I was just saying in the PR was that IMO we need to start to have a little bit clearer what we want to expose from libs and in which form, we have many metrics in many different places and it starts to become difficult to not introducing bugs or to expose stable APIs for libs users. So the idea was, it's ok to have other metrics but let's try to find a glue between all the things we want to expose. I will try to play a little bit with the actual methods we have, to see if we can find a way to collect them all under a unique place. The aim is still to have it in 0.14.0 I will try to propose some possible approaches in the next few days. If you are on PTO I can try to lead the workflow, if you are fine with that of course. Let's see what we can do :)

incertum commented 10 months ago

[1] I always encourage collaboration with SREs to gain insights into their day-to-day approaches. For instance, writing to (log rotated) files and subsequently monitoring them for log transport options is a well-established practice in infrastructure management.

[2] Prometheus: There may be more options to support a Prometheus exporter than implementing a full Prometheus client or adding a new dependency to the project. For instance, Prometheus also supports push, not just pull, and /metrics could be exposed via atomically writing to a file with a POM-compatible string per line containing the metrics. If this approach is suitable for Falco, it would be fast, easy, and low-maintenance. Given that these potential approaches are based on some informal discussions, I would recommend conducting initial experiments.

[3]

I understand that for example scap has undergone numerous refactoring efforts this year. However, transparent plans outlining these changes were not shared or discussed beforehand. I'm curious about the rationale behind treating metrics differently in this refactoring.

Proposing that we move forward with the current PR, deploy and test it in production, and gather real-world data to inform our decision-making process. This data will help us ensure that the project's best interests are met.

I'm eager to collaborate with the team in December and beyond to develop a more cohesive stats model that aligns with our overall objectives. I recognize that the current PR involves significant cleanup of previous metrics code, which is a valuable contribution. Deferring the implementation of a comprehensive stats model until we have gathered more data will allow us to make more informed decisions and avoid overburdening this PR.

poiana commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum commented 7 months ago

/remove-lifecycle stale

incertum commented 6 months ago

More ideas for possible follow up PRs by @FedeDP:

@FedeDP to address above in a follow up PR to https://github.com/falcosecurity/libs/pull/1652

FedeDP commented 6 months ago

Tackled second point here: https://github.com/falcosecurity/libs/pull/1745

incertum commented 4 months ago

/milestone 0.15.0

poiana commented 4 months ago

@incertum: The provided milestone is not valid for this repository. Milestones in this repository: [0.17.0, TBD, next-driver]

Use /milestone clear to clear the milestone.

In response to [this](https://github.com/falcosecurity/libs/issues/1463#issuecomment-2111014566): >/milestone 0.15.0 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
FedeDP commented 4 months ago

/milestone 0.17.0