[UMBRELLA] Falco collaboration with CNCF `tag-env-sustainability`

incertum commented 1 year ago

Motivation

Falco would like to partner with https://github.com/cncf/tag-env-sustainability to improve Falco's efficiency (reduce compute overhead and address resource constraints). This includes overcoming design challenges with new thinking so that Falco can further extend its threat detection capabilities with resource utilization budgets in mind.

Additional Context

EDIT Dec 19, 2023

New dedicated repo is up https://github.com/falcosecurity/cncf-green-review-testing/.
Check out the open issues https://github.com/falcosecurity/cncf-green-review-testing/issues for tracking.

mkorbi commented 1 year ago

Hey @incertum, we would like to support you here. First we will have to define a baseline so that in the future you will have a measurable outcome. A few days ago I opened the matching request for that method: https://github.com/cncf/tag-env-sustainability/issues/64#issuecomment-1482001047

So we can get started here and then move on. WDYT?

Next steps would be to work out how to define the SCI for Falco.
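For readers new to SCI: the Green Software Foundation's Software Carbon Intensity specification defines the score as SCI = ((E × I) + M) per R, where E is energy consumed, I is the carbon intensity of that energy, M is embodied emissions, and R is a functional unit. Below is a minimal sketch of that arithmetic. The function name and all input numbers are hypothetical, and what the functional unit R should be for Falco (for example, per number of events processed) is exactly the open question in this thread.

```python
def sci_score(energy_kwh: float,
              carbon_intensity_g_per_kwh: float,
              embodied_g: float,
              functional_units: float) -> float:
    """Return gCO2e per functional unit: ((E * I) + M) / R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    # Operational emissions: energy used times grid carbon intensity.
    operational_g = energy_kwh * carbon_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


# Hypothetical inputs, chosen only to illustrate the arithmetic.
score = sci_score(
    energy_kwh=2.0,                    # E: energy attributed to the software
    carbon_intensity_g_per_kwh=400.0,  # I: grid carbon intensity
    embodied_g=200.0,                  # M: amortized hardware emissions
    functional_units=1_000.0,          # R: assumed unit of work
)
print(f"{score:.2f} gCO2e per functional unit")
```

The hard part in practice is agreeing on R and obtaining a trustworthy E; the formula itself is simple.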

incertum commented 1 year ago

Hi @mkorbi, amazing ❤️!

SCI scores and anything related to them are new to me. Eager to learn how we can define the SCI for Falco. Previously, we focused on traditional resource utilization and health metrics (e.g. CPU and memory usage, event rates or event drop rates ...).

CC @falcosecurity/core-maintainers

incertum commented 1 year ago

@mkorbi Falco 0.35.0 is out featuring a new metrics option. By Falco 0.36.0 the metrics feature will transition into a stable state.

Following the discussion in https://github.com/cncf/tag-env-sustainability/issues/64, we have a few questions:

https://github.com/cncf/tag-env-sustainability/issues/64#issuecomment-1557919493 https://github.com/cncf/tag-env-sustainability/issues/64#issuecomment-1593469302 However, it would be great to start collecting data on what we can already measure (CPU, GPU, memory), as @TheFoxAtWork said.

This would benefit use cases like Falco. CPU utilization is directly tied to the rate of events collected, which can be influenced by configurations. However, it is also dependent on the workload's nature, which is beyond Falco's control. Falco now supports measuring CPU utilization, event rates, and eBPF rate of tracepoint invocations natively.

https://github.com/cncf/tag-env-sustainability/issues/64#issuecomment-1557808594 ... deliverable to be an initial guide in evaluating resource consumption for projects in a default configuration so that interested projects can receive such an evaluation from this TAG ...

What could the expected deliverables for Falco look like? One idea is to provide adopters with a mathematical equation focused on overall CPU and/or memory utilization. This equation would allow them to calculate an approximate cost and observe how the cost changes when adjusting Falco's monitoring configurations. This would enable adopters to make informed decisions about resource allocation and optimize their usage of Falco.

Adopters can choose between measuring CPU and memory of Falco separately or use Falco's native metrics feature.
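To make the "mathematical equation" idea concrete, here is a minimal sketch that assumes CPU usage grows roughly linearly with the event rate, i.e. cpu ≈ base + cost_per_event × event_rate. The linear form, the helper names, and the sample data points are all assumptions for illustration, not measurements from Falco; real behavior depends on the workload and kernel settings.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_cpu(intercept, slope, event_rate):
    """Estimate CPU percent for a given events-per-second rate."""
    return intercept + slope * event_rate


# Hypothetical samples: events per second vs. CPU percent of one core.
event_rates = [10_000, 20_000, 40_000, 80_000]
cpu_percent = [3.0, 5.0, 9.0, 17.0]

base, per_event = fit_linear(event_rates, cpu_percent)
estimate = predict_cpu(base, per_event, 60_000)
print(f"base={base:.2f}%, per-event={per_event:.6f}%, at 60k ev/s ~ {estimate:.1f}%")
```

An adopter could refit the two coefficients on their own hosts after changing Falco's monitoring configuration and compare the resulting estimates.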

In addition, Falco follows a strict badging system across its repositories. We could see benefits in including a TAG Environmental Sustainability engagement badge for our project ... WDYT? This badge would recognize our commitment to promoting and incorporating sustainable practices within the Falco community.

leonardpahlke commented 1 year ago

Hey @incertum, congrats on the latest release!

As part of TAG ENV, we are establishing a working group that will first investigate and then guide future projects like Falco and other CNCF projects to track their Cloud Native Sustainability footprint from release to release. The WG charter is currently being discussed, but as soon as it's up, this group will focus on this issue. cc @guidemetothemoon and @nikimanoledaki

--

Regarding your comments and questions. There are two topics we are mixing in this discussion:

First, we want to make sure we incorporate cloud native sustainability in the development of our software. This one is focused on maintainers building the open source software. It's about reporting, possible audits at some point, and enhancing the release process (adding a badge to the repo etc…).
Secondly, we would like to enable transparency to users to check on the cloud native sustainability footprint. This is aimed at the end users of the software to best configure the project for their needs and understand the tradeoffs in configuration and overall application.

Both are important, but we should not mix them in discussions. The TAG scope overarches both. Both rely on the same metrics to make assessments. Hearing that your latest release features metrics is great 👍.

The obvious next question is which metrics we care about. That's a larger topic, and the WG will look into it in more detail. In essence, if we talk just about metrics, we care about energy usage. If the space matures further, we will care about natural resources too, but on a system level, so this would not apply to a project like Falco. Energy usage it is. We also need to investigate energy effectiveness (not just energy efficiency, but being “mindful” of energy “invested”). In most cases, we cannot measure the usage directly and need to use proxies such as $ cost, or map it to vCPU usage etc. The more accurately we can measure, the better our estimates will be.

Let's circle back. If we “test bench” the project (the first topic mentioned above), we have information on the system underneath. We don't have to go through Falco to measure the energy usage. We just have to record which parameters we adjust in Falco (total events, event kinds, etc.) and map them to the measured usage. For end users, this may not be the case, since user experience also comes into play. We may want to split this scope into two initiatives (1. & 2.), which are both related (would love to hear your thoughts @TheFoxAtWork).

Since this is the first time the TAG is working with a project to assess their cloud native sustainability footprint, I expect that this will be a great learning experience :D. I am excited!

catblade commented 1 year ago

Would there be a possibility of presenting FALCO on one of the TAG meetings, so we can learn more?

incertum commented 1 year ago

Thank you @leonardpahlke and @catblade! Happy to join one of the next TAG meetings.

Meanwhile, you might want to consider exploring this proposal on kernel version testing, which offers additional insights into why a kernel monitoring tool differs from other software. One notable distinction is that resource utilization depends on the actual workload and kernel settings of adopters, both of which are unpredictable factors for Falco developers. Consequently, I agree that enabling ...

@leonardpahlke

"transparency to users to check on the cloud native sustainability footprint. This is aimed at the end users of the software to best configure the project for their needs and understand the tradeoffs in configuration and overall application."

would be particularly beneficial for Falco.

Traditional CPU and memory usages are typically top of mind for SREs. Therefore, if we could derive energy consumption from those measurements, it would be highly appreciated.
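As a rough illustration of deriving energy from CPU usage, the sketch below attributes a share of a host's dynamic power (maximum minus idle) to a process in proportion to the CPU cores it uses. This proportional model is a common simplification and an assumption here, not something the TAG or Falco has endorsed; the function name and all wattage figures are hypothetical.

```python
def estimate_energy_kwh(cpu_cores_used: float,
                        host_cores: int,
                        idle_watts: float,
                        max_watts: float,
                        hours: float) -> float:
    """Estimate energy (kWh) attributable to a process from its CPU usage.

    Assumes host power scales linearly with CPU utilization between
    idle_watts and max_watts, and that only the dynamic part is
    attributed to the process.
    """
    if host_cores <= 0:
        raise ValueError("host_cores must be positive")
    share = cpu_cores_used / host_cores          # fraction of host CPU used
    dynamic_watts = (max_watts - idle_watts) * share
    return dynamic_watts * hours / 1000.0        # Wh -> kWh


# Hypothetical host: 16 cores, 100 W idle, 300 W at full load.
# Hypothetical process usage: half a core on average over 24 hours.
energy = estimate_energy_kwh(
    cpu_cores_used=0.5, host_cores=16,
    idle_watts=100.0, max_watts=300.0, hours=24.0,
)
print(f"{energy:.3f} kWh per day")
```

A model like this ignores memory, I/O, and idle-power attribution, so it is best treated as a lower-bound style estimate until direct measurements are available.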

That being said, happy to investigate and gather additional or different metrics.

TheFoxAtWork commented 1 year ago

There are a few items here worth considering (and indeed Falco is a different sort of cloud native project, which makes this tricky but incredibly worthwhile as a first project to explore this with). Apologies if it's a bit rambly: both points, while generally separate, are more interrelated for projects like Falco due to what they do and less how they do it, but I'd be happy to have this proven otherwise.

This could likely be accomplished by leveraging the testing infrastructure the project has in place and plans to have in place - effectively supporting the right size for their needs. Efficient tracking of the Project in an execution environment with a few types of workloads and common kernel settings would provide good visibility for a baseline. Something like a 2x2 matrix/table to record Low and High interaction workloads and two common kernel settings (evaluated for each) is a good initial start for expressing baseline. Once a baseline is established, next steps may be looking over the ruleset to identify which rulesets are most intensive and which aren't (in testing and when running), then comparing to the value they provide adopters (the latter coming from the Falco team). After which a more concrete discussion on efficient versus valuable rules could be undertaken by the Project and potentially mark rules accordingly for adopters or update the maturity framework to include an "efficient, core-value" set.
Having Falco provide transparency in its utilization for production environments is beneficial and it gives adopters a self-service option. Potential future improvements here could be Falco recommending which rules need to be tuned by the adopter because they are producing excessive noise and burning utilization above an identified threshold.
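The 2x2 baseline matrix suggested above could be recorded with something like the following sketch. The workload and kernel-setting labels and the measurement field names are placeholders (assumptions), to be replaced with whatever the Falco team settles on.

```python
from itertools import product

# Placeholder labels; the real workloads and kernel settings are still to be decided.
workloads = ["low-interaction", "high-interaction"]
kernel_settings = ["kernel-setting-A", "kernel-setting-B"]


def build_matrix(workloads, kernel_settings):
    """Return one empty result record per (workload, kernel setting) pair."""
    return [
        {
            "workload": workload,
            "kernel": kernel,
            # Measurements are filled in after each benchmark run.
            "cpu_percent": None,
            "memory_mb": None,
            "energy_kwh": None,
        }
        for workload, kernel in product(workloads, kernel_settings)
    ]


matrix = build_matrix(workloads, kernel_settings)
for cell in matrix:
    print(cell["workload"], "/", cell["kernel"])
```

Keeping each cell as a plain record makes it easy to extend the matrix later, for example with a third axis for rulesets.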

Let's look at the information available to us that doesn't detail a specific provider or deployment environment if we can (since utilization/consumption measurements are wildly different) and focus on how the project is developed (primarily test infrastructure) and how it is commonly deployed (harder with Falco).

Some things I expect to have confirmed:

Security tools are going to be computationally intensive due to the kinds of interactions they monitor and the rigor by which they are executed - anything we can do to guide adopters into more eco-conscious decisions without compromising security detections will improve the current state.
There are a limited number of ways to efficiently detect all the things adopters care about, and these will largely vary from use case to use case.

nikimanoledaki commented 1 year ago

@catblade Would there be a possibility of presenting FALCO on one of the TAG meetings, so we can learn more?

@incertum could you open a new issue using the Presentation template to do a short presentation at one of the upcoming regular meets, please? This will mainly be a discussion for TAG contributors to learn about Falco, get up to speed with the initiative discussed here, and discuss next steps.

Upcoming meets with available time include Wednesday 5th July & Wednesday 19th July. Meeting details can be found in the TAG's repo landing page. Thanks, looking forward to it! 🎉

incertum commented 1 year ago

Great, thank you! July 19th would be best.

catblade commented 1 year ago

I'll make sure to add you into the agenda this week if someone else doesn't get to it first. :-)

incertum commented 1 year ago

Updates July 19, 2023:

Here are the meeting notes https://docs.google.com/document/d/1TkmMyXJABC66NfYmivnh7z8Y_vpq9f9foaOuDVQS_Lo/edit#heading=h.5hquk4f1dn95, thanks @catblade!

Action Items on Falco side (ETA before Falco 0.36 release ~Sep 2023):

Create a test matrix, similar to Emily's suggestion https://github.com/falcosecurity/falco/issues/2435#issuecomment-1599495987
Falco project to make executive decisions on what desired benchmark test scenarios for scaling factors should look like, @catblade provided some initial pointers re possible synthetic workloads that could be of interest to us:
- https://github.com/delimitrou/DeathStarBench/blob/master/hotelReservation/README.md
- https://github.com/GoogleCloudPlatform/microservices-demo

Tracking tag-env-sustainability progress:

WG PR: https://github.com/cncf/tag-env-sustainability/pull/151/files, current ETA is approximately August or later in 2023. Outcomes will guide currently open questions around guidance for adopters / desired UX to assess the utilization impact of a tool (here Falco) on their specific environments and constraints @guidemetothemoon
Getting CNCF resources on Equinix clusters to host testbeds / benchmarks is in flight https://www.cncf.io/community-infrastructure-lab/ @nikimanoledaki
Kepler project (also eBPF powered) was suggested to measure consumption during benchmark tests on the dedicated testbed clusters https://sustainable-computing.io/design/power_estimation/#deployment-scenarios

incertum commented 8 months ago

Updates Dec 19, 2023: