cncf / tag-env-sustainability

🌳🌍♻️ TAG Environmental Sustainability
https://tag-env-sustainability.cncf.io/
Apache License 2.0

[Proposal] Proof of Environmental Sustainability activities and best practices for CNCF projects #64

Closed mkorbi closed 7 months ago

mkorbi commented 1 year ago

Description

We would implement a process for CNCF projects (and others) to demonstrate their commitment to environmental sustainability, and give them a KPI they can use to show improvements.

For this, we can leverage the Software Carbon Intensity (SCI). I think we will have to define two SCIs.

Besides the SCI, we could derive a checklist from the Green Software Patterns and track how far projects adhere to it, where applicable.

Both can be checked per release. If a project works on optimizations, the effect will become visible over time.

In the beginning, the definition and KPIs tracked per project could be stored in an "ensu.md" within their repositories. Where a project has multiple components, either each will require its own definition or all together are evaluated.
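One way to make the per-release check concrete is a small script that compares SCI values from one release to the next. This is only a sketch: the release names, the SCI numbers, and the `sci_trend` helper are made up for illustration and are not part of the proposal or of any existing tooling.

```python
from typing import List, Tuple


def sci_trend(releases: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return the percentage change in SCI for each release versus the one before it.

    `releases` is a list of (version, sci_value) pairs in release order.
    """
    changes = []
    for (_, previous), (version, current) in zip(releases, releases[1:]):
        if previous <= 0:
            raise ValueError("SCI values must be positive to compute a relative change")
        changes.append((version, (current - previous) / previous * 100.0))
    return changes


# Hypothetical values, e.g. as they might be recorded per release in an "ensu.md".
history = [("v1.0", 2.0), ("v1.1", 1.5), ("v1.2", 1.5)]
for version, change in sci_trend(history):
    print(f"{version}: {change:+.1f}% vs previous release")
```

A negative percentage means the SCI score dropped compared with the previous release.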

In the future, we can automate this further and display the SCI value per project in a Grafana dashboard. We could also add tags like the following to the projects as an indicator.

[Image: screenshot, 2023-03-09 21:53:47]

We could add this as entry criteria for CNCF projects when the approach is matured.

Impact

Timeline

I would like to postpone working on this until we have all deliverables for KubeCon EU. However, if this proposal finds support within the TAG, I also want to introduce this idea to the TOC and receive their feedback before we start planning activities.

Scope

The initial scope should focus on a pragmatic approach. Therefore, we need 3-5 projects for testing before rolling it out. Also, we should push this topic top-down, from the large graduated projects to the smaller incubating and sandbox projects.

Resources

Sealjay commented 1 year ago

Hi @mkorbi - we discussed this in our https://github.com/Green-Software-Foundation/opensource-wg/issues/75 meeting today - @srini1978 of Microsoft has been working on a way to generate CarbonQL scores automatically.

Are you interested in collaborating with us on this? FYI @jawche @seanmcilroy29 @dtoakley-tw @dtoakley

We also have a guide here: https://sci-guide.greensoftware.foundation/

mkorbi commented 1 year ago

Hey @Sealjay, we are looking forward to working with you on this and I'm very interested in @srini1978's approach. I joined the meeting yesterday, but no one (apart from 1-2 silent people) joined within the first 10 min; maybe I have the wrong invite.

About the guide: I'm well aware of it. For me, the question is what the right approach is to support OSS projects with it. That's why I came up with the idea of finding a halfway generic SCI spec to get things rolling and then, as an extension, doing custom SCIs. Not sure if it makes sense though.

Sealjay commented 1 year ago

Perfect @mkorbi! And sorry, we moved the meeting to :15 past the hour yesterday.

So this is the CarbonQL project: https://github.com/Green-Software-Foundation/carbon-ql

I'm not sure I understand the idea of a custom SCI though - is this about defining how the score was calculated and the variables used? If so, that might be related to the SCI reporting requirements: https://github.com/Green-Software-Foundation/sci-reporting/blob/main/reporting_requirements.md

Sealjay commented 1 year ago

@mkorbi it will be at :30 past the hour next week going forward; still happy to discuss.

Automated scoring project doesn't exist yet, but is being kicked off.

mkorbi commented 1 year ago

Keeping track here:

I had a meeting with @incertum and @jasondellaluce on how we can get started with this topic on Falco, as they asked at the same time for advice on how to improve energy efficiency. Post-KubeCon we will move ahead, as there will also be a new Falco release at the end of May. Until then, we agreed that we as the TAG will work out some more details.

TheFoxAtWork commented 1 year ago

Checking in here on this - specifically best practices for CNCF projects IAW this group's charter:

Capabilities, benchmarks, and processes to evaluate technological and architectural health of projects

Is the expectation that this issue's deliverable will be an initial guide to evaluating resource consumption for projects in a default configuration, so that interested projects can receive such an evaluation from this TAG? There are projects in CNCF that can do this for things running in an environment; however, there is also an outstanding need for projects to understand how they themselves are performing and where they could improve, so that this is at the forefront of decisions as they develop features and other capabilities. Even starting with common trade-offs for efficiency would be beneficial.

catblade commented 1 year ago

@TheFoxAtWork This whole space is fairly new and I don't think we are anywhere near where we want to be regarding measurement. The tooling for networking consumption, or for the cooling needed to offset CPU heat (cooling is 30-50% of datacenter costs), is not there. Most measurements currently have more to do with the power use of the CPUs, and maybe something regarding memory use. I think stating what exists today regarding measurements (which we have some of in the landscape doc) and then continuing to expand on what capabilities exist is helpful.

Let me give an example of my concerns: The SCI repo with their published example here: https://github.com/Green-Software-Foundation/sci/blob/main/case-studies/eshoppen.md

They talk about the energy consumption being measured here: https://github.com/Green-Software-Foundation/sci/blob/main/case-studies/eshoppen.md#energy-e

P[kWh] = (Pc × Number of cores + Pr + Pg × Number of GPUs) / 1000

where Pc is the power consumed by the CPU, Pr the power consumed by memory, and Pg the power consumed by the GPU.

This slice of energy measurement, for instance, misses the networking power consumption (and "those switches can be indistinguishable from blast furnaces that happen to route packets"-not my words) and heating requirements (for everything involved-sometimes those labs sound like jet engines because of the power required to cool). If we dig into even more optimal ways we can save energy, I am also unaware of anything that measures the amount of energy required by things like crossing the UPI bus in a multi-socket system in the case that the CPU/memory/GPUs are not co-located in the same NUMA node. I'm sure there are other pieces I'm missing, like packet-processing core consumption (something that may be done as part of auxiliary functionality on the board), length of time to run a process according to its efficiency, et cetera. Additionally, I suspect that not only does the CPU Utilization not scale linearly with power consumption, but also the heat generated does not.
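To make the scope of the quoted formula explicit, here is a minimal Python sketch of it. The function name, the parameter names, and the `hours` factor are assumptions added for illustration (the case study only gives the formula itself), and the example wattages are invented. Note that the function has no input for networking or cooling, which is exactly the gap described above.

```python
def energy_kwh(
    cpu_watts_per_core: float,
    cores: int,
    memory_watts: float,
    gpu_watts_per_gpu: float = 0.0,
    gpus: int = 0,
    hours: float = 1.0,
) -> float:
    """Energy in kWh from CPU, memory and GPU power draw only.

    Mirrors the shape of the case-study formula:
    (Pc * cores + Pr + Pg * gpus) / 1000, scaled by the run time in hours
    (the `hours` factor is an assumption of this sketch).
    """
    total_watts = cpu_watts_per_core * cores + memory_watts + gpu_watts_per_gpu * gpus
    return total_watts * hours / 1000.0


# Invented example: 4 cores at 10 W each, 5 W for memory, one GPU at 55 W, for one hour.
print(energy_kwh(10.0, 4, 5.0, 55.0, 1))
```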

And none of this addresses time-to-failure, as discussed in this paper (HPC has done a lot of work on saving power in massively distributed systems): https://www.osti.gov/servlets/purl/1140455

(things that keep me up at night)

...

Which is a very long way of saying I worry and this is not an easy space.

TheFoxAtWork commented 1 year ago

Understood. Recommend narrowing the focus to areas where we've got something we can begin with - CPU & GPU utilization, cores, & memory. How do our projects today measure up across those categories? Are there specific operations, configurations, and functions that increase or decrease consumption in those categories? What about specific functionality that could be smartly reconsidered to reduce it?

Take security and event logging, for instance: is it more efficient to do on-host processing or to send logs off host to a central service to aggregate, analyze, process, and display? Can we educate adopters on reasonable expectations for logging to reduce consumption? Logging touches these "hot spots" in sustainable computing: processes, storage, networking, detection, etc. These hot spots have other considerations for reducing their footprint. Do you really need to "log all the things"? It comes down to balancing why it's needed, what it conveys, and whether other observations convey the same value for less consumption. Can logging be limited until indicators of an issue occur, which in turn trigger on-demand expanded logging?
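As one possible shape for the "limited until an issue occurs" idea, here is a minimal Python sketch using only the standard `logging` module. It keeps the last few low-severity records in memory and writes them out only when an error arrives. The `OnDemandHandler` class, its parameters, and the demo messages are hypothetical and not taken from any CNCF project.

```python
import collections
import logging


class OnDemandHandler(logging.Handler):
    """Hold recent records in memory; forward them only when an error occurs."""

    def __init__(self, target: logging.Handler, capacity: int = 50,
                 trigger_level: int = logging.ERROR) -> None:
        super().__init__(level=logging.DEBUG)
        self.target = target
        self.trigger_level = trigger_level
        # Bounded buffer: older records are dropped without ever being written.
        self.recent = collections.deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno >= self.trigger_level:
            # An issue occurred: write the buffered context, then the trigger itself.
            while self.recent:
                self.target.handle(self.recent.popleft())
            self.target.handle(record)
        else:
            self.recent.append(record)


if __name__ == "__main__":
    stream = logging.StreamHandler()
    stream.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    log = logging.getLogger("demo")
    log.setLevel(logging.DEBUG)
    log.propagate = False
    log.addHandler(OnDemandHandler(stream, capacity=3))
    for i in range(10):
        log.debug("routine event %d", i)  # buffered, not written on its own
    log.error("something went wrong")     # writes events 7, 8, 9 and the error
```

The trade-off is that context older than the buffer is lost, so the capacity has to be chosen per workload.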

catblade commented 1 year ago

@TheFoxAtWork You are absolutely correct. We have a tendency to measure everything and goodness gracious do we love our dashboards.

I've been advocating that we partner with something like the GSF and its SCI on "what" to measure and, generally, how to get those metrics, and then as part of the CNCF TAG work here do more of the "how" and "minimal resource consumption" part.

I would like more scientists/industrial engineers involved. Part of my general concern is that CPUs are such a small part of the total power consumption while cooling is a larger part of that total usage. We may be optimizing for the components that are much smaller in impact over other factors.

We also have to be aware that some customers will not find those measurement methods acceptable, depending on how the chips work and on whether the kernel scheduler causes an interrupt when it looks at current core usage (think traffic that cares about kernel interrupts, like most things with quick packet processing). I know that K8s does not allow for core assignment like that, but there are many workarounds in use in industry (see CMK, for instance).

Sealjay commented 1 year ago

@TheFoxAtWork This whole space is fairly new and I don't think we are anywhere near where we want to be regarding measurement. [...

Let me give an example of my concerns: [...]

This slice of energy measurement, for instance, misses the networking power consumption (and "those switches can be indistinguishable from blast furnaces that happen to route packets"-not my words) and heating requirements (for everything involved-sometimes those labs sound like jet engines because of the power required to cool). If we dig into even more optimal ways we can save energy, [...]

I'd agree on networking - note that the SCI is a living document, so please do propose PRs or issues to include networking considerations further.

It's part of a bigger landscape for us - so we include networking in our training patterns ( https://patterns.greensoftware.foundation/catalog/cloud/reduce-transmitted-data ) and it's in discussions for some of the other measurement tooling like carbonQL.

catblade commented 1 year ago

@Sealjay do you think we could get the GSF to give a presentation at our next meeting, on the 7th of June, on SCI and current efforts shaped around that?

Sealjay commented 1 year ago

Sure, I'll drop Abhishek and Henry an email (they are the chairs of the Standards WG.)

catblade commented 1 year ago

Contact me via Slack (or @mkorbi or @leonardpahlke or @caradelia) and we can make sure it is reflected on the agenda ahead of time.

leonardpahlke commented 1 year ago

Contact me via Slack (or @mkorbi or @leonardpahlke or @caradelia) and we can make sure it is reflected on the agenda ahead of time.

Please do so by dropping a message to the #tag-env-sustainability channel - so others are aware. Looking forward to it 🙌

Sealjay commented 1 year ago

I'm on a mobile without slack at the moment, but just dropped @catblade and @mkorbi an email - @Henry-WattTime is happy to support.

nikimanoledaki commented 1 year ago

There is an ongoing discussion among GSF folks about how to measure the energy consumption of networking, as raised by @catblade: https://github.com/Green-Software-Foundation/sci-guide/issues/13

That repo has a bunch of other open issues around this. It looks like the right place to start a similar investigation on how to quantify the energy consumed by cooling. There is a reference to cooling here, which points to: https://devblogs.microsoft.com/sustainable-software/how-to-measure-the-power-consumption-of-your-backend-service/, which uses the Thermal Design Power (TDP) as a reference.

Echoing what @catblade said on the GSF leading "what" to measure while the TAG takes care of "how" to do that in a cloud-native context.

There are open-ended questions about handling known unknowns, mapping these, and incorporating new/evolving data points. However, it would be great to start collecting data on what we can already measure (CPU, GPU, memory), as @TheFoxAtWork said.

SCI = (E × I) + M per R

(E) - Energy consumption
(I) - Emissions factors
(M) - Embodied emissions data for servers
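As a rough illustration of how the terms combine, here is a minimal Python sketch of the formula. It reads the formula as ((E × I) + M) divided by R, with R as the functional unit from the SCI specification (for example, requests served). The function name, the units, and the example numbers are assumptions for illustration, not measured data.

```python
def sci(energy_kwh: float, intensity_g_per_kwh: float,
        embodied_g: float, functional_units: float) -> float:
    """Software Carbon Intensity in gCO2e per functional unit.

    energy_kwh           E: energy consumed by the software
    intensity_g_per_kwh  I: emissions factor of the electricity used
    embodied_g           M: embodied emissions attributed to the software
    functional_units     R: e.g. number of requests, users or jobs
    """
    if functional_units <= 0:
        raise ValueError("functional_units (R) must be positive")
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units


# Invented example: 0.1 kWh at 400 gCO2e/kWh, 10 g embodied, 1000 requests served.
print(sci(0.1, 400.0, 10.0, 1000.0))
```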

While we have some of the Energy component, it is equally challenging to collect data in a consistent way for carbon emissions (I) and embodied carbon emissions (M) in the context of the cloud. There are a few different methods for doing this. We may want to start by crowdsourcing and documenting these different methodologies. @mkorbi I remember you mentioned something around this during the project meetup at KubeCon EU?

Lastly, the Tools & Practices / GreenOps / etc WG has gathered some momentum and a group of folks who would like to contribute to this kind of technical work, but it lacks focus so we are trying to narrow down the scope. In the last meeting, we decided to shift focus to support this initiative. How about we merge this project and the WG effort?

leonardpahlke commented 1 year ago

FYI: https://github.com/falcosecurity/falco/issues/2435#issuecomment-1598567599

TheFoxAtWork commented 1 year ago

I just reviewed the proposal - this is much more narrowly scoped and looks great. I left a series of comments, mostly focusing on further refinement (item 4 carries a significant level of effort and requires a lot of front-loading to be successful).

leonardpahlke commented 1 year ago

Updated the issue description to mention the WG proposal and the current working document to collaborate on the WG charter.

mkorbi commented 1 year ago

Service Desk ticket opened for an account and hardware

mkorbi commented 1 year ago

I am currently running some tests and piecing things together, which is slowly leading to the following picture of what we can provide.

[Image: TAG-workflow.drawio]

In that case, we would have to provide:

/cc @nikimanoledaki @guidemetothemoon

immavalls commented 1 year ago

@mkorbi happy to chime in if help is needed with k6.io, as this is my squad at Grafana. For projects on k8s, maybe the k6-operator can work well, as it does not require a test server. On GitHub, have you tried using a Docker image or a k6 GitHub Action?

leonardpahlke commented 7 months ago

I will close this issue since we started the WG Green Reviews last summer which addresses this issue.