[Tracking/Action] Repair: how broken Kubernetes workloads lead to higher emissions

xamebax commented 6 months ago

(ticket is part of sustainable k8s practices project work)

Description

What is the carbon cost of leaving broken workloads to run on Kubernetes? What is the untapped potential of making sure workloads repair themselves better, or that broken workloads aren't allowed to run for a long time? Is there a good "Kubernetes hygiene" around repairing workloads that can lead to lowering a cluster's carbon cost?

Outcome

A recommendation in our working document that helps the reader make a choice on how to repair their workloads, with an effort estimation (small, medium, large). Optional extra reading material with extra context if the reader's interested.

To-Do

[x] add relevant labels to this issue when possible,
[ ] research if this is a worthy recommendation,
[ ] if yes, write a recommendation,
[ ] share it for review, implement feedback.

Comments

Only public cloud is in scope here.
I'm gonna work on writing this recommendation. 🙂

@mkorbi I'd love your input on this issue description, do you feel this captures the fullness of what we talked about?

(cc @JacobValdemar)

xamebax commented 4 months ago

Just an update that I did start working on this and should hopefully have a draft by the end of the week.

mkorbi commented 4 months ago

It's important to help the reader identify broken workloads, and we have to differentiate here. There is sprawl: workloads that got "lost" and that no one takes care of. And there are idle workloads that "misbehave".

I think there is a fairly easy approach for both: compare the network traffic with the resource consumption.

No traffic but continuously "high" consumption means something is wrong.
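To make the traffic-vs-consumption comparison concrete, here is a minimal Python sketch. It is an illustration only, not something agreed in this issue: the `Sample` type, the `looks_broken` helper, and all threshold values are hypothetical, and the samples are assumed to come from whatever metrics source the cluster already has.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation window for a workload (hypothetical shape)."""
    requests_per_second: float
    cpu_cores_used: float


def looks_broken(
    samples: Sequence[Sample],
    max_idle_rps: float = 0.1,   # assumed: at or below this counts as "no traffic"
    min_busy_cpu: float = 0.5,   # assumed: at or above this counts as "high" consumption
    min_consecutive: int = 3,    # assumed: windows in a row before flagging
) -> bool:
    """Return True if the workload shows no traffic but continuously high CPU."""
    streak = 0
    for sample in samples:
        idle = sample.requests_per_second <= max_idle_rps
        busy = sample.cpu_cores_used >= min_busy_cpu
        if idle and busy:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            # Any window with traffic or low usage resets the streak.
            streak = 0
    return False
```

Requiring several consecutive windows is a design choice to avoid flagging short spikes such as garbage collection or a batch job; the right values depend on the workload.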

There are also other cases, for example old programming languages or misconfiguration, where the workload demands too many resources.

xamebax commented 3 months ago

@mkorbi thanks for extra context. So there are two repair paths:

the ones internal to Kubernetes:
- can we make any gains by setting a specific restartPolicy?
- liveness and readiness checks: we can make sure the liveness checks we define for Pods are actually measuring what we think they are
- something else?
the ones where liveness/readiness checks are not enough and we should instead use external metrics that can alert us about broken workloads:
- use request rate and latency metrics exposed by the ingress (for example nginx) together with resource usage metrics to hint at unhealthy ratios
- other use cases:
  - running outdated or misconfigured software
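As a rough illustration of the "unhealthy ratios" idea, the Python sketch below compares CPU cost per request and latency against a known-good baseline. Everything in it is an assumption for the sake of example: the `Window` type, the function names, and the tolerance factor are hypothetical, and the numbers would come from ingress and resource usage metrics in practice.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Aggregated metrics for one time window (hypothetical shape)."""
    requests_per_second: float  # e.g. from ingress metrics
    p95_latency_seconds: float  # e.g. from ingress metrics
    cpu_cores_used: float       # e.g. from resource usage metrics


def cpu_seconds_per_request(window: Window) -> float:
    """CPU-seconds spent per served request in this window."""
    if window.requests_per_second <= 0:
        # No traffic: any CPU use has unbounded cost per request.
        return math.inf if window.cpu_cores_used > 0 else 0.0
    return window.cpu_cores_used / window.requests_per_second


def is_unhealthy(current: Window, baseline: Window, tolerance: float = 3.0) -> bool:
    """Flag the window if cost per request or latency is far above the baseline.

    `tolerance` is an assumed multiplier; tune it per workload.
    """
    cost_too_high = (
        cpu_seconds_per_request(current)
        > cpu_seconds_per_request(baseline) * tolerance
    )
    latency_too_high = (
        current.p95_latency_seconds > baseline.p95_latency_seconds * tolerance
    )
    return cost_too_high or latency_too_high
```

Comparing against a per-workload baseline rather than a fixed threshold keeps the check usable across very different workloads, at the cost of having to record a baseline first.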

I would argue that choices around programming language / outdated software are outside the scope of this project in its first run, since if I remember correctly we agreed not to cross the Pod barrier.

Does this make sense?

xamebax commented 3 months ago

I wrote a high-level introduction to this in the working document. It's difficult for me to gauge if the level of detail is ok. The next step will be describing examples.

The challenge for me here is that it's quite difficult to be specific about carbon cost since workloads can be very different, so I am hoping to provide a few high-level examples.

This is very, very much a work in progress so all feedback is more than welcome, good or bad. 🙂

leonardpahlke commented 3 months ago

[...] Does this make sense?

&

[...] This is very, very much a work in progress so all feedback is more than welcome, good or bad. 🙂

cc @mkorbi @saiyam1814

saiyam1814 commented 3 months ago

I will have a look

cncf / tag-env-sustainability