[Action] Repeat: do all workloads need to run constantly?

xamebax commented 4 months ago

(ticket is part of sustainable k8s practices project work)

Description

We want to explore the question of carbon benefits related to running Kubernetes workloads as Jobs instead of Pods. What architectural decisions need to be made, and what are potential savings?

Outcome

A recommendation in our working document that helps the reader make a choice on how to run their workloads, with an effort estimation (small, medium, large). Optional extra reading material with extra context if the reader's interested.

To-Do

[x] add relevant labels to this issue when possible,
[x] research if this is a worthy recommendation,
[x] if yes, write a recommendation,
[ ] share it for review, implement feedback.

Comments

Note: Only public cloud is in scope here.

(cc @JacobValdemar)

graz-dev commented 4 months ago

Hello @xamebax @leonardpahlke, I'd be happy to take on this issue. If I understand correctly, the outcome should be a set of guidelines on managing workloads sustainably on Kubernetes, helping the reader make informed decisions about how active their workloads should be. Is that correct? Can I start drafting something in the document?

xamebax commented 4 months ago

Hey @graz-dev, it's great to hear that you want to contribute to writing guidelines for running Kubernetes in a sustainable way! ✨♻️

Yes, you understand that correctly: we know we can run workloads on Kubernetes constantly (as Pods) or temporarily (as Jobs). Not all Pods actually have to run all of the time, and the time they spend being idle is wasting resources. We want to help Kubernetes users understand:

When is it best to switch to a Job? What kind of workloads would fit this best?
What kind of environmental benefits (in the form of resources like CPU and RAM saved, for example) can we estimate when we make such a switch?
How do we decide if it's worth it?
How big of an effort would it be?
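As a rough illustration of the second question, the resources reserved by an always-on Pod can be compared with those reserved by a Job that runs only when there is work. This is a minimal sketch, not a measurement method: the `Workload` type, the function names, and all figures are hypothetical, and it counts only requested CPU and memory over time, not energy or carbon.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    cpu_cores: float   # CPU request per Pod, in cores (hypothetical figure)
    memory_gib: float  # memory request per Pod, in GiB (hypothetical figure)


def reserved_per_day(workload, hours_running):
    """Return (core-hours, GiB-hours) reserved per day for the given runtime."""
    return workload.cpu_cores * hours_running, workload.memory_gib * hours_running


def daily_savings(workload, runs_per_day, minutes_per_run):
    """Compare an always-on Pod with a Job that only runs when there is work."""
    always_on = reserved_per_day(workload, 24.0)
    as_job = reserved_per_day(workload, runs_per_day * minutes_per_run / 60.0)
    return always_on[0] - as_job[0], always_on[1] - as_job[1]


# Hypothetical example: 0.5 cores / 1 GiB, four 15-minute runs per day.
example = Workload(cpu_cores=0.5, memory_gib=1.0)
cpu_saved, mem_saved = daily_savings(example, runs_per_day=4, minutes_per_run=15)
print(f"CPU core-hours saved per day: {cpu_saved:.2f}")
print(f"Memory GiB-hours saved per day: {mem_saved:.2f}")
```

Whether freed requests translate into real savings depends on the cluster being able to reuse or release the capacity, which this sketch assumes and does not model.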

@mkorbi I think it was an idea you mentioned originally. Is there anything you can add here? @JacobValdemar what do you think?

graz-dev commented 4 months ago

@xamebax thank you for your answer! ✨ What I have to do is quite clear, and I think I can start jotting down a draft in the document in the next few days. Do we have a deadline for this task?

Talking about idle workloads, I think it is worth mentioning the possibility of turning off (scaling to 0) workloads that are known to be unused (such as demo or test environments on weekends or during the night) and then starting them again on a schedule using a simple operator such as kube-green.
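To give a feel for how much time such a schedule frees up, here is a small sketch of the arithmetic. The function name and the example schedule are hypothetical, and it does not use kube-green's configuration format or API; it only counts the hours in a week during which an environment could stay scaled to zero.

```python
def weekly_idle_hours(active_days, wake_hour, sleep_hour):
    """Hours per week a workload can stay scaled to zero.

    active_days: number of days per week the environment is needed
    wake_hour / sleep_hour: daily start and stop, in whole hours (0-24)
    """
    if not (0 <= active_days <= 7 and 0 <= wake_hour <= sleep_hour <= 24):
        raise ValueError("invalid schedule")
    hours_in_week = 7 * 24
    active_hours = active_days * (sleep_hour - wake_hour)
    return hours_in_week - active_hours


# Hypothetical test environment: needed Monday to Friday, 08:00 to 20:00.
idle = weekly_idle_hours(active_days=5, wake_hour=8, sleep_hour=20)
print(f"{idle} idle hours per week ({idle / 168:.0%} of the time)")
```

With this assumed schedule the environment is needed for 60 of the 168 hours in a week, so it could be off for the remaining 108.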

kube-green is already a CNCF-listed project and I think it is in scope for this task, so if you agree I would like to mention some example use cases in the paper and how to get the most from this tool.

Thank you guys! 🌱

xamebax commented 4 months ago

@graz-dev Fantastic 🙌🏼 There is no deadline. :) You can spend as much time as you want/can on this. Once you feel you have something you want to share, mention it here and other people can read it and offer feedback.

It's a good idea to mention kube-green, thank you for dropping it! I think it's worth looking into, I'm just not sure if it fits into the scope of this particular recommendation (repeated vs continuously running workloads).

What we have tried to do is to break down the guidelines into actionable, small chunks (write one guideline at a time), so that they are easier to create. In the working document's Ideas scratchpad we listed some ideas we thought would be worth looking into. Kube-green sounds like a good candidate to talk about in the context of workload rightsizing. Creating a ticket about writing that particular recommendation is on my to-do list, so it should appear soon. :)

akyriako commented 4 months ago

Hi folks, I can't wait to read the final recommendations of this task! If it would be of any help, you could have a look at the project rekuberate-io/sleepcycles, which is similar to kube-green but covers a broader range of Kubernetes resources: Deployments, CronJobs, StatefulSets and HorizontalPodAutoscalers. (I want to state that I am the maintainer of it, so there are no misunderstandings.)

I find the work that is taking place in this tag amazing and I would love to be involved or even just be helpful.

Cheers!

graz-dev commented 4 months ago

Thank you @xamebax! Just to be sure, can I jot down a draft in this document?

I'll start working on it today!🚀

xamebax commented 4 months ago

@graz-dev awesome! 🙌 Yes, that is the document. 😄 ✨

@akyriako Hello! Thank you so much for your comment and your recommendation of sleepcycles! 😴 It's so nice that you want to be involved in crafting best practices for sustainable Kubernetes ♻ 😄 In #347 you'll find details on the project (the scope, the goals), and links to a working document where we gather ideas to look into and the recommendations themselves. If any idea sounds interesting to you and you'd like to work on it, do give us a shout out and we can help you get started!

I created https://github.com/cncf/tag-env-sustainability/issues/392 to focus on workload rightsizing and mentioned both kube-green and sleepcycles as possible solutions. 🙂 Thank you both, @graz-dev and @akyriako for your suggestions on this topic. 🙂

akyriako commented 4 months ago

Hi @xamebax, thanks for the orientation tips and for including sleepcycles in your assessment. I will definitely join one of the upcoming Monday meetings in person.

Wish y'all a nice start of the week.

graz-dev commented 4 months ago

Hi, just a quick update about this issue. I'm currently working on it offline (this means that the first draft currently lives in my notes). I think I will be able to upload the first draft between tomorrow and Wednesday.

I will update the issue when the first draft is ready in the working document. In the meantime, should I update any state in the project for this issue?

Thank u.

graz-dev commented 4 months ago

Hello @xamebax @mkorbi @JacobValdemar! I've sketched the first draft in the document 🚀 I'm not completely sure about the outcome of this draft; I tried to follow the outline of the points, but I'm not sure how much detail was intended. In any case, please leave any suggestions or comments, and I will take care of resolving them.

However, since the request was for each block to be self-contained, I thought of adding a very simple section on the general management of workloads in K8s that could be extended to the entire "Refuse, reduce, resize, reschedule, repeat, repair" block.

Let me know what you think. Also, if possible, I'd like to understand which project this issue fits into, what the final outcome will be, and how I can contribute to this project and the WG in general.

Thank you! 🌱

JacobValdemar commented 3 months ago

Thank you for your contribution @graz-dev! I just read the beginning and it looks so good 🤩

You asked some questions about how/where you can contribute to the project. I'll try to answer that. Please let me know if the information is too "basic" or if you were seeking something else (I would rather give too much information than too little). This GitHub Issue is an action item in the Best practices for environmentally sustainable Kubernetes clusters project which is described in https://github.com/cncf/tag-env-sustainability/issues/347. Formally, the project is organized within the Cloud Native Computing Foundation's (CNCF) Technical Advisory Group (TAG) for Environmental Sustainability (CNCF TAG ENV for short). You can read more about TAG ENV on our website and in this GitHub repository.

The project has a regular meeting every other week which you can find in the TAG ENV calendar. We would love to see you at our next meeting on Monday, 6th May, at 13:00 CEST if you have time and are interested (meeting link in the TAG ENV calendar). You can find our meeting notes and agenda in this Google Docs document. If online meetings are not for you, then you can also communicate with us in our Slack channel: #tag-env-k8s-best-practices in the CNCF Slack workspace. TAG ENV also has a Slack channel: #tag-environmental-sustainability.

When it comes to tasks that you can do, we have a GitHub Project that you can check out to get an overview. It doesn't contain everything, as we are just getting started. Some other items you can work on are in the "Ideas Scratchpad" section of the Working Document that you added your work to. And finally, if there is anything that you believe the project should include, which we haven't considered, then we would love to hear it!

graz-dev commented 3 months ago

Hi @JacobValdemar, thank you for your answer! I'd be happy to participate in future meetings to get to know each other better and clarify any points verbally as well. If I can, I'll look into the other issues tonight and maybe start working on the rightsizing one (https://github.com/cncf/tag-env-sustainability/issues/392) that originated from a proposal here. Meanwhile, I wanted to ask if you already have a clear idea of what the deliverable for this project would be: a white paper? A blog post? Or something else? Thanks again!

JacobValdemar commented 3 months ago

@graz-dev the goals and deliverables for this project are described in the Project Tracking issue (#347):

Goals and Non Goals

Goals

Enabling Kubernetes administrators to run more sustainable clusters and identify the best actions to lower their clusters' carbon intensity / carbon footprint.

Summarize and vet available information. CNCF is a trustworthy source of information in the cloud-native landscape.

Non Goals

We do not have an ambition for a complete, exhaustive list of all actions, as we will try to focus on the highest impact, lowest effort actions (where possible).

Deliverables

a guide / actionable list of best practices,

extra material: background information.

We want to provide actions and optional extra context for operators with extra capacity. We are coming from an understanding of tight time budgets that can be spent on sustainability efforts.

As noted in the meeting notes, we are considering if we should also publish the result as a whitepaper, but we have not made a decision about that yet ☺️

graz-dev commented 3 months ago

Hi guys, for me the content for this issue is completely submitted. Let me know if you have any suggestions or improvements! 🚀

JacobValdemar commented 3 months ago

@xamebax Can you review the content? Maybe there is something you would like to add or change?

xamebax commented 3 months ago

@JacobValdemar @graz-dev I am going to read everything this Tuesday (tomorrow). Thank you for your patience! 💚 🎉

graz-dev commented 3 months ago

Don't worry @xamebax! Let me know if the document needs some changes or if some parts need more detail.

Thank u!🚀

xamebax commented 3 months ago

I read the submission @graz-dev! 🎉 🙌🏼 I think that it's written in clear language and feels quite accessible. I like that you add links to Kubernetes documentation, and that the whole text is self-contained. This is the first recommendation that's actually written, so there was no previous text that could help with gauging length, level of detail, or tone of voice.

As I was reading, I made some minor suggestions to punctuation and grammar - if you agree with them, you can accept them in the document. I also added a comment about putting numbers on resources potentially saved in the working doc.

I like that you gave a practical example around running synchronous/asynchronous jobs.

I'm thinking about gauging the effort one would have to put in to evaluate and implement this recommendation: how much work would it be to switch from a long-running Pod to a periodic CronJob? In short, changing from a Deployment to a Job or CronJob will often require changes to the application itself. The amount of work there will depend on a number of factors and thus might be hard to have an opinion on without knowing any details. What do y'all think? Is this something worth drilling into, or is it too abstract to estimate?
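For context on the manifest side of that effort, here is a minimal sketch of what a CronJob definition looks like, built as a plain Python dict and printed as JSON. The workload name, image, and schedule are hypothetical placeholders. The manifest itself is small; the harder part, which this sketch does not cover, is making the application do its work and then exit instead of running as a long-lived process.

```python
import json


def cronjob_manifest(name, image, schedule):
    """Build a minimal batch/v1 CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron syntax
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            # Job Pods must use OnFailure or Never; the
                            # Deployment default (Always) is not allowed here.
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# Hypothetical workload that runs at the start of every hour.
manifest = cronjob_manifest(
    "report-generator", "example.com/report-generator:1.0", "0 * * * *"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied like any other manifest, assuming the image really does terminate once its task is finished.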

@mkorbi Would you have time to read this recommendation? I think your feedback would be appreciated. 🙂

And a side note: reading this, I think we might want to consider adding visual aids in the end. I'll make a note to revisit this when we're closer to the finish line.

graz-dev commented 3 months ago

Hi @xamebax, thank you so much for your comments. I'll review the document tomorrow.

Regarding the insights on the necessary effort to transition from a Deployment to a Job, I think estimating it might be a bit challenging as it heavily depends on the specific case. I can attempt to hypothesize an estimate based on the example I provided in the document, but I believe the key point here is to emphasize that transitioning between workload types is not an effortless task and should be carefully evaluated. While this should already be stated in the document, perhaps the point isn't coming across clearly, so I could emphasize it further.

I'll also try adding some simple diagrams (using excalidraw) to visually support the main concepts of each paragraph. What do you think?

Anyway, I believe that the introduction, briefly outlining workload management in Kubernetes, could serve as the introduction to the entire artifact. From there, certain concepts can be taken for granted, eliminating the need to repeat them. At this point, I'm inclined (but it's just a personal idea) to consider the idea of a whitepaper.

Thank u all! ✌🏻
