aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Adding an annotation for do-not-disrupt during a given time duration #6959

Open venkatest opened 1 week ago

venkatest commented 1 week ago

Description

What problem are you trying to solve?

We have a few critical applications running on both spot and reserved instances. We use the annotation karpenter.sh/do-not-disrupt: "true" to prevent pod restarts caused by upgrades that our central team performs.

Every time there is an upgrade that requires a node restart, we are asked to remove the do-not-disrupt annotation. We do a deployment just for this, which makes some stakeholders unhappy.

How important is this feature to you?

It would be very nice to have an annotation that specifies that do-not-disrupt applies only during desired working hours, for example:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
        karpenter.sh/do-not-disrupt-during: "08:00 - 18:00"

I am not sure how feasible this is. If you have any other workaround or suggestion, please share it.

Please note that we have already asked for upgrades to be done outside working hours, which is not possible. Any other suggestions or ideas are welcome.

Thanks in Advance.

jonathan-innis commented 1 week ago

> Would be very nice if we have an annotation where we give that the do-not-disrupt is applicable during desired working hours. say for example

It sounds like disabling disruption using budgets during specific times isn't enough here? Is that because these pods are scheduled alongside other applications on a NodePool and you don't want to block the disruption of other applications?
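For reference, a time-based disruption budget of the kind mentioned above might look like the sketch below. It assumes the karpenter.sh/v1 NodePool API; the NodePool name and the chosen hours are hypothetical, and the rest of the NodePool spec (node template, requirements) is omitted. Budget schedules are cron expressions evaluated in UTC, so the hours would need to be shifted for a local time zone.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical            # hypothetical name
spec:
  # template: ...           # node template omitted in this sketch
  disruption:
    budgets:
      # Allow zero voluntary node disruptions for 10 hours,
      # starting at 08:00 UTC on weekdays (08:00 to 18:00).
      - nodes: "0"
        schedule: "0 8 * * mon-fri"
        duration: 10h
```

A budget like this applies to every node in the NodePool, which is why it only fits when the critical pods do not share the pool with workloads that should remain disruptable.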

venkatest commented 1 week ago

Thanks Jonathan for your reply.

In the case of reserved instances: we can create a node pool of reserved instances, ask teams with critical applications to use only that node pool, and disable disruption using budgets during specific times. That would solve it, and we will start using it.
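As a rough sketch of the workload side of that setup, a critical application could be pinned to the dedicated pool through the karpenter.sh/nodepool node label. The Deployment name and the pool name "critical" are assumptions for illustration, and the selector and container fields are left out.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app        # hypothetical name
spec:
  # selector and containers omitted in this sketch
  template:
    spec:
      nodeSelector:
        # Schedule only onto nodes launched by the "critical" NodePool
        # (pool name is an assumption).
        karpenter.sh/nodepool: critical
```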

In the case of spot instances: here it is important that only the nodes running pods with do-not-disrupt are not restarted, while the other nodes in the pool can be. Our central team would like this because they want quick feedback after an upgrade that things are fine for teams (at least for non-critical apps). For critical apps, they can wait until the next day to know whether things are fine. If we find any issues, they can fix them before they impact critical apps. One of my colleagues also suggested that, instead of giving working hours, we could specify maintenance hours, e.g. karpenter.sh/maintenance: "18:00 - 20:00", or use a cron expression.

Frettarix commented 1 week ago

Hi all,

I'm part of the same team as venkatest. It might help to provide some context to explain the use case behind our request a bit better. The Kubernetes platform we use is provided by a central team so that our teams don't have to worry about the infra and can focus on their applications and deployments. With a centralized approach like this we try to share resources in order to reduce cost and CO2 footprint.

Teams themselves do not have access to nodepools (unless they own private nodepools, which would decrease efficiency), and only access their namespaces and the resources within. The applications they deploy may have different maintenance windows depending on the hours that they service customers. We advise teams to run their applications highly available (as is best practice).

The centralized team also does updates on the infra, and they do not want to be blocked by the teams' maintenance windows, because manual interaction would prevent this centralized solution from scaling. They want to run updates, while the teams want to be able to block them when they don't want any interruptions. Therefore it would be good if teams could 'unblock' these updates during their official maintenance hours. This approach lets us adhere to high availability best practice while also not pushing updates during rush hours.

TLDR: Teams own applications and namespaces, not the underlying nodepools. The access they have is on their deployments. Their applications have maintenance windows and they would like to only allow updates during the maintenance windows of their application. We would like to do this on an application basis.