Spot interrupt taint/label/annotation on node

aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

https://karpenter.sh

Apache License 2.0


Spot interrupt taint/label/annotation on node #6103

Open stijndehaes opened 4 months ago

stijndehaes commented 4 months ago

Description

What problem are you trying to solve?

When a node is being shut down because of a spot interruption, I want to be able to detect that from my pod. That way we can report the correct reason a pod was shut down. Currently we use AWS Node Termination Handler, which adds different taints depending on why the node is being shut down. I would love to switch to Karpenter for handling spot interruptions, but the lack of this feature is blocking us.

How important is this feature to you?

This feature is very important: providing this visibility to users is key for the platform we are building.

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

stijndehaes commented 4 months ago

I am willing to work on this myself, as I have experience writing Go and Kubernetes operators. Extra support may need to be added to upstream Karpenter, but I am not sure what the best architecture would be.

engedaam commented 4 months ago

Would it be enough for Karpenter to fire metrics on the nodes that were interrupted?

stijndehaes commented 4 months ago

> Would it be enough for Karpenter to fire metrics on the nodes that were interrupted?

Sadly, for our use case it wouldn't. What we currently do is, when a pod is being shut down, check the node to see whether a spot interruption is in progress. If Karpenter only fires metrics, there is no easy way to query this interactively. Currently we write to the pod's log whether there is a spot interruption. With metrics we would need another way to visualise it.
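
For readers following along, the node check described above could look roughly like the Go sketch below. It is only an illustration under assumptions: `Taint` is a local stand-in for the Kubernetes taint type so the example has no dependencies, `shutdownReason` is a hypothetical helper, and the taint key is assumed to be the one the termination handler applies for spot interruption notices, so verify it against your own deployment.

```go
package main

import "fmt"

// Taint mirrors the key/value/effect fields of a Kubernetes node taint.
// It is a local stand-in so this sketch needs no external packages.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// spotInterruptTaintKey is ASSUMED to be the key applied on a spot
// interruption notice; check the taints on your own nodes before relying on it.
const spotInterruptTaintKey = "aws-node-termination-handler/spot-itn"

// shutdownReason (hypothetical helper) reports why a node is going away,
// based only on the taints currently set on it.
func shutdownReason(taints []Taint) string {
	for _, t := range taints {
		if t.Key == spotInterruptTaintKey {
			return "spot interruption"
		}
	}
	return "unknown"
}

func main() {
	// Example taints as they might appear on a node during shutdown.
	taints := []Taint{
		{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"},
		{Key: spotInterruptTaintKey, Effect: "NoSchedule"},
	}
	fmt.Println("shutdown reason:", shutdownReason(taints))
}
```

In a real pod the taints would come from reading the Node object through the Kubernetes API rather than from a hard-coded slice.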

jonathan-innis commented 4 months ago

> What we currently do is when a pod is being shut down we look at the node if there is a spot interrupt going on

What about Kubernetes events? We also fire an event here alongside the metric. I'm skeptical of wanting to change our tainting logic to support an observability use-case. What if we added a condition to the NodeClaim? Would this be enough to satisfy the observability use-case?

stijndehaes commented 4 months ago

> What about Kubernetes events? We also fire an event here alongside the metric. I'm skeptical of wanting to change our tainting logic to support an observability use-case. What if we added a condition to the NodeClaim? Would this be enough to satisfy the observability use-case?

I didn't notice there are Kubernetes events about disruption; I could use that! A condition on the NodeClaim would be better, but I will see how far I can get with the events to start with.

Closed the PR for now; I can always open a new one for the NodeClaim condition. I will look at that later this week and make a proposal here :)

stijndehaes commented 4 months ago

@jonathan-innis what do you think?

The new condition could look like this:

conditions:
- lastTransitionTime: "2024-05-10T00:05:07Z"
  status: "True"
  type: Interrupted
  reason: "SpotInterrupt"

In the reason field we record why the node was interrupted: SpotInterrupt, ScheduledChange, .... The type could just be Interrupted.

Would this new type need to be added to the upstream karpenter project? Or can we add it in the provider-aws implementation?
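
To make the proposal concrete, here is a minimal Go sketch of how a consumer might read such a condition. Everything in it is an assumption for illustration: the `Interrupted` type and `SpotInterrupt` reason are only the values proposed in this comment and do not exist in Karpenter today, `Condition` is a local stand-in for a Kubernetes-style status condition, and `interruptionReason` is a hypothetical helper.

```go
package main

import "fmt"

// Condition is a local stand-in for a Kubernetes-style status condition,
// reduced to the fields this sketch needs.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// interruptionReason (hypothetical helper) returns the Reason of the
// proposed "Interrupted" condition when its status is "True".
func interruptionReason(conds []Condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Interrupted" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	// Conditions as they might appear on a NodeClaim under this proposal.
	conds := []Condition{
		{Type: "Launched", Status: "True"},
		{Type: "Interrupted", Status: "True", Reason: "SpotInterrupt"},
	}
	if reason, ok := interruptionReason(conds); ok {
		fmt.Println("node interrupted:", reason)
	} else {
		fmt.Println("no interruption recorded")
	}
}
```

Keeping the type fixed and varying only the reason means consumers can match on one condition type and treat new reasons as data.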

stijndehaes commented 2 months ago

@jonathan-innis just a reminder to give me some feedback :)