Open stijndehaes opened 4 months ago
I am willing to work on this myself, as I have experience writing Go and Kubernetes operators. Extra support may need to be added to upstream Karpenter, but I am not sure what the best architecture would be.
Would it be enough for Karpenter to fire metrics on the nodes that were interrupted?
Sadly, for our use case it isn't. What we currently do is, when a pod is being shut down, check the node to see whether a spot interrupt is in progress. If we only fire metrics, there is no easy way to query this interactively. Currently we write to the pod's log whether there is a spot interrupt. With metrics we would need another way to visualise it.
What about Kubernetes events? We also fire an event here alongside the metric. I'm skeptical of wanting to change our tainting logic to support an observability use-case. What if we added a condition to the NodeClaim? Would this be enough to satisfy the observability use-case?
I didn't notice there are Kubernetes events about disruption; I could use those! A condition on the NodeClaim would be better, but I will see how far I can get with the events to start with.
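A minimal sketch of what the event-based lookup could look like, using only the Go standard library. It assumes the events have been fetched as an EventList in JSON (for example from the API server) and that the interruption event is attached to the Node object. The function name `nodeEventMessages` is hypothetical, and the reason and message strings in the sample are placeholders, not the actual text Karpenter emits.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event holds the subset of core/v1 Event fields used in this sketch.
type event struct {
	Reason         string `json:"reason"`
	Message        string `json:"message"`
	InvolvedObject struct {
		Kind string `json:"kind"`
		Name string `json:"name"`
	} `json:"involvedObject"`
}

// eventList mirrors the "items" wrapper of an EventList response.
type eventList struct {
	Items []event `json:"items"`
}

// nodeEventMessages (hypothetical helper) returns the messages of all events
// attached to the named Node whose reason equals wantReason.
func nodeEventMessages(raw []byte, nodeName, wantReason string) ([]string, error) {
	var list eventList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("decode event list: %w", err)
	}
	var msgs []string
	for _, e := range list.Items {
		if e.InvolvedObject.Kind == "Node" &&
			e.InvolvedObject.Name == nodeName &&
			e.Reason == wantReason {
			msgs = append(msgs, e.Message)
		}
	}
	return msgs, nil
}

func main() {
	// Hand-written sample data; reason and message are placeholders.
	raw := []byte(`{"items":[
	  {"reason":"ExampleInterruption","message":"spot interruption notice received",
	   "involvedObject":{"kind":"Node","name":"node-a"}},
	  {"reason":"ExampleInterruption","message":"spot interruption notice received",
	   "involvedObject":{"kind":"Node","name":"node-b"}}
	]}`)
	msgs, err := nodeEventMessages(raw, "node-a", "ExampleInterruption")
	if err != nil {
		panic(err)
	}
	fmt.Println(msgs)
}
```

A pod's shutdown hook could call something like this with its own node name and log the result, which keeps the current behaviour of reporting the interrupt in the pod log.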
Closed the PR for now; I can always open a new one for the NodeClaim condition. I will look at that later this week and make a proposal here :)
@jonathan-innis what do you think?
The new condition could look like this:
conditions:
  - lastTransitionTime: "2024-05-10T00:05:07Z"
    status: "True"
    type: Interrupted
    reason: "SpotInterrupt"
In the reason field we add why the node is interrupted: SpotInterrupt, ScheduledChange, .... The type could just be Interrupted.
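To make the proposal concrete, here is a rough sketch of how a consumer could read such a condition. The `Condition` struct is a local stand-in that mirrors the fields in the snippet above, not Karpenter's actual type, and `interruptionReason` is a hypothetical helper. The `Interrupted` type and `SpotInterrupt` reason are the values proposed here and do not exist yet.

```go
package main

import "fmt"

// Condition is a local stand-in mirroring the fields of the proposed
// status condition. It is not the type Karpenter uses.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime string
}

// interruptionReason (hypothetical helper) returns the reason of the proposed
// Interrupted condition when its status is "True". The boolean is false when
// no such condition exists or it is not "True".
func interruptionReason(conds []Condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Interrupted" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	// Sample conditions as they might appear on a NodeClaim under this proposal.
	conds := []Condition{
		{Type: "Initialized", Status: "True"},
		{
			Type:               "Interrupted",
			Status:             "True",
			Reason:             "SpotInterrupt",
			LastTransitionTime: "2024-05-10T00:05:07Z",
		},
	}
	if reason, ok := interruptionReason(conds); ok {
		fmt.Println("node interrupted:", reason)
	} else {
		fmt.Println("no interruption recorded")
	}
}
```

Keeping a single `Interrupted` type and putting the cause in the reason means consumers only need to look up one condition, and new causes can be added without changing the lookup.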
Would this new type need to be added to the upstream karpenter project? Or can we add it in the provider-aws implementation?
@jonathan-innis just a reminder to give me some feedback :)
Description
What problem are you trying to solve?
When a node is being shut down because of a spot interrupt, I want to be able to detect that from within my pod. That way we can provide the correct information on why a pod was shut down. Currently we use AWS Node Termination Handler, which adds different taints depending on why the node is being shut down. I would love to switch to Karpenter for handling spot interrupts; however, the lack of this feature is blocking us.
How important is this feature to you?
This feature is very important: providing this visibility to users is key for the platform we are building.