We run GitHub Actions self-hosted runners on spot instances. When a spot interruption occurs, Karpenter evicts the runner Pod, and it is rescheduled onto a new Node. This wastes EC2 cost, because the runner Pod is not re-runnable.
That is,
1. The controller (actions-runner-controller) creates a runner Pod.
2. A spot interruption occurs in AWS.
3. Karpenter evicts the runner Pod, and it is rescheduled onto a new Node. This may launch a new EC2 instance. 💰
4. The new runner Pod starts but eventually exits with an error; the job is not re-runnable.
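For context, this is a minimal sketch of the kind of spot-only NodePool involved (the name and values are illustrative, not our exact configuration):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: github-runners   # illustrative name
spec:
  template:
    spec:
      requirements:
        # Restrict this pool to spot capacity, which is what makes
        # the runner Pods subject to spot interruptions.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```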
I'm not entirely against this, but we'd need an RFC to know how this might be implemented/configured. Do you have any thoughts on how you'd best want to do that? Are you willing to write an RFC for this?
Description
What problem are you trying to solve?
It would be nice if a NodePool supported a cordon-only mode instead of eviction. I found a related issue: https://github.com/aws/karpenter-provider-aws/issues/3604.
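To seed the RFC discussion, here is one hypothetical shape the configuration could take. The `interruptionPolicy` field and its `CordonOnly` value are invented for this sketch; Karpenter does not support them today:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: github-runners   # illustrative name
spec:
  disruption:
    # Hypothetical field for this proposal: on a spot interruption,
    # only cordon the Node so no replacement capacity is launched,
    # instead of evicting (draining) the running runner Pods.
    interruptionPolicy: CordonOnly   # invented value; the default would be today's drain behavior
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```

The idea is that the interrupted instance is reclaimed by AWS anyway, so cordoning (rather than draining) lets the in-flight job fail in place without Karpenter launching a replacement Node for Pods that cannot be re-run.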
How important is this feature to you?
This feature would reduce our EC2 cost, because no new instance would be launched upon a spot interruption.
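As a partial mitigation in the meantime, runner Pods can carry Karpenter's `karpenter.sh/do-not-disrupt` annotation so that voluntary disruption (consolidation, drift) leaves running jobs alone. To my understanding this does not stop the drain triggered by a spot interruption, which is why a NodePool-level option is still needed. A sketch using the legacy actions-runner-controller `RunnerDeployment` CRD (names are illustrative):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment   # illustrative name
spec:
  replicas: 2
  template:
    metadata:
      annotations:
        # Asks Karpenter not to voluntarily disrupt the Node while this
        # Pod is running. It does not, as far as I know, prevent the
        # drain on a spot interruption warning.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      repository: example-org/example-repo   # illustrative repository
```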