Open GnatorX opened 8 months ago
Alternatively, it could just be a simple setting, useSpot, that indicates the user intends to use spot. That would prevent the default path of Karpenter picking spot instances even if the user isn't prepared for interruptions.
Is the problem here that you have a group of users in your cluster that have access to a NodePool to add Spot as an allowed capacity type, but as an Admin, you've decided to not use the interruption queue since you don't want spot being used?
Personally, I'm not a fan of a useNodeTerminationHandler setting, or something similar, as it adds another override that exists essentially only to combat the case where I, as an admin, have made a mistake. It makes even less sense if the Admin is the sole controller of the NodePool, since they would then have a global override over a NodePool field that they've set themselves.
In this case we had a user managing their cluster with Karpenter who didn't know that, by default, Karpenter allows spot to be utilized even when nothing has been configured to handle spot interruptions, and they didn't actually want to use spot. That caused a bunch of workloads running on the cluster to be disrupted ungracefully. Only after the fact did they realize that the default path for both the node pool definition and the Karpenter configuration allows spot to be used without any check that the user actually wanted it. This user has some familiarity with Karpenter, so it seems to me that it's too easy at the moment for users to accidentally run Karpenter in a dangerous configuration.
There is currently nothing in the docs that describes what an interruption queue even is. That is extremely confusing because it didn't exist in previous versions of Karpenter, it now needs to be set up, and there are no error messages or warnings if you don't have an interruption queue configured properly.
We need better clarification on:
Just assuming that everyone is applying the cloudformation template and people will never have to think about the interruption queue is NOT the answer.
Description
What problem are you trying to solve? We ran into an issue where we had set up neither an interruption queue nor a node termination handler, and we hadn't restricted nodes to on-demand only. This meant pods and nodes were disrupted ungracefully when nodes were created as spot instances and we had nothing in place to handle their termination. Karpenter shouldn't allow users to do this accidentally.
Karpenter should introduce a setting, useNodeTerminationHandler, that must be set to true if interruptionQueue isn't set. If the user has set neither setting, Karpenter won't allow spot instances to be created. I am open to changing the setting's name to make it clearer.
How important is this feature to you?