Closed: johngmyers closed this issue 1 year ago.
This was a mistake in a recent version of the coredns EKS add-on. The most recent version removes this annotation by default.
@johngmyers can you upgrade to the most recent version and see if this fixes things?
Upgrading the coredns EKS addon indeed removes that annotation.
For things other than the three base EKS addons, this can probably be addressed with a Gatekeeper rule preventing admission. One could even limit the scope of the rule to Deployments and StatefulSets, under the presumption that other types of Pods are likely to complete.
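A minimal sketch of what such a rule could look like, assuming OPA Gatekeeper v3; the template and constraint names are illustrative, and the check only inspects the workload's pod template:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenydonotevict
spec:
  crd:
    spec:
      names:
        kind: K8sDenyDoNotEvict
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenydonotevict

        # Reject any workload whose pod template carries the annotation.
        violation[{"msg": msg}] {
          input.review.object.spec.template.metadata.annotations["karpenter.sh/do-not-evict"]
          msg := "setting karpenter.sh/do-not-evict is not allowed on this cluster"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyDoNotEvict
metadata:
  name: deny-do-not-evict
spec:
  match:
    kinds:
      # Scope the rule to long-running workload types, per the suggestion above.
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
```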
The disadvantage of the annotation compared to an unsatisfiable PodDisruptionBudget is that it is not known to the Kubernetes eviction API and thus won't prevent eviction by things other than Karpenter that aren't specifically coded to know about it.
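For comparison, the unsatisfiable PDB mentioned here is honored by the eviction API itself. A minimal sketch, assuming the default `k8s-app: kube-dns` label that EKS puts on coredns pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-block-eviction
  namespace: kube-system
spec:
  maxUnavailable: 0  # no voluntary disruption is ever permitted
  selector:
    matchLabels:
      k8s-app: kube-dns  # assumed label; matches the default coredns deployment
```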
> The disadvantage of the annotation compared to an unsatisfiable PodDisruptionBudget is that it is not known to the Kubernetes eviction API and thus won't prevent eviction by things other than Karpenter that aren't specifically coded to know about it.
That's true. The fact that the eviction API only has to reckon with PDBs makes it simpler to understand. We're currently aligning with SIG-autoscaling on this, so that this `do-not-disrupt` annotation can at least be identical across different node autoscaling solutions.

With regard to the issue title, I'm not sure if allowing a way to ignore the `do-not-evict` annotation is the right path. Since the `do-not-evict` annotation is a kind of pod-level override, I worry about a provisioner-level or global-level override to override that override. It becomes much harder to debug and reason about within a cluster.
Since the issue that you had was fixed, I'm going to close this, but feel free to ping to re-open if you still want this feature so that we can track it.
@njtran can we please reopen this issue?
This annotation basically allows application owners to override the cluster administrator, without any way to disable it at the cluster level or node pool level.

I don't agree that Gatekeeper is a viable solution: it's a workaround rather than a fix for the source of the problem, and it forces the cluster admin to install yet another tool in the cluster just for the sake of disabling the annotation.

Another possible solution would be to add a grace period parameter at the cluster or node pool level, so that an annotated pod could not block the rollout past `n` seconds.
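Purely as an illustration of that idea, and assuming Karpenter's v1beta1 NodePool API, such a knob might look like the following; the `doNotDisruptGracePeriod` field is invented and does not exist in the actual API:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Hypothetical field: after this long, pods annotated with
    # karpenter.sh/do-not-disrupt would stop blocking node replacement.
    doNotDisruptGracePeriod: 10m
```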
> Another possible solution would be to add a grace period parameter at the cluster or node pool level, so that an annotated pod could not block the rollout past `n` seconds.
I think the best discussion that captures the concern you are raising is here: https://github.com/kubernetes-sigs/karpenter/issues/752, which covers setting a grace period for `karpenter.sh/do-not-disrupt` so that you can keep misconfigured or long-running pods from holding nodes open. I'd recommend adding to the discussion there. The cluster-admin role and the ability to configure what application developers can do here is definitely interesting. I would ask, though: how do you control this for PDBs, since you could have a similar problem there? Do you manage this with policy or some other mechanism? I would expect these problems to be roughly similar.
Thanks @jonathan-innis, #752 is indeed relevant.
You are raising a good point regarding PDBs; here are my thoughts:

- The kubectl API allows ignoring PDBs (the `--disable-eviction` flag for `kubectl drain`, added in https://github.com/kubernetes/kubernetes/pull/85571); this is an example of giving the cluster admin the power to decide whether or not it is OK to disrupt a node.
- PDB misconfiguration can be addressed by limiting the RBAC permissions.
- I do not think pods created by Jobs, or pods not owned by a workload, should be treated the same as Deployment/StatefulSet pods; such pods are more ephemeral in nature.
> PDB misconfiguration can be addressed by limiting the RBAC permissions.

Actually, this is a really strong point. I'm not sure how different the relevant permissions are in actuality between PDB create and pod create (I imagine they are roughly similar, since I feel like it would be a nightmare for a cluster admin to define these PDBs on a mega-cluster with a ton of applications with different requirements around uptime). Either way, there is at least a mechanism to manage this through RBAC; the `karpenter.sh/do-not-disrupt` annotation doesn't offer the same level of delineation.
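To make that delineation concrete, here is a minimal sketch (names are illustrative) of a namespaced Role that lets an app team manage workloads while deliberately granting nothing on `poddisruptionbudgets`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-developer
  namespace: team-a  # illustrative namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # No rule covers policy/poddisruptionbudgets, so holders of this Role
  # cannot create PDBs. There is no equivalent lever for the annotation,
  # because anyone who can create a pod can set its annotations.
```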
> I do not think pods created by Jobs, or pods not owned by a workload, should be treated the same as Deployment/StatefulSet pods; such pods are more ephemeral in nature.
This one is really hard for us to say. I could see a completely different user feeling the complete opposite based on their use case. Realistically, I don't think it's very common to set PDBs on Jobs anyway, but I don't think we want to be in the business of ignoring the user's intent if they did set PDBs on their Jobs.
> The kubectl API allows ignoring PDBs; this is an example of giving the cluster admin the power to decide whether or not it is OK to disrupt a node.
We chose to do the same thing when it comes to `karpenter.sh/do-not-disrupt`: this annotation currently doesn't block the eviction flow for any pods that have been directly force deleted.
Realistically, what I think this all boils down to is a need for cluster-admin control over how long something can be in a disrupted state, combined with how long something can be draining on the cluster.
For the forceDrain idea, this is currently being discussed through an RFC here: https://github.com/kubernetes-sigs/karpenter/pull/834. Most likely, we're going to accept some form of this idea into the project here soon.
The other piece is the need to say, "This node has been Drifted for 1d and has been blocked by PDBs or the `karpenter.sh/do-not-disrupt` annotation for that whole time; I'd like you to just forge ahead and ignore those things, please." We don't currently have handling for that today; though, I think for someone who wants cluster-admin-level control over what their app developers can do on the cluster, it makes a lot of sense to allow this kind of behavior. Maybe something in the API like:

```yaml
disruption:
  tolerationDuration: 1d # How long to tolerate a disruption condition existing and not being acted upon
```
@jonathan-innis the RFC is about force-terminating a node for any reason, including PDBs, which could cause downtime. Is there going to be a separate feature for ignoring only the `karpenter.sh/do-not-disrupt` annotation after `n` minutes?
The issue you're talking about would be here: https://github.com/kubernetes-sigs/karpenter/issues/752
Description
What problem are you trying to solve?
We want to remediate security vulnerabilities by having Karpenter replace all of its nodes with ones using the currently-configured AMI. We want this replacement to be done without requiring manual intervention.
Unfortunately, a workload, the EKS add-on for coredns, abusively places a `karpenter.sh/do-not-evict` annotation on its pods, preventing the nodes they run on from being remediated without manual intervention.
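For reference, this is roughly what the offending annotation looks like on the affected pods (a sketch; the pod name and most other fields are elided):

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: kube-system
  annotations:
    # Placed by the coredns EKS add-on; blocks Karpenter from draining the node.
    karpenter.sh/do-not-evict: "true"
```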
How important is this feature to you?

The need to perform a manual procedure on a per-cluster basis results in a substantial increase in operational cost.