aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Configuration option to ignore the `do-not-evict` annotation #4662

Closed johngmyers closed 1 year ago

johngmyers commented 1 year ago

Description

What problem are you trying to solve?

We want to remediate security vulnerabilities by having Karpenter replace all of its nodes with ones using the currently-configured AMI. We want this replacement to be done without requiring manual intervention.

Unfortunately, one workload, the coredns EKS add-on, abusively places a `karpenter.sh/do-not-evict` annotation on its pods, preventing the nodes they run on from being remediated without manual intervention.
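For context, this is roughly what the blocking annotation looks like on an affected pod (a sketch; the pod name and image shown are illustrative):

```yaml
# Illustrative pod carrying the annotation that prevents Karpenter from
# voluntarily evicting it, and therefore from draining its node.
apiVersion: v1
kind: Pod
metadata:
  name: coredns-example            # illustrative name
  namespace: kube-system
  annotations:
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: coredns
      image: registry.k8s.io/coredns/coredns:v1.10.1   # illustrative image/tag
```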

How important is this feature to you?

The need to perform a manual procedure on a per-cluster basis results in a substantial increase in operational cost.

mikestef9 commented 1 year ago

This was a mistake in a recent version of the coredns eks add-on. The most recent version removes this annotation by default.

njtran commented 1 year ago

@johngmyers can you upgrade to the most recent version and see if this fixes things?

johngmyers commented 1 year ago

Upgrading the coredns EKS addon indeed removes that annotation.

For things other than the three base EKS add-ons, this can probably be addressed with a Gatekeeper rule preventing admission. One could even limit the scope of the rule to Deployments and StatefulSets, under the presumption that other types of Pods are likely to complete.
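Roughly, such a Gatekeeper rule could look like the following (a sketch only; the template name, message, and Rego are illustrative, with the constraint scoped to Deployments and StatefulSets as described above):

```yaml
# Sketch of a Gatekeeper ConstraintTemplate that rejects workloads whose pod
# template carries the karpenter.sh/do-not-evict annotation.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenydonotevict
spec:
  crd:
    spec:
      names:
        kind: K8sDenyDoNotEvict
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenydonotevict

        violation[{"msg": msg}] {
          input.review.object.spec.template.metadata.annotations["karpenter.sh/do-not-evict"]
          msg := "the karpenter.sh/do-not-evict annotation is not permitted on managed workloads"
        }
---
# Constraint limiting the rule to Deployments and StatefulSets.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyDoNotEvict
metadata:
  name: deny-do-not-evict
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
```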

The disadvantage of the annotation compared to an unsatisfiable PodDisruptionBudget is that it is not known to the Kubernetes eviction API and thus won't prevent eviction by things other than Karpenter that aren't specifically coded to know about it.
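For reference, an "unsatisfiable" PDB of the kind mentioned above might look like this (a sketch; the selector label is hypothetical):

```yaml
# With maxUnavailable: 0, no voluntary eviction of the selected pods can ever
# succeed, so the eviction API itself blocks node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-eviction
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-critical-app   # hypothetical label
```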

njtran commented 1 year ago

The disadvantage of the annotation compared to an unsatisfiable PodDisruptionBudget is that it is not known to the Kubernetes eviction API and thus won't prevent eviction by things other than Karpenter that aren't specifically coded to know about it.

That's true. The fact that the eviction API only has to reckon with PDBs makes it simpler to understand. We're currently aligning with SIG-autoscaling on this, so that this do-not-disrupt annotation can at least be identical across different node autoscaling solutions.

With regard to the issue title, I'm not sure if allowing a way to ignore the do-not-evict annotation is the right path. Since the do-not-evict annotation is a kind of pod-level override, I worry about a provisioner/global-level override to override that override. It becomes much harder to debug and reason about within a cluster.

Since the issue that you had was fixed, I'm going to close this, but feel free to ping to re-open if you still want this feature so that we can track it.

dudicoco commented 11 months ago

@njtran can we please reopen this issue?

This annotation essentially allows application owners to override the cluster administrator, with no way to disable it at the cluster level or node pool level.

I don't agree that Gatekeeper is a viable solution: it's a workaround rather than a fix for the source of the problem, and it forces the cluster admin to install yet another tool in the cluster just for the sake of disabling the annotation.

Another possible solution to the issue would be to add a grace period parameter at the cluster or node pool level that would not allow an annotated pod to block the rollout past n seconds.

jonathan-innis commented 11 months ago

Another possible solution to the issue would be to add a grace period parameter at the cluster or node pool level that would not allow an annotated pod to block the rollout past n seconds

I think the best discussion capturing the concern you're raising is https://github.com/kubernetes-sigs/karpenter/issues/752, which is about setting a grace period for karpenter.sh/do-not-disrupt so that misconfigured or long-running pods can't hold nodes open. I'd recommend adding to the discussion there. The cluster-admin role and the ability to configure what application developers can do here is definitely interesting. I would ask, though: how do you control this for PDBs, since you could have a similar problem there? Do you manage this with policy or some other mechanism? I would expect these problems to be roughly similar.

dudicoco commented 10 months ago

Thanks @jonathan-innis, #752 is indeed relevant.

You are raising a good point regarding PDBs, here are my thoughts:

  1. There are two main reasons that a PDB may block a node from draining: misconfiguration and unhealthy pods
  2. PDB misconfiguration can be addressed by limiting, via RBAC, who is allowed to create/update PDBs in the cluster (see the sketch after this list)
  3. Unhealthy pods in a deployment/statefulset should block a node from draining, since proceeding could cause downtime
  4. I do not think pods created by jobs, or pods not owned by a workload, should be treated the same as deployment/statefulset pods; such pods are more ephemeral in nature
  5. kubectl allows ignoring PDBs; this is an example of giving the cluster admin the power to decide whether or not it is OK to disrupt a node: https://github.com/kubernetes/kubernetes/pull/85571
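Regarding point 2, here's a minimal RBAC sketch (role and group names are hypothetical): only subjects bound to this role may manage PDBs, and application developer roles would simply omit these verbs.

```yaml
# Hypothetical ClusterRole: only its holders (e.g. the platform team) may manage PDBs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pdb-admin
rules:
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pdb-admin-binding
subjects:
  - kind: Group
    name: platform-admins          # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pdb-admin
  apiGroup: rbac.authorization.k8s.io
```
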
jonathan-innis commented 10 months ago

PDB misconfiguration can be addressed by limiting the RBAC permissions

Actually, this is a really strong point. I'm not sure how different the relevant permissions are in practice between PDB create and pod create (I imagine they are roughly similar, since it would be a nightmare for a cluster admin to define these PDBs themselves on a mega-cluster with a ton of applications that have different uptime requirements). Either way, there is at least a mechanism to manage this through RBAC; the karpenter.sh/do-not-disrupt annotation doesn't offer the same level of delineation.

I do not think pods created by jobs or pods not owned by a workload should be treated the same as deployment/statefulset pods, such pods are more ephemeral in nature

This one is really hard for us to say. I could see a different user feeling the exact opposite about this based on their use case. Realistically, I don't think it's very common to set PDBs on jobs anyway, but I don't think we want to be in the business of ignoring a user's intent when it comes to setting PDBs on their jobs if that's what they did.

The kubectl api allows to ignore PDBs, this is an example of giving the power to the cluster admin to decide if it is ok

We chose to do the same thing when it comes to karpenter.sh/do-not-disrupt. This annotation currently doesn't block the eviction flow for any pods that have been directly force deleted.


Realistically, what I think this all boils down to is a need for cluster-admin control over how long something can be in a disrupted state, combined with how long something can be draining on the cluster.

For the forceDrain idea, this is currently being discussed through an RFC here: https://github.com/kubernetes-sigs/karpenter/pull/834. Most likely, we're going to accept some form of this idea into the project soon.

The other piece is the need to say, "This node has been Drifted for 1d and has been blocked by PDBs or the karpenter.sh/do-not-disrupt annotation for that whole time; I'd like you to just forge ahead and ignore those things, please." We don't have handling for that today, though I think for someone who wants cluster-admin-level control over what their app developers can do on the cluster, it makes a lot of sense to allow this kind of behavior. Maybe something in the API like:

```yaml
disruption:
  tolerationDuration: 1d # How long to tolerate a disruption condition existing and not being acted upon
```
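To make the idea concrete, here's a sketch of where such a knob might sit in a v1beta1 NodePool; tolerationDuration is purely hypothetical, while the other disruption fields already exist:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized   # existing field
    expireAfter: 720h                        # existing field
    tolerationDuration: 1d                   # hypothetical: stop honoring blockers after this long
```
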
dudicoco commented 10 months ago

@jonathan-innis the RFC is about force-terminating a node for any reason, including past blocking PDBs, which could cause downtime. Is there going to be a separate feature for ignoring only the karpenter.sh/do-not-disrupt annotation after n minutes?

njtran commented 10 months ago

The issue you're talking about would be here: https://github.com/kubernetes-sigs/karpenter/issues/752