aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0
6.56k stars 909 forks

Can not properly delete NodeClass #6462

Open muckelba opened 1 month ago

muckelba commented 1 month ago

Description

Observed Behavior: When a NodeClass is deleted, Karpenter tries to delete its NodeClaims (Waiting on NodeClaim termination for common-xdbj9, common-vvclr, common-2ppgb), but they suddenly can't find their NodePool anymore (Cannot disrupt NodeClaim: Owning nodepool "common" not found). As soon as the deletion is issued, Karpenter only logs: resolving node class, ec2nodeclasses.karpenter.sh "default" is terminating, treating as not found.

Expected Behavior: The NodeClaims are deleted first, and then the NodeClass.

Reproduction Steps (Please include YAML): kubectl delete ec2nodeclasses.karpenter.k8s.aws default

Versions:

jmdeal commented 1 month ago

Disruption refers to voluntary disruption modes: e.g. Drift, Expiration, and Consolidation. None of these can take place when the NodePool or NodeClass does not exist, hence why Karpenter can't disrupt the NodeClaim. That doesn't mean Karpenter can't terminate the NodeClaim. Deleting the NodeClass should result in Karpenter setting a deletion timestamp on each NodeClaim associated with that NodeClass, and those NodeClaims will gracefully terminate. Graceful termination isn't bounded; blocking PDBs can prevent a NodeClaim from terminating indefinitely.
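To check the point about blocking PDBs, here is a minimal Python sketch that lists the PodDisruptionBudgets currently allowing zero disruptions. It assumes the input is the parsed JSON from kubectl get pdb -A -o json; the function name blocking_pdbs and the sample data are made up for illustration and are not part of Karpenter. A PDB in this list only blocks a node if it actually selects pods running on that node, so treat the output as a first filter rather than a diagnosis.

```python
def blocking_pdbs(pdb_list):
    """Return (namespace, name) for every PDB that allows zero disruptions.

    pdb_list is assumed to be the parsed output of
    `kubectl get pdb -A -o json` (a Kubernetes List object).
    """
    blocked = []
    for pdb in pdb_list.get("items", []):
        meta = pdb.get("metadata", {})
        status = pdb.get("status", {})
        # status.disruptionsAllowed is the number of pods that may be
        # evicted right now; a missing value is treated as 0 here.
        if status.get("disruptionsAllowed", 0) == 0:
            blocked.append((meta.get("namespace", ""), meta.get("name", "")))
    return blocked


# Illustrative sample data (hypothetical names, not taken from a real cluster).
sample = {
    "items": [
        {"metadata": {"namespace": "apps", "name": "web"},
         "status": {"disruptionsAllowed": 0}},
        {"metadata": {"namespace": "apps", "name": "worker"},
         "status": {"disruptionsAllowed": 2}},
    ]
}

for namespace, name in blocking_pdbs(sample):
    print(f"{namespace}/{name} currently allows 0 disruptions")
```

Against a real cluster, the sample dict could be replaced with json.load(sys.stdin) and the kubectl output piped in.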

If you're able to share Karpenter logs and the NodeClaim resources we should be able to determine if Karpenter is operating correctly. If it is and you want to be able to set an upper bound on termination time, you'll probably be interested in https://github.com/kubernetes-sigs/karpenter/pull/916 which just merged in the upstream repo.
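When collecting the NodeClaim resources, a quick summary of which ones reference the deleted NodeClass, whether a deletion timestamp has been set, and which finalizers remain can be useful. The Python sketch below assumes the input is the parsed JSON from kubectl get nodeclaims -o json and that each NodeClaim names its NodeClass in spec.nodeClassRef.name. The function name summarize_nodeclaims is hypothetical, and the sample finalizer is a placeholder rather than a real Karpenter value.

```python
def summarize_nodeclaims(nodeclaim_list, node_class_name):
    """Summarize NodeClaims that reference the given NodeClass.

    nodeclaim_list is assumed to be the parsed output of
    `kubectl get nodeclaims -o json`. The spec.nodeClassRef.name
    path is an assumption about the NodeClaim schema.
    """
    rows = []
    for claim in nodeclaim_list.get("items", []):
        meta = claim.get("metadata", {})
        ref = claim.get("spec", {}).get("nodeClassRef", {})
        if ref.get("name") != node_class_name:
            continue
        rows.append({
            "name": meta.get("name", ""),
            # metadata.deletionTimestamp is set once deletion was requested.
            "terminating": "deletionTimestamp" in meta,
            # Remaining finalizers keep the object from being removed.
            "finalizers": meta.get("finalizers", []),
        })
    return rows


# Illustrative sample data; timestamp and finalizer are placeholders.
sample_claims = {
    "items": [
        {"metadata": {"name": "common-xdbj9",
                      "deletionTimestamp": "2024-01-01T00:00:00Z",
                      "finalizers": ["example.com/placeholder-finalizer"]},
         "spec": {"nodeClassRef": {"name": "default"}}},
        {"metadata": {"name": "other-abcde"},
         "spec": {"nodeClassRef": {"name": "other"}}},
    ]
}

for row in summarize_nodeclaims(sample_claims, "default"):
    print(row["name"], "terminating:", row["terminating"],
          "finalizers:", row["finalizers"])
```

A NodeClaim that shows terminating: True with finalizers still present is waiting on whatever owns those finalizers, which narrows down where to look in the logs.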

muckelba commented 1 month ago

Hey, thank you for your explanation. I just did some more testing: even without any PDBs in the cluster (except for Karpenter's, but that's running on Fargate), the nodes won't terminate.

That's everything I can find relating to the deletion.


How does the Karpenter release process work? Is there first the merge in kubernetes-sigs/karpenter, and then the cloud-specific providers (AWS in this case) have to implement and release it too?