aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

NodeClaims stranded even after NodePool deletion #6905

Open k24dizzle opened 2 months ago

k24dizzle commented 2 months ago

Description

Observed Behavior:

Screenshot 2024-08-30 at 7 15 41 PM

Expected Behavior:

Reproduction Steps (Please include YAML):

Versions:

engedaam commented 2 months ago

Are the underlying nodes deleted?

k24dizzle commented 2 months ago

Are the underlying nodes deleted?

omni eks-node-viewer --resources cpu --extra-labels karpenter.sh/nodepool --node-sort karpenter.sh/nodepool

Screenshot 2024-09-02 at 11 06 04 AM

The nodes still exist, but stuck in Deleting.

engedaam commented 2 months ago

Are there any pods that may be stuck deleting on those nodes?

hamishforbes commented 2 months ago

I'm running into this same issue upgrading to Karpenter 1.0. I haven't deleted my NodePool, but it has been changed to the v1 custom resource.

So it looks like Karpenter is struggling because the owner reference on the NodeClaim is for the v1beta1 version of the NodePool?

> k get nodeclaims -o custom-columns='APIVER:.apiVersion,NAME:.metadata.name,OWNER_API_VER:.metadata.ownerReferences[0].apiVersion,OWNERKIND:.metadata.ownerReferences[0].kind'
APIVER            NAME            OWNER_API_VER          OWNERKIND
karpenter.sh/v1   default-6nd8t   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-7bxsr   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-9chj5   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-ct54l   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-d6xv7   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-d98d8   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-d9kpt   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-gpjpg   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-j5fdd   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-j6wxw   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-jr5qf   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-l2fml   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-lk8bd   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-m2r7j   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-m5vdt   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-m7mtb   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-mk6r6   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-mnndr   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-mz62l   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-pkdl4   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-ptfm8   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-s9df7   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-ttzk8   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   default-wfpbj   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-wwr7z   karpenter.sh/v1        NodePool
karpenter.sh/v1   default-xphxg   karpenter.sh/v1beta1   NodePool

I've got empty nodes that should be terminated dangling around, with the EC2 instances still running.

Non-terminated Pods:          (5 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                         ------------  ----------  ---------------  -------------  ---
  kube-system                 aws-node-termination-handler-zcnfs           10m (0%)      100m (2%)   64Mi (0%)        64Mi (0%)      3d15h
  kube-system                 aws-node-zlkl9                               50m (1%)      0 (0%)      0 (0%)           0 (0%)         3d15h
  kube-system                 istio-cni-node-jljp6                         50m (1%)      200m (5%)   200Mi (2%)       400Mi (5%)     3d15h
  kube-system                 kube-proxy-pfzxq                             100m (2%)     0 (0%)      0 (0%)           0 (0%)         3d15h
  prometheus                  prometheus-prometheus-node-exporter-mtfdf    50m (1%)      150m (3%)   32Mi (0%)        32Mi (0%)      3d15h

Events:
  Type     Reason                 Age                    From       Message
  ----     ------                 ----                   ----       -------
  Normal   Unconsolidatable       30m (x321 over 3d15h)  karpenter  SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
  Normal   DisruptionBlocked      26m                    karpenter  Cannot disrupt Node: not all pods would schedule, knative-eventing/kafka-broker-receiver-746fb7d66f-679g6 => would schedule against uninitialized nodeclaim/default-mk6r6
  Normal   DisruptionTerminating  22m                    karpenter  Disrupting Node: Underutilized/Delete
  Warning  FailedDraining         22m                    karpenter  Failed to drain node, 15 pods are waiting to be evicted
  Normal   DisruptionBlocked      14m (x5 over 22m)      karpenter  Cannot disrupt Node: state node is marked for deletion
  Normal   DisruptionBlocked      2m27s (x6 over 12m)    karpenter  Cannot disrupt Node: state node is marked for deletion
  Normal   DisruptionBlocked      92s                    karpenter  Cannot disrupt Node: state node is marked for deletion
  Normal   DisruptionBlocked      69s                    karpenter  Cannot disrupt Node: state node is marked for deletion

The major problem this is causing me is that these dangling nodes often still have PVs mounted, which is preventing those pods from being re-scheduled on new nodes (the old multi-attach EBS controller problem that I'm upgrading to 1.0 to try to fix...).

Manually terminating the EC2 instance does eventually cause everything to clean up, as does manually updating the apiVersion of the NodePool in the owner reference on the NodeClaim.
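
Concretely, that workaround looks something like this (a minimal sketch; the NodeClaim name is just one from the listing above, and it assumes the v1beta1 NodePool is the first owner reference):

# Check which apiVersion the owner reference currently points at
kubectl get nodeclaim default-6nd8t \
  -o jsonpath='{.metadata.ownerReferences[0].apiVersion}{"\n"}'

# Re-point the owner reference at the v1 NodePool API
kubectl patch nodeclaim default-6nd8t --type='json' \
  -p='[{"op":"replace","path":"/metadata/ownerReferences/0/apiVersion","value":"karpenter.sh/v1"}]'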

engedaam commented 2 months ago

@hamishforbes are you seeing the empty nodes being deleted in the logs by Karpenter? It also seems like the nodes are not fully empty, no?

  Warning  FailedDraining  22m  karpenter  Failed to drain node, 15 pods are waiting to be evicted

hamishforbes commented 2 months ago

No, that's the problem. Any nodeclaim where the ownerRef is for the old v1beta1 nodepool does not get deleted. Nodes provisioned after the 1.0 upgrade with an ownerRef for the v1 nodepool are fine.

Yes, it says that 20 minutes ago there were 15 pods draining, but as you can see there are only DaemonSet pods left on that node now. If I fix the NodeClaim ownerRef, Karpenter immediately terminates the EC2 instance and cleans up.

k24dizzle commented 2 months ago

Are there any pods that may be stuck deleting on those nodes?

No, just some running daemonsets.

Screenshot 2024-09-02 at 10 48 14 PM

I'm experiencing this in clusters where only Karpenter 1.0 exists, so I don't think it's related to upgrading:

% kubectl get nodeclaims -o custom-columns='APIVER:.apiVersion,NAME:.metadata.name,OWNER_API_VER:.metadata.ownerReferences[0].apiVersion,OWNERKIND:.metadata.ownerReferences[0].kind'
APIVER            NAME          OWNER_API_VER     OWNERKIND
karpenter.sh/v1   infra-4swmq   karpenter.sh/v1   NodePool
karpenter.sh/v1   infra-gp9d4   karpenter.sh/v1   NodePool
karpenter.sh/v1   infra-vxcpn   karpenter.sh/v1   NodePool

k24dizzle commented 2 months ago

I think it has something to do with the karpenter.sh/termination finalizer on the NodeClaim. Is there a way I can manually trigger the finalizer to run again (even after the NodePool is deleted)? It isn't clear to me what the finalizer is doing, or whether it is rerunning and retrying. I don't see any signal in the logs that it is continuing to run, which makes me think it is stuck.
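
A rough way to check, assuming Karpenter runs as a Deployment named karpenter in the kube-system namespace (adjust to your install), using one of the NodeClaim names from the listing above:

# See which finalizers are still present on the stuck NodeClaim
kubectl get nodeclaim infra-4swmq -o jsonpath='{.metadata.finalizers}{"\n"}'

# Look at its status conditions and events
kubectl describe nodeclaim infra-4swmq

# Check whether the termination controller is still reconciling it
kubectl logs -n kube-system deploy/karpenter | grep infra-4swmq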

engedaam commented 2 months ago

This issue seems like a duplicate of https://github.com/kubernetes-sigs/karpenter/issues/1578. Could we track the investigation on that issue? It would make it easier to keep track.

k24dizzle commented 2 months ago

I think it's slightly different. I've experienced this issue:

engedaam commented 2 months ago

@k24dizzle What version of karpenter are you running?

k24dizzle commented 2 months ago

v1.0.0

aquam8 commented 2 months ago

I have the same problem as the author.

I was running v0.37.2 with webhook.enabled. I upgraded to 1.0.1 (for the CRDs and the app) and updated the IAM policy, but kept my manifests for the NodePool and EC2NodeClass as v1beta1. All was well at that stage.

The next step was updating my manifests for the NodePool and EC2NodeClass to reference the v1 CRD so I could leverage the new budget reasons. But as soon as I updated the manifests to v1 and applied the changes, I ran into issues.

Karpenter logs:

{"level":"ERROR","time":"2024-09-09T06:49:17.087Z","logger":"controller","message":"failed listing instance types for mixed-1","commit":"62a726c","controller":"disruption","namespace":"","name":"","reconcileID":"583102f1-95d3-48f1-99b0-0a76cb430d69","error":"resolving node class, ec2nodeclasses.karpenter.k8s.aws \"mx51-eks\" is terminating, treating as not found
"}
{"level":"ERROR","time":"2024-09-09T06:49:18.369Z","logger":"controller","message":"nodePool not ready","commit":"62a726c","controller":"provisioner","namespace":"","name":"","reconcileID":"9bceb5be-935b-4d2d-b587-8154a3b8e17e","NodePool":{"name":"mixed-1"}}
{"level":"INFO","time":"2024-09-09T06:49:18.369Z","logger":"controller","message":"no nodepools found","commit":"62a726c","controller":"provisioner","namespace":"","name":"","reconcileID":"9bceb5be-935b-4d2d-b587-8154a3b8e17e"}

The nodepool fails with Failed resolving NodeClass.

The nodeclass fails with WaitingOnNodeClaimTermination - Waiting on NodeClaim termination for mixed-1-q25pf, mixed-1-dv8cp

k get nodeclaims -o custom-columns='APIVER:.apiVersion,NAME:.metadata.name,OWNER_API_VER:.metadata.ownerReferences[0].apiVersion,OWNERKIND:.metadata.ownerReferences[0].kind'
APIVER            NAME            OWNER_API_VER          OWNERKIND
karpenter.sh/v1   mixed-1-dv8cp   karpenter.sh/v1beta1   NodePool
karpenter.sh/v1   mixed-1-q25pf   karpenter.sh/v1beta1   NodePool

Recovery from there is painful and hit-and-miss: I can't get a new node to register until I kill all existing nodes or remove the karpenter.sh/termination finalizer on the NodeClaims/Nodes. Sometimes I have had to re-add the EC2NodeClass for everything to get going again. But of course this is highly disruptive and not suitable for a PROD upgrade.
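
For reference, the finalizer removal looks roughly like this (last resort only; the NodeClaim name is one from my listing above, and the backing EC2 instance may be left running and need to be terminated by hand afterwards):

# Clear the finalizers on a stuck NodeClaim so the object can be deleted
kubectl patch nodeclaim mixed-1-dv8cp --type=merge -p '{"metadata":{"finalizers":null}}'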

The way I update/apply the manifests is through Terraform (IaC), like this:

resource "kubectl_manifest" "karpenter_node_pool_ondemand_1" {
  yaml_body          = <<-YAML
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: ondemand-1
    spec:
      # ...
  YAML
}

where the NodePool apiVersion is changed from karpenter.sh/v1beta1 to karpenter.sh/v1, and the same for the EC2NodeClass. No other changes. I can try to split the changes so that I only do it for the NodePool or the EC2NodeClass, not both, if you think that would help with troubleshooting.

I'd appreciate any assistance on how to address this last leg of the upgrade.

hontarenko commented 1 week ago

Any updates?

sergii-auctane commented 1 day ago

This issue seems like a duplicate of kubernetes-sigs/karpenter#1578. Could we track the investigation on that issue? It would make it easier to keep track.

It's nothing like that issue. I'm using version 1.0.6 and have tons of empty nodes stuck in:

  Warning  FailedDraining     7m30s (x4116 over 6d3h)  karpenter  Failed to drain node, 9 pods are waiting to be evicted
  Normal   DisruptionBlocked  2m16s (x7379 over 10d)   karpenter  Cannot disrupt Node: state node is marked for deletion

And those 9 pods are DaemonSet pods. I observe this issue with non-default node pools only: I run consolidation nightly, and the default NodePool consolidates, but another one does not.