aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Nodeclaim related to a stopped node was not cleared off #6866

Open ra-grover opened 2 months ago

ra-grover commented 2 months ago

Description

Observed Behavior: A NodeClaim belonging to an AWS node that was stopped with reason `Server.ScheduledStop: Stopped due to scheduled retirement` was not cleared from the cluster. The node backing that claim was no longer in the cluster, yet Karpenter still tried to schedule a pod against that specific NodeClaim even though the underlying node was gone.

Expected Behavior: If a node is not ready for any reason, Karpenter should remove the corresponding NodeClaim so that pods can be scheduled correctly.

Reproduction Steps (Please include YAML):

  1. Stop a node with reason `Server.ScheduledStop: Stopped due to scheduled retirement` (not sure if this can be triggered explicitly; a possible approximation is sketched after this list)
  2. The NodeClaim still exists in the cluster. Output from our EKS cluster:

     k describe nodeclaims general-provisioner-mssqp | grep -i nodename
     Node Name:               ip-10-112-148-87.ec2.internal

     k get nodes ip-10-112-148-87.ec2.internal
     Error from server (NotFound): nodes "ip-10-112-148-87.ec2.internal" not found
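Since the scheduled-retirement stop cannot be triggered on demand, a rough approximation (my assumption, not a confirmed repro) is to stop the backing EC2 instance directly so the Node object disappears while the NodeClaim stays behind; the instance ID below is a placeholder:

    # Look up the instance backing the node, then stop it (placeholder instance ID).
    aws ec2 describe-instances \
      --filters "Name=private-dns-name,Values=ip-10-112-148-87.ec2.internal" \
      --query "Reservations[].Instances[].InstanceId" --output text
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0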


Please let me know if you require the status of the nodeclaim.
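For reference, a minimal sketch for spotting such orphaned NodeClaims, assuming the node name is exposed at `.status.nodeName` as in the describe output above:

    # List every NodeClaim with the node it claims; any entry whose node is missing
    # from `kubectl get nodes` is a stale NodeClaim like the one described here.
    kubectl get nodeclaims -o custom-columns=NAME:.metadata.name,NODE:.status.nodeName
    kubectl get nodes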

**Versions**:
- Chart Version: 0.36.2
- Kubernetes Version (`kubectl version`): `v1.25.11`

* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment
I am interested in working on this issue.
ra-grover commented 2 months ago

Conditions on the nodeclaim:

Conditions:
    Last Transition Time:  2024-08-01T17:00:20Z
    Reason:                AMIDrift
    Severity:              Warning
    Status:                True
    Type:                  Drifted
    Last Transition Time:  2024-08-01T21:17:21Z
    Severity:              Warning
    Status:                True
    Type:                  Expired
    Last Transition Time:  2024-07-12T21:18:26Z
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2024-07-12T21:17:22Z
    Status:                True
    Type:                  Launched
    Last Transition Time:  2024-07-12T21:18:26Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2024-07-12T21:17:47Z
    Status:                True
    Type:                  Registered
jmdeal commented 2 months ago

This NodeClaim should have eventually been removed by the garbage collection controller, and I believe this case should also have been caught by the interruption controller. Have you configured an interruption queue for Karpenter? Also, how long after the instance termination was the NodeClaim still around? Are you able to share logs from the event?
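(For context, a minimal sketch of pointing the 0.36.x chart at an interruption queue, assuming an SQS queue already exists and is wired to the usual EventBridge rules; the queue name is a placeholder:)

    # Hypothetical example: enable interruption handling by naming the SQS queue.
    helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
      --namespace kube-system \
      --reuse-values \
      --set settings.interruptionQueue=<your-interruption-queue-name>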

engedaam commented 2 months ago

@ra-grover any updates here?

ra-grover commented 2 months ago

Apologies for missing this one.

> Have you configured an interruption queue for Karpenter?

We don't have an interruption queue configured for Karpenter, but we do have aws-node-termination-handler running in SQS mode.

> Also, how long after the instance termination was the NodeClaim still around? Are you able to share logs from the event?

The NodeClaim was still around roughly 15 days after the instance was stopped. I don't think I have access to logs for that node, and there are no Kubernetes events in our event management system either, except this one:

1 FailedScheduling: Failed to schedule pod, would schedule against a non-initialized node general-provisioner-mssqp

Also, when I started the instance manually from the AWS console, the node came back up fine in the Kubernetes cluster.
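As a stopgap, we can clear the stale NodeClaim by hand (a hedged sketch; deleting it should let Karpenter's termination flow clean up and reschedule the pending pods):

    # Manually delete the orphaned NodeClaim; replacement capacity will be
    # provisioned if the pending pods still need it.
    kubectl delete nodeclaim general-provisioner-mssqp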