ra-grover opened 2 months ago
Conditions on the NodeClaim:

```
Conditions:
  Last Transition Time:  2024-08-01T17:00:20Z
  Reason:                AMIDrift
  Severity:              Warning
  Status:                True
  Type:                  Drifted

  Last Transition Time:  2024-08-01T21:17:21Z
  Severity:              Warning
  Status:                True
  Type:                  Expired

  Last Transition Time:  2024-07-12T21:18:26Z
  Status:                True
  Type:                  Initialized

  Last Transition Time:  2024-07-12T21:17:22Z
  Status:                True
  Type:                  Launched

  Last Transition Time:  2024-07-12T21:18:26Z
  Status:                True
  Type:                  Ready

  Last Transition Time:  2024-07-12T21:17:47Z
  Status:                True
  Type:                  Registered
```
This NodeClaim should have eventually been removed by the garbage collection controller, and I believe this should have been caught by the interruption controller as well. Have you configured an interruption queue for Karpenter? Also, how long after the instance termination was the NodeClaim still around? Are you able to share logs from the event?
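For anyone debugging this, a rough way to replicate the garbage collection controller's check by hand is to compare the claim's provider ID against the live instance state (a sketch; the instance ID is illustrative, the claim name is taken from the scheduling event below, and `status.providerID` assumes a v1beta1+ NodeClaim):

```
# Fetch the provider ID recorded on the NodeClaim
kubectl get nodeclaim general-provisioner-mssqp -o jsonpath='{.status.providerID}'
# e.g. aws:///us-east-1a/i-0123456789abcdef0

# Check the instance state directly; "stopped" or "terminated" here while
# the NodeClaim still exists means the claim has been orphaned
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].State.Name' --output text
```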
@ra-grover any updates here?
Apologies for missing this one.
> Have you configured an interruption queue for Karpenter?
We don't have an interruption queue configured for Karpenter, but we do have the AWS Node Termination Handler running in SQS mode.
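For reference, a minimal sketch of enabling Karpenter's native interruption handling, assuming a recent chart where the setting is `settings.interruptionQueue` (older charts used `settings.aws.interruptionQueueName`); the queue name here is illustrative:

```
# Point Karpenter at an SQS queue that receives EC2 health and state-change events
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --reuse-values \
  --set settings.interruptionQueue=karpenter-interruption-queue
```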
> Also, how long after the instance termination was the NodeClaim still around? Are you able to share logs from the event?
The NodeClaim was still around roughly 15 days after the instance was stopped. I don't think I have access to logs for that node, and there are no Kubernetes events in our event management system either, except this one:

```
1 FailedScheduling: Failed to schedule pod, would schedule against a non-initialized node general-provisioner-mssqp
```
Also, when I started the instance manually from the AWS console, the node came up fine in the Kubernetes cluster.
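(For completeness, the manual recovery amounts to one of the following; the instance ID is illustrative and the claim name is taken from the scheduling event above:)

```
# Either restart the stopped instance so the node rejoins the cluster ...
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# ... or delete the orphaned claim so pending pods can reschedule elsewhere
kubectl delete nodeclaim general-provisioner-mssqp
```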
Description
Observed Behavior: A NodeClaim belonging to an AWS instance that was stopped with the reason

```
Server.ScheduledStop: Stopped due to scheduled retirement
```

was not cleared from the cluster. The node backing the claim was no longer in the cluster, yet Karpenter tried to schedule a pod against that specific NodeClaim even though the underlying node was gone.

Expected Behavior: If a node is not ready for any reason, Karpenter should remove the NodeClaim so that pods can be scheduled correctly.
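A quick way to spot such orphans, assuming a Karpenter version whose NodeClaim printer columns include the backing node, is to compare claims against nodes:

```
# A claim whose node does not appear in `kubectl get nodes` is orphaned
kubectl get nodeclaims -o wide
kubectl get nodes
```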
Reproduction Steps (Please include YAML):

Have the instance backing a NodeClaim get stopped for

```
Server.ScheduledStop: Stopped due to scheduled retirement
```

(Not sure if it can be explicitly done.)

```
$ k get nodes ip-10-112-148-87.ec2.internal
Error from server (NotFound): nodes "ip-10-112-148-87.ec2.internal" not found
```
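Since scheduled retirement can't be triggered on demand, stopping the backing instance out-of-band may approximate the same end state (a sketch; the instance ID is illustrative):

```
# Stop the instance outside of Karpenter, mimicking the scheduled stop
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Once the Node object is gone, check whether the NodeClaim lingers
kubectl get nodes
kubectl get nodeclaims
```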