Happened again and resulted in stuck scaling of the cluster. This is the state of the NodeClaim that caused the error; its finalizer had to be removed by hand:
```yaml
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    karpenter.sh/nodeclaim-termination-timestamp: "2024-08-29T13:13:36Z"
    karpenter.sh/nodepool-hash: "11981896997958051328"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-29T11:13:36Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-08-29T11:13:36Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-ondemand-m5a-
  generation: 2
  labels:
    karpenter.sh/nodepool: default-ondemand-m5a
  name: default-ondemand-m5a-sjw2v
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: default-ondemand-m5a
    uid: 4c16ef32-8ea2-4449-8785-c76d4c3fc989
  resourceVersion: "60892317"
  uid: c4504a38-a103-4627-8fa0-cb048c9a6c71
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: default-ondemand
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5a.2xlarge
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values:
    - 2xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eu-central-1a
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - m5a
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default-ondemand-m5a
  resources:
    requests:
      cpu: 1299m
      memory: "14336527000"
      pods: "74"
  terminationGracePeriod: 2h0m0s
status:
  conditions:
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: object is awaiting reconciliation
    reason: AwaitingReconciliation
    status: Unknown
    type: Launched
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Initialized=Unknown, Registered=Unknown, Launched=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Registered
```
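For anyone checking whether they are affected: a quick way to spot NodeClaims wedged in this state (deletionTimestamp set, but still held by the `karpenter.sh/termination` finalizer) is a jq filter over the NodeClaim list. This is just a sketch, not an official diagnostic:

```sh
# List NodeClaims that are being deleted but are still held back by the
# karpenter.sh/termination finalizer (sketch; adjust for your cluster).
kubectl get nodeclaims -o json \
  | jq -r '.items[]
      | select(.metadata.deletionTimestamp != null)
      | select(.metadata.finalizers // [] | index("karpenter.sh/termination"))
      | .metadata.name'
```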
edit: Just found this on another cluster, with 5 NodeClaims for spot instances.
Closing as a duplicate of https://github.com/kubernetes-sigs/karpenter/issues/1578
Description
Observed Behavior:
A stuck update of a single NodeClaim halts the launch of new nodes.
To unblock the system I had to remove the finalizer from the NodeClaim by hand; after that, everything started working again.
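For reference, the manual workaround amounted to patching the finalizer off the stuck object. The object name below is taken from the manifest above, and the JSON-patch form is just one way to do it:

```sh
# Remove the finalizers from the stuck NodeClaim so the API server can
# complete its deletion (manual workaround, use with care).
kubectl patch nodeclaim default-ondemand-m5a-sjw2v --type=json \
  -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
```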
Expected Behavior:
Everything runs fine
Reproduction Steps (Please include YAML):
I don't know; I'm guessing it's a race condition somewhere.
Versions:
Chart Version: 1.0.1
Kubernetes Version (kubectl version): v1.30.3-eks-2f46c53

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.