aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

"error": "removing termination finalizer, NodeClaim.karpenter.sh is invalid: spec: Invalid Value "object": spec is immutable #6895

Closed woehrl01 closed 1 week ago

woehrl01 commented 2 weeks ago

Description

Observed Behavior:

A stuck update of a single NodeClaim halts the launch of any new nodes.

Screenshot 2024-08-29 at 15 34 53

To unblock the system I had to remove the NodeClaim's finalizer by hand. After that, everything started working again.
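
For anyone hitting the same state: the manual workaround amounts to patching the `karpenter.sh/termination` finalizer off the NodeClaim. Below is a minimal Python sketch of the patch body, assuming a JSON merge patch is used. `finalizer_removal_patch` is a hypothetical helper written for illustration, not part of Karpenter, and the NodeClaim name is the one from the manifest posted later in this thread. Removing the finalizer presumably bypasses Karpenter's own termination handling, so it is worth checking by hand that no instance is left behind.

```python
import json

# Finalizer name as it appears on the stuck NodeClaim in this issue.
TERMINATION_FINALIZER = "karpenter.sh/termination"


def finalizer_removal_patch(nodeclaim, finalizer=TERMINATION_FINALIZER):
    """Build a JSON merge patch body that drops `finalizer` from the object.

    Hypothetical helper for illustration only.
    """
    current = nodeclaim.get("metadata", {}).get("finalizers") or []
    remaining = [f for f in current if f != finalizer]
    # A JSON merge patch replaces lists wholesale, so the full
    # remaining list is sent rather than a single "remove" entry.
    return {"metadata": {"finalizers": remaining}}


if __name__ == "__main__":
    # Hand-written stand-in for the stuck object; only the fields needed here.
    stuck = {
        "metadata": {
            "name": "default-ondemand-m5a-sjw2v",
            "finalizers": [TERMINATION_FINALIZER],
        }
    }
    print(json.dumps(finalizer_removal_patch(stuck)))
```

The printed JSON could then be passed to something like `kubectl patch nodeclaim <name> --type merge -p '<json>'`.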

Expected Behavior:

Everything runs fine

Reproduction Steps (Please include YAML):

Unknown. I guess it's a race condition somewhere.

Versions:

woehrl01 commented 2 weeks ago

This happened again and left cluster scaling stuck. Below is the state of the NodeClaim that caused the error; its finalizer had to be removed by hand.

apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    karpenter.sh/nodeclaim-termination-timestamp: "2024-08-29T13:13:36Z"
    karpenter.sh/nodepool-hash: "11981896997958051328"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-29T11:13:36Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-08-29T11:13:36Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-ondemand-m5a-
  generation: 2
  labels:
    karpenter.sh/nodepool: default-ondemand-m5a
  name: default-ondemand-m5a-sjw2v
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: default-ondemand-m5a
    uid: 4c16ef32-8ea2-4449-8785-c76d4c3fc989
  resourceVersion: "60892317"
  uid: c4504a38-a103-4627-8fa0-cb048c9a6c71
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: default-ondemand
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5a.2xlarge
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values:
    - 2xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eu-central-1a
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - m5a
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default-ondemand-m5a
  resources:
    requests:
      cpu: 1299m
      memory: "14336527000"
      pods: "74"
  terminationGracePeriod: 2h0m0s
status:
  conditions:
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: object is awaiting reconciliation
    reason: AwaitingReconciliation
    status: Unknown
    type: Launched
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Initialized=Unknown, Registered=Unknown, Launched=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-08-29T11:13:36Z"
    message: Node not registered with cluster
    reason: NodeNotFound
    status: Unknown
    type: Registered

Edit: I just found the same problem in another cluster, affecting 5 NodeClaims for spot instances.
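
To find other NodeClaims in the same state, the signature visible in the manifest above can be checked mechanically: a deletion timestamp is set, the termination finalizer is still present, and the `Launched` condition never became `True`. The Python sketch below encodes that check against a hand-transcribed subset of the manifest. It is a heuristic based only on this one object, not Karpenter's own logic, and `looks_stuck` and `parse_ts` are hypothetical names. It also confirms that the `nodeclaim-termination-timestamp` annotation equals the deletion timestamp plus the 2h `terminationGracePeriod`.

```python
from datetime import datetime, timedelta

TERMINATION_FINALIZER = "karpenter.sh/termination"


def parse_ts(value):
    """Parse a Kubernetes-style timestamp such as 2024-08-29T11:13:36Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def looks_stuck(nodeclaim):
    """Heuristic (assumption): deleting, finalizer present, never launched."""
    meta = nodeclaim.get("metadata", {})
    conditions = {
        c["type"]: c["status"]
        for c in nodeclaim.get("status", {}).get("conditions", [])
    }
    return (
        "deletionTimestamp" in meta
        and TERMINATION_FINALIZER in meta.get("finalizers", [])
        and conditions.get("Launched") != "True"
    )


# Subset of the manifest above, transcribed by hand.
nodeclaim = {
    "metadata": {
        "annotations": {
            "karpenter.sh/nodeclaim-termination-timestamp": "2024-08-29T13:13:36Z",
        },
        "creationTimestamp": "2024-08-29T11:13:36Z",
        "deletionTimestamp": "2024-08-29T11:13:36Z",
        "finalizers": [TERMINATION_FINALIZER],
    },
    "status": {
        "conditions": [
            {"type": "Initialized", "status": "Unknown"},
            {"type": "Launched", "status": "Unknown"},
            {"type": "Ready", "status": "Unknown"},
            {"type": "Registered", "status": "Unknown"},
        ]
    },
}

meta = nodeclaim["metadata"]
# terminationGracePeriod is 2h0m0s in the manifest.
deadline = parse_ts(meta["deletionTimestamp"]) + timedelta(hours=2)
annotation = parse_ts(
    meta["annotations"]["karpenter.sh/nodeclaim-termination-timestamp"]
)

print("looks stuck:", looks_stuck(nodeclaim))
print("annotation matches deletion + grace period:", deadline == annotation)
```

Note that the object was marked for deletion in the same second it was created, which fits the race-condition guess.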

engedaam commented 1 week ago

Closing as a duplicate of https://github.com/kubernetes-sigs/karpenter/issues/1578