aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter 1.0.2 fails to remove stale nodeclaims leftover from 0.37.3 upgrade #6981

Open dcherniv opened 1 month ago

dcherniv commented 1 month ago

Description

Observed Behavior: After upgrading to 1.0.2 following the upgrade path at https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure, the webhook didn't properly convert the NodeClaims. As a result, a number of claims are failing:

[
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-n8fjh\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-wfcrj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-djh4m\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-mckmt\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-jpnwj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-9sdns\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-hbf52\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  }
]
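
For reference, a quick way to enumerate the claims that lost their group (a sketch, not part of the original report; assumes kubectl and jq are available):

# Hypothetical diagnostic: list NodeClaims whose spec.nodeClassRef.group is unset.
kubectl get nodeclaims.karpenter.sh -o json \
  | jq -r '.items[] | select(.spec.nodeClassRef.group == null or .spec.nodeClassRef.group == "") | .metadata.name'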

These NodeClaims cannot be removed by force, because Karpenter 1.0.2 fails to delete them due to the error above, and they cannot be edited to remove the finalizer because they are immutable.
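
For illustration, these are the kinds of operations that fail (a sketch using a claim name from the error output above; exact messages may differ):

# Deleting hangs because the claim's termination finalizer is never removed.
kubectl delete nodeclaim default-arm64-n8fjh
# Stripping the finalizer by hand is also rejected, since the object no longer
# passes v1 schema validation (spec.nodeClassRef.group: Required value).
kubectl patch nodeclaim default-arm64-n8fjh --type merge -p '{"metadata":{"finalizers":[]}}'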

Expected Behavior: Old NodeClaims left over from 0.37.3 are properly cleaned up and replaced by new NodeClaims.

Reproduction Steps (Please include YAML): Follow the upgrade path; see https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2342521512

Versions:

dcherniv commented 1 month ago

Worked around this issue via https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344833535
It is very painful.

jonathan-innis commented 1 month ago

Can you share the NodeClaim that existed prior to the conversion, as well as the config for the CRD and the deployment? This looks like it should have been defaulted by the conversion webhook, so I'm surprised that it converted over without the group being set.
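
(For anyone gathering the same data, something like the following should surface it; the claim name and namespace here are examples and may differ per install:)

kubectl get nodeclaim default-arm64-n8fjh -o yaml        # a NodeClaim as currently stored
kubectl get crd nodeclaims.karpenter.sh -o yaml          # CRD config, including .spec.conversion
kubectl get deployment karpenter -n kube-system -o yaml  # controller deployment config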

dcherniv commented 1 month ago

I have more explanation here: https://github.com/aws/karpenter-provider-aws/issues/6847#issuecomment-2344833535
While on 1.0.2 the NodeClaims were identical; some of them just didn't have the group set for some reason. The NodePool manifest that was deployed at the time:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default-arm64
spec:
  disruption:
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '36'
  template:
    metadata:
      labels:
        arch: arm64
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m7g
            - m6g
            - c7g
            - c6g
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '2'
            - '4'
            - '8'
            - '16'
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
      taints:
        - effect: NoSchedule
          key: arch
          value: arm64
  weight: 50

The CRDs were from version 1.0.2 of the karpenter-crd chart, installed into the karpenter namespace.
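
(As a side note, not from the original thread: one way to confirm the conversion webhook is actually wired into those CRDs is to inspect the conversion stanza, which should show a Webhook strategy with a clientConfig pointing at the Karpenter service in the namespace the karpenter-crd chart was configured for:)

# Example check; expects strategy "Webhook" and a service reference.
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion}'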

Relevant EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 100Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: initial-eks-node-group-20230826183814285900000005
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*private*'
        karpenter.sh/discovery: prod
  tags:
    karpenter.sh/discovery: prod

jonathan-innis commented 1 month ago

just some of them didn't have the group set for some reason

I suspect that they didn't have the group because you changed versions without enabling the conversion webhooks. This causes fields to get dropped, and the apiserver doesn't know how to hydrate the data.

Regardless, I suspect that this fix is going to solve your problem, because we will no longer rely on the spec being configured correctly for the Patch to succeed, allowing you to get unblocked from this issue.

That said, I wouldn't recommend going through the upgrade path without the conversion webhooks, because you may end up in an undefined state.
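
(A minimal sketch of what keeping the webhooks enabled during the 0.37.x step might look like; the webhook.enabled value and chart URL are assumptions based on the upstream chart, so verify them against the chart version you actually run:)

# Assumed chart values; check against your karpenter chart before applying.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "0.37.3" \
  --namespace kube-system \
  --set webhook.enabled=true \
  --set "settings.clusterName=${CLUSTER_NAME}"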

jonathan-innis commented 1 month ago

Are you willing to try installing a snapshot to validate whether this fixes the issue? Here's the command for installing the snapshot version of Karpenter with the patch fix:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-1f88e3ae893d937267c5c28420597ea5be1d77ef" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

ShahroZafar commented 1 month ago

We plan to go from v0.37.3 to v1.0.2 with the webhook enabled. Is this issue resolved, or should we wait for a newer patch version in v1.0?

adawalli commented 1 month ago

I have the issue on 1.0.2, yes. Updating to 1.0.3 did not remove this error.