aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

NodeClaim stuck in 'Unknown' (Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim) #7435

Closed: keoren3 closed this issue 2 days ago

keoren3 commented 3 days ago

Description

Observed Behavior: Karpenter launches a new EC2 instance, but it never joins the EKS cluster. Instead the NodeClaim is stuck in status 'Unknown' with the message: "Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim"

Expected Behavior: The new node registers with the EKS cluster.

Reproduction Steps (Please include YAML): EC2NodeClass + NodePool

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: karpenter
  namespace: kube-system
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: karpenter
      expireAfter: 720h  # 30 * 24h = 720h
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: karpenter
  namespace: kube-system
spec:
  amiFamily: AL2  # Amazon Linux 2
  role: "KarpenterNodeRole-<cluster>"  # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        Environment: test
        Tier: public
  securityGroupSelectorTerms:
    - tags:
        "aws:eks:cluster-name": <cluster>
  amiSelectorTerms:
    - id: ami-00710ab8f493e2428
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: false
        deleteOnTermination: true
        snapshotID: snap-0ec4fd6705eea533e
  tags:
    env: test
    Name: balance-test-karpenter
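
For completeness, a minimal way to apply and sanity-check these two resources (a sketch; it assumes both manifests are saved together as karpenter.yaml, which is just an example file name):

# Apply the NodePool and EC2NodeClass (both are cluster-scoped in Karpenter v1)
kubectl apply -f karpenter.yaml

# Confirm they were accepted and inspect how the EC2NodeClass resolved AMIs, subnets and security groups
kubectl get nodepools,ec2nodeclasses
kubectl describe ec2nodeclass karpenter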

Extra info: I'm trying to replace my Cluster Autoscaler with Karpenter. I gave the Karpenter nodes the exact same:

  1. IAM role.
  2. Security group.
  3. EBS volume (based on the same snapshot).
  4. AMI.

I've added the required role mapping to the aws-auth ConfigMap:

- groups:
  - system:nodes
  - system:bootstrappers
  rolearn: arn:aws:iam::<account_id>:role/KarpenterNodeRole-<cluster>
  username: system:node:{{EC2PrivateDNSName}}
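
A quick way to verify the mapping actually landed (a sketch; it assumes eksctl is configured for this cluster, though reading the ConfigMap directly works just as well):

# Inspect the rendered aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Or list the identity mappings via eksctl
eksctl get iamidentitymapping --cluster <cluster>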

I've logged into the new EKS worker node and ran journalctl -u kubelet, but no entries appeared there at all.
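
Since the kubelet journal is empty, the next thing worth checking on the instance is whether the bootstrap/user-data ever ran (a sketch for the AL2 AMI family; log paths can differ on other AMIs):

# Did cloud-init / the EKS bootstrap script run at all?
sudo tail -n 50 /var/log/cloud-init-output.log
sudo journalctl -u cloud-init --no-pager | tail -n 50

# Is the kubelet unit present and enabled?
systemctl status kubelet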

I tried changing the role's name, adding permissions to the role, and adding rules to the security group. Nothing helped; the nodes simply refuse to connect.

Karpenter logs:

{"level":"INFO","time":"2024-11-25T15:53:54.818Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"6703d662-e9b1-4f99-9c55-e72f0aaa6b7e","Pods":"over-provisioning/over-provisioning-6d568b6cf8-7tqjr, over-provisioning/over-provisioning-6d568b6cf8-8p8bf","duration":"181.987056ms"}
{"level":"INFO","time":"2024-11-25T15:53:54.818Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"6703d662-e9b1-4f99-9c55-e72f0aaa6b7e","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-11-25T15:53:54.819Z","logger":"controller","message":"computed 1 unready node(s) will fit 1 pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"6703d662-e9b1-4f99-9c55-e72f0aaa6b7e"}
{"level":"INFO","time":"2024-11-25T15:53:54.842Z","logger":"controller","message":"created nodeclaim","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"6703d662-e9b1-4f99-9c55-e72f0aaa6b7e","NodePool":{"name":"karpenter"},"NodeClaim":{"name":"karpenter-wxbnk"},"requests":{"cpu":"1780m","memory":"2418Mi","pods":"6"},"instance-types":"c4.large, c5.large, c5.xlarge, c5a.large, c5a.xlarge and 55 other(s)"}
{"level":"INFO","time":"2024-11-25T15:53:58.329Z","logger":"controller","message":"launched nodeclaim","commit":"a2875e3","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"karpenter-wxbnk"},"namespace":"","name":"karpenter-wxbnk","reconcileID":"01c8fd77-f4cb-4572-a600-333737c2caeb","provider-id":"aws:///us-east-2b/i-052e9c4c5d91c8767","instance-type":"c7i-flex.large","zone":"us-east-2b","capacity-type":"spot","allocatable":{"cpu":"1930m","ephemeral-storage":"89Gi","memory":"3114Mi","pods":"29"}}

(No errors AFAIK)
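
From the cluster side, the NodeClaim's status conditions are usually the most informative place to see where registration stalls (a sketch; the NodeClaim name is taken from the log above):

# List NodeClaims and their current state
kubectl get nodeclaims -o wide

# The conditions show whether Launched / Registered / Initialized succeeded
kubectl describe nodeclaim karpenter-wxbnk

# Check whether the instance ever showed up as a Node object
kubectl get nodes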

The Cluster Autoscaler still works, though: I raise the deployment replicas to 1, and everything comes up as expected.

Is this a bug, or am I missing something?

I've looked at all the other topics about this; the solutions are all along the lines of "Oh, I missed some tag." I've checked all the tags again and again, and that's not the issue here.
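
For the tag-related suggestions, one way to confirm that the selector terms in the EC2NodeClass match real resources (a sketch; it assumes the AWS CLI is configured for the same account and region):

# Subnets matched by the subnetSelectorTerms tags
aws ec2 describe-subnets \
  --filters "Name=tag:Environment,Values=test" "Name=tag:Tier,Values=public" \
  --query "Subnets[].SubnetId"

# Security groups matched by the securityGroupSelectorTerms tag
aws ec2 describe-security-groups \
  --filters "Name=tag:aws:eks:cluster-name,Values=<cluster>" \
  --query "SecurityGroups[].GroupId"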

Any help would be great.

Versions:

keoren3 commented 2 days ago

I was able to fix this. Here's what I did:

  1. I created a different node group (and didn't reuse the old one that my Cluster Autoscaler used).
  2. I updated the AMI. Instead of the ID my ASG used, I now select the latest EKS-optimized image by name:
    amiSelectorTerms:
      - name: amazon-eks-node-1.31-*
    # instead of the pinned id: ami-00710ab8f493e2428
  3. Removed the old snapshot and just let Karpenter create a fresh gp3 volume, and removed the EBS encryption setting.
  4. Updated aws-auth:
    1. Removed the 'KarpenterControllerRole-' IAM identity mapping (it isn't required there; I don't know why I added it in the first place).
    2. Deleted and re-added the 'KarpenterNodeRole-' IAM identity mapping. I think I was missing the line 'username: system:node:{{EC2PrivateDNSName}}' (see the eksctl sketch at the end of this comment).

TBH, I'm not sure which change actually made it work; it might have been more than one thing.
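
For item 4, a rough sketch of the delete-and-re-add, assuming eksctl is used to manage aws-auth (editing the ConfigMap by hand works too); the ARN placeholders match the ones earlier in this issue:

# Remove the existing mapping for the node role
eksctl delete iamidentitymapping --cluster <cluster> \
  --arn arn:aws:iam::<account_id>:role/KarpenterNodeRole-<cluster>

# Re-add it with the node groups and the username line
eksctl create iamidentitymapping --cluster <cluster> \
  --arn arn:aws:iam::<account_id>:role/KarpenterNodeRole-<cluster> \
  --group system:bootstrappers --group system:nodes \
  --username "system:node:{{EC2PrivateDNSName}}"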