aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

no nodepools found Error for new install of 1.0.5 #7436

Closed: JoeSpiral closed this issue 3 days ago

JoeSpiral commented 3 days ago

Description

Observed Behavior: I have a new EKS 1.31 cluster with Karpenter 1.0.5 installed via the Terraform eks-blueprints module. I have similar clusters up and running successfully that were installed the same way, but they are on Kubernetes 1.29 with Karpenter 1.0.1. New nodes are not being created when scale-up is needed. Below is the log.

{"level":"INFO","time":"2024-11-25T15:49:49.205Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"metrics.node","worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.198Z","logger":"controller","message":"Starting EventSource","commit":"652e6aa","controller":"metrics.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","source":"kind s
ource: *v1.NodePool"}
{"level":"INFO","time":"2024-11-25T15:49:49.205Z","logger":"controller","message":"Starting Controller","commit":"652e6aa","controller":"metrics.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool"}
{"level":"INFO","time":"2024-11-25T15:49:49.202Z","logger":"controller","message":"Starting EventSource","commit":"652e6aa","controller":"status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","source":"kind so
urce: *v1.EC2NodeClass"}
{"level":"INFO","time":"2024-11-25T15:49:49.205Z","logger":"controller","message":"Starting Controller","commit":"652e6aa","controller":"status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass"}
{"level":"INFO","time":"2024-11-25T15:49:49.410Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"migration.crd","controllerGroup":"apiextensions.k8s.io","controllerKind":"CustomResourceDefinition"
,"worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.411Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"state.pod","controllerGroup":"","controllerKind":"Pod","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodepool.readiness","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"lease.garbagecollection","controllerGroup":"coordination.k8s.io","controllerKind":"Lease","worker c
ount":10}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.consistency","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count"
:10}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"node.termination","controllerGroup":"","controllerKind":"Node","worker count":100}
{"level":"INFO","time":"2024-11-25T15:49:49.412Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"state.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.414Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodepool.counter","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.414Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"metrics.pod","controllerGroup":"","controllerKind":"Pod","worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.414Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodepool.validation","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10
}
{"level":"INFO","time":"2024-11-25T15:49:49.414Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":1
000}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count"
:100}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"status","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.disruption","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":
10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"status","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"migration.resource.nodeclaim","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker
 count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.expiration","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":
1}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"migration.resource.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker c
ount":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"provisioner.trigger.pod","controllerGroup":"","controllerKind":"Pod","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodepool.hash","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.416Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"provisioner.trigger.node","controllerGroup":"","controllerKind":"Node","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.418Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"state.daemonset","controllerGroup":"apps","controllerKind":"DaemonSet","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.422Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"state.node","controllerGroup":"","controllerKind":"Node","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.423Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclass.hash","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","worker count
":10}
{"level":"INFO","time":"2024-11-25T15:49:49.423Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","worker cou
nt":10}
{"level":"INFO","time":"2024-11-25T15:49:49.423Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclaim.tagging","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.423Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"nodeclass.termination","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","worke
r count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.423Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"migration.resource.ec2nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeCla
ss","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.424Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"state.nodeclaim","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":10}
{"level":"INFO","time":"2024-11-25T15:49:49.433Z","logger":"controller","message":"Starting workers","commit":"652e6aa","controller":"metrics.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":1}
{"level":"INFO","time":"2024-11-25T15:49:49.514Z","logger":"controller","message":"discovered ssm parameter","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC
2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"585d48f9-9c71-4e23-9ef2-54c998f0016e","parameter":"/aws/service/eks/optimized-ami/1.31/amazon-linux-2/recommended/image_id","value":"ami-0541903c03df81c16"
}
{"level":"INFO","time":"2024-11-25T15:49:49.537Z","logger":"controller","message":"discovered ssm parameter","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC
2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"585d48f9-9c71-4e23-9ef2-54c998f0016e","parameter":"/aws/service/eks/optimized-ami/1.31/amazon-linux-2-arm64/recommended/image_id","value":"ami-0104decf27a3
167ca"}
{"level":"INFO","time":"2024-11-25T15:49:49.554Z","logger":"controller","message":"discovered ssm parameter","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC
2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"585d48f9-9c71-4e23-9ef2-54c998f0016e","parameter":"/aws/service/eks/optimized-ami/1.31/amazon-linux-2-gpu/recommended/image_id","value":"ami-07a684ca16d05d
a49"}
{"level":"INFO","time":"2024-11-25T15:49:50.458Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconci
leID":"b9b49daf-3974-49ce-a85c-f3424b99356b","pods":"argocd/argocd-application-controller-0, argocd/argocd-server-6f848b85cb-95j48"}
{"level":"ERROR","time":"2024-11-25T15:49:50.458Z","logger":"controller","message":"nodePool not ready","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"b9b49daf-3974-49ce-a85c-f3424b99356b","Node
Pool":{"name":"default"}}
{"level":"INFO","time":"2024-11-25T15:49:50.458Z","logger":"controller","message":"no nodepools found","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"b9b49daf-3974-49ce-a85c-f3424b99356b"}
{"level":"ERROR","time":"2024-11-25T15:50:00.474Z","logger":"controller","message":"nodePool not ready","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"c9be084b-0855-4fae-9279-60cd53a1c78c","Node
Pool":{"name":"default"}}
{"level":"INFO","time":"2024-11-25T15:50:00.474Z","logger":"controller","message":"no nodepools found","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"c9be084b-0855-4fae-9279-60cd53a1c78c"}
{"level":"ERROR","time":"2024-11-25T15:50:10.476Z","logger":"controller","message":"nodePool not ready","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"d56b7ee3-4450-4b0f-a1df-5d64aa84f14f","Node
Pool":{"name":"default"}}
{"level":"INFO","time":"2024-11-25T15:50:10.476Z","logger":"controller","message":"no nodepools found","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"d56b7ee3-4450-4b0f-a1df-5d64aa84f14f"}

Expected Behavior: New nodes would be spun up.

Reproduction Steps (Please include YAML): aws-auth

kubectl get configmaps -n kube-system aws-auth -o yaml
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "rolearn": "arn:aws:iam::myaccount:role/karpenter-us-east-2-newcluster-prod1"
      "username": "system:node:{{EC2PrivateDNSName}}"

Configs

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  namespace: karpenter
  finalizers:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    app.kubernetes.io/instance: karpenter-spiral-siaas-prod1-blue
    app.kubernetes.io/managed-by: Helm
  name: default
  namespace: karpenter
spec:
  disruption:
    budgets:
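
For context, NodePool and EC2NodeClass are cluster-scoped in the v1 API, and their readiness is surfaced as status conditions; a quick way to check them is sketched below (using the default names from the configs above):

kubectl get nodepools
kubectl get ec2nodeclasses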

Versions:

JoeSpiral commented 3 days ago

An update: my ec2nodeclass is not ready, which appears to be the root of the issue, but I am unsure why.

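The events below look like kubectl describe output for the EC2NodeClass; the same status conditions can also be read directly, e.g. (a sketch, assuming the nodeclass is named default):

kubectl describe ec2nodeclass default
kubectl get ec2nodeclass default -o jsonpath='{.status.conditions}'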

  Type    Reason                Age   From       Message
  ----    ------                ----  ----       -------
  Normal  AMIsReady             38s   karpenter  Status condition transitioned, Type: AMIsReady, Status: True -> Unknown, Reason: AwaitingReconciliation, Message: object is awaiting reconciliation
  Normal  Ready                 38s   karpenter  Status condition transitioned, Type: Ready, Status: False -> Unknown, Reason: UnhealthyDependents, Message: InstanceProfileReady=Unknown, SecurityGroupsReady=Unknown, SubnetsReady=Unknown, AMIsReady=Unknown
  Normal  SecurityGroupsReady   38s   karpenter  Status condition transitioned, Type: SecurityGroupsReady, Status: False -> Unknown, Reason: AwaitingReconciliation, Message: object is awaiting reconciliation
  Normal  SubnetsReady          38s   karpenter  Status condition transitioned, Type: SubnetsReady, Status: True -> Unknown, Reason: AwaitingReconciliation, Message: object is awaiting reconciliation
  Normal  AMIsReady             38s   karpenter  Status condition transitioned, Type: AMIsReady, Status: Unknown -> True, Reason: AMIsReady
  Normal  InstanceProfileReady  38s   karpenter  Status condition transitioned, Type: InstanceProfileReady, Status: Unknown -> True, Reason: InstanceProfileReady
  Normal  Ready                 38s   karpenter  Status condition transitioned, Type: Ready, Status: Unknown -> False, Reason: UnhealthyDependents, Message: SecurityGroupsReady=False
  Normal  SecurityGroupsReady   38s   karpenter  Status condition transitioned, Type: SecurityGroupsReady, Status: Unknown -> False, Reason: SecurityGroupsNotFound, Message: SecurityGroupSelector did not match any SecurityGroups
  Normal  SubnetsReady          38s   karpenter  Status condition transitioned, Type: SubnetsReady, Status: Unknown -> True, Reason: SubnetsReady

keoren3 commented 3 days ago

This is your issue:

Normal SecurityGroupsReady 38s karpenter Status condition transitioned, Type: SecurityGroupsReady, Status: Unknown -> False, Reason: SecurityGroupsNotFound, Message: SecurityGroupSelector did not match any SecurityGroups

I can see you're looking for this sg:

securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: us-east-2-newcluster-prod1

Did you add these tags to an SG in the same VPC?
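
One way to check is to query for security groups carrying that discovery tag in the cluster's VPC, e.g. (a rough sketch; region and tag value taken from the selector above):

aws ec2 describe-security-groups \
  --region us-east-2 \
  --filters "Name=tag:karpenter.sh/discovery,Values=us-east-2-newcluster-prod1" \
  --query "SecurityGroups[].[GroupId,GroupName]" \
  --output table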

JoeSpiral commented 3 days ago

I'm not sure how I missed that but you are 100% correct. Thanks!!
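
For anyone else hitting this: the missing tag can be added to an existing security group with something like the following (a sketch; the security group ID is a placeholder):

aws ec2 create-tags \
  --resources sg-0123456789abcdef0 \
  --tags Key=karpenter.sh/discovery,Value=us-east-2-newcluster-prod1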

irfn commented 2 days ago

I have a similar case; however, my ec2nodeclass has SubnetsReady=True, SecurityGroupsReady=True, and InstanceProfileReady=True, while AMIsReady is stuck in AwaitingReconciliation.

I tried various amiFamily options (AL2, AL2023) and even tried specifying the image ID directly.

e.g.:

  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@v20241121

I also tried just the alias with no amiFamily:

  amiSelectorTerms:
    - alias: al2023@latest

Any thoughts on what could be the issue? Also, once created, I can't delete an ec2nodeclass even if there are no references to it.

My Helm chart and EKS versions are karpenter 1.0.8 and EKS 1.31.
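
One thing that can be worth ruling out when a condition sits in AwaitingReconciliation is a mismatch between the installed chart and the CRDs, e.g. (a sketch; release name and namespace may differ in your install):

helm list -A | grep karpenter
kubectl get crd ec2nodeclasses.karpenter.k8s.aws -o jsonpath='{.spec.versions[*].name}'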

keoren3 commented 2 days ago

@irfn You can't delete the ec2nc because you probably have nodeclaims. Once you delete all the nodeclaims, you'll be able to remove the ec2nc.

Note that you can't delete any nodeclaims as long as the Karpenter deployment is not running.
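
A quick way to see whether any NodeClaims still reference an EC2NodeClass (a sketch using the nodeClassRef field from the karpenter.sh/v1 NodeClaim spec):

kubectl get nodeclaims
kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.spec.nodeClassRef.name}{"\n"}{end}'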

Maybe that reconciliation error is because the same ec2nc is trying to use 2 different AMIs?

Try deleting all the nodeclaims and the ec2nc. Once they're gone, re-create it with AL2 (that worked for me with the exact same settings). This is my ec2nc YAML (created using Terraform):

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: karpenter
  namespace: kube-system
spec:
  amiFamily: AL2  # Amazon Linux 2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"  # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        Environment: test
        Tier: private
  securityGroupSelectorTerms:
    - tags:
        "aws:eks:cluster-name": ${CLUSTER_NAME}
  amiSelectorTerms:
    - name: amazon-eks-node-1.31-*
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: false
        deleteOnTermination: true
  tags:
    env: test
    Name: ${CLUSTER_NAME}-karpenter

irfn commented 1 day ago

@keoren3 Like I mentioned, there are no NodeClaims or any other references to this EC2NodeClass. I also tried the name-pattern amiSelectorTerms as well as direct amiId references.

Here is one example of the ec2nodeclasses I tried, with a similar name pattern. The pattern is verified via:

aws ec2 describe-images --image-ids ami-00516539f0211c275 etc.
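
Note that describe-images here runs with local CLI credentials, while the controller resolves amiSelectorTerms using its own IAM role; it can be worth checking that role as well, e.g. (a sketch; the role ARN is a placeholder):

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/test-karpenter-controller \
  --action-names ec2:DescribeImages ssm:GetParameter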

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: karpenter-nc2-al2023
spec:
  instanceProfile: test-karpenter-node-instance-profile
  amiFamily: AL2023
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: test-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: test-eks
  amiSelectorTerms:
    - name: amazon-eks-node-al2023-x86_64-standard-1.31-*
    - name: amazon-eks-node-al2023-arm64-standard-1.31-*
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 30Gi
        volumeType: gp3
        encrypted: false
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: optional
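
Once the selector resolves, the matched AMIs should appear in the EC2NodeClass status; a quick way to inspect that (a sketch, using the nodeclass name from the YAML above):

kubectl get ec2nodeclass karpenter-nc2-al2023 -o jsonpath='{.status.amis}'
kubectl get ec2nodeclass karpenter-nc2-al2023 -o jsonpath='{.status.conditions}'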

Try deleting all the nodeclaims and the ec2nc,

I'm unable to do this, as I don't have any NodeClaims and cannot delete the ec2nodeclasses.

irfn commented 1 day ago

This is fixed. It was an issue in my Terraform code: the controller IAM role was missing the permission for EC2 image describe.
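
For completeness, the fix in that situation is typically to grant the Karpenter controller role the read permissions used for AMI resolution; a minimal sketch of the relevant IAM policy statement (the upstream controller policy includes a broader action list):

{
  "Effect": "Allow",
  "Action": [
    "ec2:DescribeImages",
    "ssm:GetParameter"
  ],
  "Resource": "*"
}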