aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Potential ebs nodeclaim issue? #7241

Open drawnwren opened 1 month ago

drawnwren commented 1 month ago

Description

I'm not sure whether this is #7046, but our production cluster is unable to provision new nodes and we are seeing "state node doesn't contain both a node and a nodeclaim". We were previously having trouble with the ebs-csi-provisioner and had to force delete a node after manually patching its finalizers off with kubectl patch node -p '{"metadata":{"finalizers":null}}' ip-----.us-east-2.compute.internal. We don't see the associated "ERROR" message in our Karpenter logs.
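One rough way to look for the node/NodeClaim mismatch described above is to compare the two object lists directly. The following is a minimal sketch, not Karpenter's own logic: it assumes that Karpenter-managed nodes carry the `karpenter.sh/nodepool` label and that each NodeClaim records its node in `status.nodeName`. The function name `find_mismatches` is hypothetical.

```python
# Assumption: nodes launched by Karpenter carry this label.
KARPENTER_NODEPOOL_LABEL = "karpenter.sh/nodepool"


def find_mismatches(nodes, nodeclaims):
    """Return (nodes_without_claim, claims_without_node).

    `nodes` and `nodeclaims` are the parsed JSON documents produced by
    `kubectl get nodes -o json` and `kubectl get nodeclaims -o json`.
    """
    all_node_names = set()
    karpenter_node_names = set()
    for node in nodes.get("items", []):
        meta = node["metadata"]
        all_node_names.add(meta["name"])
        if KARPENTER_NODEPOOL_LABEL in (meta.get("labels") or {}):
            karpenter_node_names.add(meta["name"])

    # Map each NodeClaim name to the node it says it is bound to (or None).
    claim_to_node = {}
    for claim in nodeclaims.get("items", []):
        # Assumption: the bound node is recorded in status.nodeName.
        node_name = (claim.get("status") or {}).get("nodeName")
        claim_to_node[claim["metadata"]["name"]] = node_name

    claimed_nodes = {name for name in claim_to_node.values() if name}

    # Karpenter-labelled nodes that no NodeClaim points at.
    nodes_without_claim = sorted(karpenter_node_names - claimed_nodes)
    # NodeClaims with no node recorded, or pointing at a node that is gone.
    # A claim with no nodeName may simply still be launching.
    claims_without_node = sorted(
        claim
        for claim, node_name in claim_to_node.items()
        if not node_name or node_name not in all_node_names
    )
    return nodes_without_claim, claims_without_node
```

To use it, save the two `kubectl get ... -o json` outputs to files, load each with `json.load`, and pass the resulting dictionaries in. Any names it returns are candidates for closer inspection, not proof of a bug.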

Here's our nodepool:

Name:         default
Namespace:
Labels:       kustomize.toolkit.fluxcd.io/name=flux-system
              kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:  karpenter.sh/nodepool-hash: 14345729850401713597
              karpenter.sh/nodepool-hash-version: v3
              karpenter.sh/stored-version-migrated: true
API Version:  karpenter.sh/v1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2024-08-26T23:55:00Z
  Generation:          5
  Resource Version:    132372706
  UID:                 694bea35-1b07-4b45-b71e-a647bc9b65a7
Spec:
  Disruption:
    Budgets:
      Nodes:               10%
    Consolidate After:     1h
    Consolidation Policy:  WhenEmptyOrUnderutilized
  Limits:
    Cpu:  100
  Template:
    Metadata:
      Labels:
        Node - Type:  basis-karpenter
    Spec:
      Expire After:  720h
      Node Class Ref:
        Group:  karpenter.k8s.aws
        Kind:   EC2NodeClass
        Name:   default
      Requirements:
        Key:       karpenter.sh/capacity-type
        Operator:  In
        Values:
          spot
          on-demand
        Key:       kubernetes.io/arch
        Operator:  In
        Values:
          amd64
        Key:       node.kubernetes.io/instance-type
        Operator:  In
        Values:
          t3.small
          t3.medium
          t3.large
  Weight:  99
Status:
  Conditions:
    Last Transition Time:  2024-08-26T23:55:00Z
    Message:
    Reason:                NodeClassReady
    Status:                True
    Type:                  NodeClassReady
    Last Transition Time:  2024-08-26T23:55:00Z
    Message:
    Reason:                Ready
    Status:                True
    Type:                  Ready
    Last Transition Time:  2024-08-26T23:55:00Z
    Message:
    Reason:                ValidationSucceeded
    Status:                True
    Type:                  ValidationSucceeded
  Resources:
    Cpu:                  12
    Ephemeral - Storage:  1258229660Ki
    hugepages-1Gi:        0
    hugepages-2Mi:        0
    Memory:               35768548Ki
    Nodes:                6
    Pods:                 117
Events:                   <none>
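The output above also allows a quick sanity check that the NodePool's limits are not what is blocking provisioning. The sketch below assumes Karpenter compares the NodePool's Limits against the usage it reports under Status Resources. The helper names `parse_quantity` and `remaining_headroom` are hypothetical, and the parser handles only plain numbers, millicores (`m`), and binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`), not the full Kubernetes quantity grammar.

```python
# Binary suffixes used by Kubernetes resource quantities (subset only).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity):
    """Convert a simple Kubernetes quantity string to a float."""
    text = str(quantity).strip()
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    if text.endswith("m"):  # millicores, e.g. "500m" -> 0.5
        return float(text[:-1]) / 1000
    return float(text)


def remaining_headroom(limits, resources):
    """Return limit minus current usage for every resource that has a limit."""
    return {
        name: parse_quantity(limit) - parse_quantity(resources.get(name, "0"))
        for name, limit in limits.items()
    }


# Values copied from the NodePool description above.
limits = {"cpu": "100"}
resources = {"cpu": "12", "memory": "35768548Ki", "nodes": "6", "pods": "117"}
headroom = remaining_headroom(limits, resources)
```

With these values the only limited resource is CPU, and 12 of 100 is in use, so the limit itself does not appear to be exhausted.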

We also have a gpu nodepool that appears to be working fine.

Here are the controller logs: karpenter_logs.txt

engedaam commented 1 month ago

Why do you believe this is an issue with the CSI driver? Are you seeing that event for the nodes whose finalizers you patched out? From looking at the logs, this looks to be a duplicate of https://github.com/aws/karpenter-provider-aws/issues/7046

drawnwren commented 1 month ago

We recently had some issues upgrading the Kubernetes and EBS versions (#7200), so my assumption was that this is somehow related. This cluster had been relatively stable for the last 9 months or so, until the upgrades, and now node provisioning is suddenly failing.

engedaam commented 1 month ago

Can you provide any Karpenter logs?