aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter Node Claims Stalled by EC2 Launch Failures #6723

Open JacobAmar opened 2 months ago

JacobAmar commented 2 months ago

Description

Observed Behavior:

We experienced prolonged (~15-minute) delays in node provisioning because AWS was having internal issues launching c6 instances in us-east-1f (and intermittently us-east-1b). The problem was visible in the EC2 console and was not related to capacity constraints.

Currently, Karpenter takes an extended time to react to and recover from such failures.

Expected Behavior:

Karpenter should detect failed launches quickly and fall back to other instance types or Availability Zones rather than leaving node claims stalled for 15 minutes.

Reproduction Steps:

  1. Create the NodePool using nodepool.yaml
  2. Create the Deployment using deploy.yaml

nodepool.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  labels:
    argocd.argoproj.io/instance: karpenter-resources
  name: test-node-pool
spec:
  disruption:
    expireAfter: Never
  template:
    metadata:
      labels:
        project: test-node-pool
    spec:
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - us-east-1c
            - us-east-1f
            - us-east-1b
            - us-east-1d
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values:
            - nano
            - micro
            - small
            - large
            - 12xlarge
            - 16xlarge
            - 18xlarge
            - 24xlarge
            - 32xlarge
            - 36xlarge
            - 48xlarge
            - metal
            - metal-48xl
            - metal-24xl
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - c6i
            - c6in
            - m6i
            - m6in
            - m6id
            - r6i
            - r6id
            - c7i
            - m7i
            - r7i
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux

deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 20
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
         topology.k8s.aws/zone-id: use1-az5
         node.kubernetes.io/instance-type: c6i.8xlarge
         project: test-node-pool
      containers:
      - name: my-app
        image: nginx:latest
        resources:
          requests:
            cpu: "28"
            memory: "40Gi"
          limits:
            memory: "40Gi"

Explanation:

Versions:

jigisha620 commented 2 months ago

Hi @JacobAmar, If Karpenter sees an insufficientCapacityError, it would immediately delete the nodeClaim and start spinning up a new node. It will also update the cache to mark this instance type as unavailable. Can you share logs from when this happened?

JacobAmar commented 2 months ago

> Hi @JacobAmar, If Karpenter sees an insufficientCapacityError, it would immediately delete the nodeClaim and start spinning up a new node. It will also update the cache to mark this instance type as unavailable. Can you share logs from when this happened?

Hi, we encountered an issue where AWS had problems launching c6 instances in us-east-1f (and sometimes us-east-1b). This caused Karpenter nodes to remain in an unready state for an extended period (15 minutes) before being terminated. While AWS is addressing the underlying infrastructure issue, it highlights some limitations in Karpenter's current node provisioning behavior.

Currently, Karpenter has a hardcoded 15-minute timeout before terminating a node that fails to launch, as seen here.

We believe this could be improved with three key enhancements:

  1. Configurable Timeout for Node Termination:

Allow users to configure the timeout duration before Karpenter terminates a node that fails to launch. This provides greater control over node provisioning behavior and allows faster response to transient infrastructure issues.

  2. Instance Type Fallback:

Implement a mechanism for Karpenter to automatically try different instance types after a configurable number of unsuccessful launch attempts for a specific instance type. This would improve provisioning success rates when specific instance types are unavailable or experiencing problems.

  3. Availability Zone Fallback:

Extend the fallback mechanism to include trying different Availability Zones after encountering launch failures within a single AZ for a given period. This would help mitigate issues localized to specific AZs.
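Purely as an illustrative shape for the three proposals, the knobs might live on the NodePool along these lines (none of these fields exist in the karpenter.sh/v1beta1 API today; all field names below are hypothetical):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: test-node-pool
spec:
  template:
    spec:
      # 1. Hypothetical: terminate a NodeClaim that has not become ready
      #    within this window, instead of the hardcoded 15 minutes
      registrationTimeout: 5m
  # Hypothetical fallback settings for proposals 2 and 3
  launchFallback:
    # 2. try a different instance type after this many failed launches
    maxAttemptsPerInstanceType: 3
    # 3. stop targeting a zone that has had launch failures for this long
    zoneBackoffPeriod: 10m
```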

These enhancements would make Karpenter more resilient to infrastructure disruptions and provide users with more granular control over node provisioning.

jigisha620 commented 2 months ago

Karpenter would fall back to a different instance type if it fails to launch a nodeClaim, just as you mentioned in your second enhancement suggestion. However, if an issue occurs during registration, we don't expect the instance type to be marked unavailable. While we can consider a configurable timeout for node termination, I was wondering if you could share your logs so we can see whether Karpenter was able to launch the nodeClaims successfully.