aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Accelerated GPU instance NodePool definition yields error "no instance type satisfied resources" #6884

Open csm-kb opened 2 months ago

csm-kb commented 2 months ago

Description

Context:

Hey! I have Karpenter deployed very neatly to an EKS cluster using FluxCD to automatically manage Helm charts:

Helm release for Karpenter:

```yaml
# including HelmRepository here, even though it is in a separate file
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  type: "oci"
  url: oci://public.ecr.aws/karpenter
  interval: 30m
---
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter-crd
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter-crd
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      serviceName: karpenter
      serviceNamespace: karpenter
      port: 8443
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      port: 8443
    replicas: 2
    logLevel: debug
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 1
          memory: 1Gi
    settings:
      clusterName: "bench-cluster"
      interruptionQueue: "Karpenter-bench-cluster"
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam:::role/KarpenterController-20240815204005347400000005"
```

I then have three NodePools (and associated EC2NodeClasses) that handle different workloads, depending on the affinities/tolerations that launched pods carry to request where they go. The two NodePools that rely on normal compute instance types like C/M/R work very well, and Karpenter scales them flawlessly to serve those pods!

However...

Observed Behavior:

The third NodePool is for workloads that require a G instance with NVIDIA compute to run.

Simple enough, right? YAML:

Karpenter resource definition YAML:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

The CPU and memory limits are set just as they are on the other NodePools, and leave plenty of room for the G instance specs based on the docs.
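As a quick sanity check on those limits (instance figures from the EC2 documentation): a g5.2xlarge or g6.2xlarge offers 8 vCPUs and 32 GiB of memory, so the `cpu: 64` / `memory: 256Gi` limits leave headroom for roughly eight such nodes before the NodePool would breach them.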

This is defined identically to the other functional NodePools, except for the G instance family specifications (particularly the newer card offerings).

When Karpenter takes this in, and I launch a pod with the necessary Kubernetes specs:

```yaml
# an Argo Workflow launches this pod
      resources: # not required; I have tried this without this subset
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity: # standard; works flawlessly to route pods to the tc-base and tc-heavy NodePools
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: "{{workflow.parameters.ep_pod_tolerance}}"
              operator: "Exists"
```

Karpenter validates the pod successfully and attempts to spin up a node to serve it... only to yield the following:

kubectl logs output (JSON, formatted):

```json
{
  "level": "DEBUG",
  "time": "2024-08-26T21:51:35.612Z",
  "logger": "controller",
  "caller": "scheduling/scheduler.go:220",
  "message": "226 out of 801 instance types were excluded because they would breach limits",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
  "NodePool": {
    "name": "ep-base"
  }
}
{
  "level": "ERROR",
  "time": "2024-08-26T21:51:35.618Z",
  "logger": "controller",
  "caller": "provisioning/provisioner.go:355",
  "message": "could not schedule pod",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
  "Pod": {
    "name": "e2e-test-stage-kane-p7wck-edge-pipeline-pickle-2973982407",
    "namespace": "argo"
  },
  "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule; incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule; incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [g5.2xlarge g6.2xlarge], example.com/taint-ep-base In [true] (no instance type has enough resources)",
  "errorCauses": [
    {
      "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule"
    },
    {
      "error": "incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule"
    },
    {
      "error": "incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family In [g5 g6], example.com/taint-ep-base In [true] (no instance type has enough resources)"
    }
  ]
}
```

The scheduler runs checks that filter instance types on available limit overhead -- but no matter what combination of configs I try, the provisioner simply refuses, without being any more explicit about which resources are missing from the instance types it can see (even though the desired instance types comfortably support the small resource requirements it reports).
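One detail visible in the log itself, offered as a hypothesis rather than a confirmed diagnosis: the daemonset overhead reports `pods: 4`, so together with the workload pod the scheduler needs an instance type that can hold `pods: 5`, while the EC2NodeClass above pins `kubelet.maxPods: 1`. If that is the binding constraint, a kubelet block along these lines (the value 8 is illustrative) would at least clear the pod-count check:

```yaml
  kubelet:
    # must cover the workload pod plus DaemonSet pods;
    # the log above reports a DaemonSet overhead of 4 pods
    maxPods: 8
```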

Notes and things I have tirelessly tried to get around this:

Expected Behavior:

One of two things:

  1. The Karpenter controller provides observability into which instance types it filtered out during resource selection and why, so the error above can be disambiguated as a config error, an internal bug, or a cluster bug.
  2. The Karpenter controller selects a G instance type (like a nice g4dn/g5/g6.2xlarge), spawns it, assigns the pod to the node, and lets it work its magic.

Reproduction Steps (Please include YAML):

  1. Deploy the Karpenter resources YAML below to a clean Karpenter-enabled EKS cluster (latest: v1.0.1):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

  2. Attempt to run any test pod with the following Kubernetes requirements (a complete example pod is sketched after these steps):

```yaml
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: example.com/taint-ep-base
              operator: "Exists"
```

  3. Observe the Karpenter controller fail to provision a GPU node.
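For step 2, a minimal self-contained test pod might look like the sketch below. The pod name, image, and command are placeholders, and the explicit toleration is an assumption: the original repro injects it via an Argo Workflow parameter, but the ep-base NodePool's NoSchedule taint means some such toleration has to be present:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test   # placeholder name
spec:
  tolerations:
    - key: example.com/taint-ep-base   # matches the NodePool's NoSchedule taint
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/taint-ep-base
                operator: Exists
  containers:
    - name: cuda-smoke
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
  restartPolicy: Never
```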

Versions:

rlindsberg commented 1 month ago

I observed the same issue as you today.

mDSaifZia commented 3 weeks ago

I am observing this bug as well. I have defined an nvidia.com/gpu resource requirement in my deployment manifest, and I have a separate gpu-nodepool that uses a Bottlerocket-AMI-family node class. The only requirement I have placed on it is karpenter.k8s.aws/instance-gpu-count: 1. For some reason Karpenter is rejecting a g5g.xlarge node claim, which has 3000m+ CPU, and cannot schedule a deployment that requires only 150m. Please help.
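One thing worth checking here (an observation from the EC2 instance catalog, not something confirmed in this thread): g5g instances are Graviton-based, i.e. arm64, so a NodePool requirement like the `kubernetes.io/arch In [amd64]` one earlier in this issue would exclude them regardless of CPU headroom. A requirements sketch that admits them:

```yaml
        - key: kubernetes.io/arch
          operator: In
          # g5g instances run on Graviton (arm64); amd64-only
          # requirements filter them out before resources are considered
          values: ["amd64", "arm64"]
```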

mDSaifZia commented 2 weeks ago

I think I figured it out. If it's a new account, check your Karpenter pod logs: the GPU instances might fail to launch because the spot instance limit (MaxSpotInstances) is exceeded. Your node claim will then be deleted, and it will be reported that there are no instances available to satisfy your requirements. You may check these docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html
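If spot quota is the suspect, one way to rule it out (a sketch against the NodePool requirements shown earlier in this issue) is to temporarily pin the pool to on-demand capacity only:

```yaml
        - key: karpenter.sh/capacity-type
          operator: In
          # temporarily drop "spot" to take spot quota out of the picture
          values: ["on-demand"]
```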

rlindsberg commented 2 weeks ago

> the gpu instances might fail to launch due to MaxSpotInstances exceeded

Hmm, but the node pool specifies both spot and on-demand capacity. So if a spot node cannot satisfy the request, it should fall back to on-demand nodes. Right?

mDSaifZia commented 2 weeks ago

You're right. Per the AWS docs, it should fall back to on-demand. My only other guess involves DaemonSets: the docs mention that Karpenter adds up DaemonSet resource requests and includes them in the pod's total, and if no instance type in the g5/g6 families defined above can fit that total, it will not schedule the pod. But even then, I doubt that is the actual cause here.