csm-kb opened this issue 2 months ago (status: Open)
I observed the same issue as you today.
I am observing this bug as well. I have defined an `nvidia.com/gpu` resource requirement in my deployment manifest, and I have a separate gpu-nodepool whose node class uses a Bottlerocket AMI; the only additional constraint I have set is `instance-gpu-count: 1`. For some reason Karpenter rejects a g5g.xlarge node claim that has 3000m+ CPU, yet it cannot schedule a deployment that requires only 150m. Please help.
I think I figured it out. If it's a new account, check your Karpenter pod logs: the GPU instances might fail to launch because the MaxSpotInstances quota is exceeded. The node claim is then deleted, and Karpenter reports that there are no instances available to satisfy your requirements. See the AWS docs on Spot Instance limits: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html
> the gpu instances might fail to launch due to MaxSpotInstances exceeded
Hmm, but the node pool specifies both spot and on-demand capacity. So if a spot instance cannot satisfy the request, it should fall back to on-demand, right?
You're right; per the AWS docs it should fall back to on-demand. My only other guess involves daemonsets. Karpenter also adds up daemonset requests and includes them in the resources a node must provide, and if no instance type in the g5/g6 families defined above can fit the total, the pod will not be scheduled. But even then, I doubt that is the actual cause here.
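For context on the daemonset point: Karpenter sums the requests of every daemonset that would land on the candidate node (the daemonset overhead shown in the error log) and adds them, plus their pod count, to the pending pod before checking whether an instance type fits. Below is a minimal sketch of the kind of manifest that contributes to that overhead; the name, image, and request values are hypothetical and not taken from this cluster.

```yaml
# Hypothetical daemonset: its requests (and one pod slot per node) are counted
# as daemonset overhead when Karpenter sizes a candidate node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent    # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-node-agent
  template:
    metadata:
      labels:
        app: example-node-agent
    spec:
      containers:
        - name: agent
          image: registry.k8s.io/pause:3.9   # placeholder image
          resources:
            requests:
              cpu: 100m      # counted toward the daemonset overhead
              memory: 100Mi
```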
Description
Context:
Hey! I have Karpenter deployed very neatly to an EKS cluster using FluxCD to automatically manage Helm charts:
(click to expand) Helm release for Karpenter
```yaml
# including HelmRepository here, even though it is in a separate file
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  type: "oci"
  url: oci://public.ecr.aws/karpenter
  interval: 30m
---
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter-crd
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter-crd
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      serviceName: karpenter
      serviceNamespace: karpenter
      port: 8443
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      port: 8443
    replicas: 2
    logLevel: debug
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 1
          memory: 1Gi
    settings:
      clusterName: "bench-cluster"
      interruptionQueue: "Karpenter-bench-cluster"
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::…"
```

I then have three `NodePool`s (and associated `EC2NodeClass`es) that take different workloads, depending on what pods get launched with what affinities/taints to request where they go. The two `NodePool`s that rely on normal compute instance types like C/M/R work very well, and Karpenter works flawlessly to scale the node pools and serve those pods!

However...
Observed Behavior:
The third `NodePool` is for workloads that require a G instance with NVIDIA compute to run. Simple enough, right? YAML:
(click to expand) Karpenter resource definition YAML
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
        - key: example.com/taint-ep-base
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

The CPU and memory limits are set just as the others are, and leave plenty of room for the G instance specs based on the docs.
This is defined identically to the other functional `NodePool`s, except for the G instance family specifications (particularly the newer card offerings).

When Karpenter takes this in, and I launch a pod with the necessary Kubernetes specs:
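A minimal sketch of what such a pod looks like (the pod name and image here are placeholders, not the actual Argo workload): it tolerates the `ep-base` taint, selects the pool's label, and requests one GPU.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                    # placeholder name, not the real workload
spec:
  nodeSelector:
    example.com/taint-ep-base: "true"     # label applied by the ep-base NodePool template
  tolerations:
    - key: example.com/taint-ep-base      # matches the NodePool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1               # served by the NVIDIA device plugin on the node
```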
It validates it successfully and attempts to spin up a node to serve it... to yield the following:
(click to expand) kubectl logs output of JSON, formatted
```json
{
  "level": "DEBUG",
  "time": "2024-08-26T21:51:35.612Z",
  "logger": "controller",
  "caller": "scheduling/scheduler.go:220",
  "message": "226 out of 801 instance types were excluded because they would breach limits",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
  "NodePool": { "name": "ep-base" }
}
{
  "level": "ERROR",
  "time": "2024-08-26T21:51:35.618Z",
  "logger": "controller",
  "caller": "provisioning/provisioner.go:355",
  "message": "could not schedule pod",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
  "Pod": {
    "name": "e2e-test-stage-kane-p7wck-edge-pipeline-pickle-2973982407",
    "namespace": "argo"
  },
  "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule; incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule; incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [g5.2xlarge g6.2xlarge], example.com/taint-ep-base In [true] (no instance type has enough resources)",
  "errorCauses": [
    { "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule" },
    { "error": "incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule" },
    { "error": "incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family In [g5 g6], example.com/taint-ep-base In [true] (no instance type has enough resources)" }
  ]
}
```

The `scheduler` does checks to filter instance types on available limit overhead, but no matter what set of configs I try, the `provisioner` simply refuses to schedule the pod, without being more explicit about which resources are missing from the instance types it can see (even though the desired instance types very much support the small resource requirements it is reporting on).

Notes and things I have tirelessly tried to get around this:
- This is all in `us-east-1`.
- I removed the `nvidia.com/gpu` resource requirement to try and rule that out, to no avail.
- I bumped the `NodePool` limits to `cpu: 1000` and `memory: 1024Gi` (and removed the `nvidia.com/gpu` limit I had at one point) and watched the filtered instance type count decrease; observed the same issue.
- I switched the `EC2NodeClass` AMI selection to use the latest version of any of the three supported types (AL2, AL2023, Bottlerocket) one after another; observed the same issue.
- I confirmed `nvidia-k8s-device-plugin` was provisioned and active at the latest version in the `kube-system` namespace (there is a manual EC2 G-instance node group in this cluster that is also used for active workloads as a live workaround).

Expected Behavior:
One of two things:

1. An explanation of the `error` above as a config error, internal bug, or cluster bug.
2. For Karpenter to pick an appropriate instance type (e.g. `g4dn`/`g5`/`g6.2xlarge`) to spawn, do so, then assign the pod to the node and let it work its magic.

Reproduction Steps (Please include YAML):
Deploy the resources below to an EKS cluster running Karpenter (`v1.0.1`):

(click to expand) Karpenter resources YAML

(click to expand) YAML subset for requirements
Versions:

- Chart Version: latest (`">=1.0.0 <2.0.0"`)
- Kubernetes Version (`kubectl version`):