thelabdude closed this issue 1 year ago.
Can you do a describe on one of the nodes that launched, so we can look at the resources the node has after launch?
Yes, we have a few daemonsets on these nodes but they are tiny. Here's one of the nodes that came up:
Name: ip-xxx-yy-187-131.us-west-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m6id.xlarge
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=SPOT
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2c
k8s.io/cloud-provider-aws=96a469b634ca7e71303fa61fa2302c91
karpenter.k8s.aws/instance-ami-id=ami-0173eacf6deadbace
karpenter.k8s.aws/instance-category=m
karpenter.k8s.aws/instance-cpu=4
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=m6id
karpenter.k8s.aws/instance-generation=6
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=237
karpenter.k8s.aws/instance-memory=16384
karpenter.k8s.aws/instance-network-bandwidth=1562
karpenter.k8s.aws/instance-pods=58
karpenter.k8s.aws/instance-size=xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/provisioner-name=karp-spot
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-xxx-yy-187-131.us-west-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=m6id.xlarge
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2c
vpc.amazonaws.com/has-trunk-attached=false
Annotations: alpha.kubernetes.io/provided-node-ip: xxx.yy.187.131
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 31 Mar 2023 13:01:25 -0600
Taints: tolerates-spot=true:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-xxx-yy-187-131.us-west-2.compute.internal
AcquireTime: <unset>
RenewTime: Fri, 31 Mar 2023 13:02:35 -0600
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 31 Mar 2023 13:02:25 -0600 Fri, 31 Mar 2023 13:02:14 -0600 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 31 Mar 2023 13:02:25 -0600 Fri, 31 Mar 2023 13:02:14 -0600 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 31 Mar 2023 13:02:25 -0600 Fri, 31 Mar 2023 13:02:14 -0600 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 31 Mar 2023 13:02:25 -0600 Fri, 31 Mar 2023 13:02:25 -0600 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: xxx.yy.187.131
Hostname: ip-xxx-yy-187-131.us-west-2.compute.internal
InternalDNS: ip-xxx-yy-187-131.us-west-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 231332304Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16119004Ki
pods: 58
vpc.amazonaws.com/pod-eni: 18
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 3920m
ephemeral-storage: 212122109190
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15102172Ki
pods: 58
vpc.amazonaws.com/pod-eni: 18
System Info:
Machine ID: ec28d4981141e106d3637450a82dc2bc
System UUID: ec20e910-20e8-51dc-e75d-f9d759bbd4cb
Boot ID: 5ca81981-a600-4a67-b95f-f02e3c2560fb
Kernel Version: 5.4.228-132.418.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.6
Kubelet Version: v1.23.15-eks-49d8fe8
Kube-Proxy Version: v1.23.15-eks-49d8fe8
ProviderID: aws:///us-west-2c/i-0435cb3cb4c7effbd
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-wmkgv 30m (0%) 0 (0%) 0 (0%) 0 (0%) 76s
kube-system ebs-csi-node-ksjf8 150m (3%) 300m (7%) 120Mi (0%) 768Mi (5%) 77s
kube-system kube-proxy-c5q9g 100m (2%) 0 (0%) 0 (0%) 0 (0%) 77s
prometheus-stack mon-prometheus-node-exporter-978sf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 280m (7%) 300m (7%)
memory 120Mi (0%) 768Mi (5%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
vpc.amazonaws.com/pod-eni 1 1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeAccepted 76s yunikorn node ip-xxx-yy-187-131.us-west-2.compute.internal is accepted by the scheduler
Normal RegisteredNode 73s node-controller Node ip-xxx-yy-187-131.us-west-2.compute.internal event: Registered Node ip-xxx-yy-187-131.us-west-2.compute.internal in Controller
Normal Starting 28s kubelet Starting kubelet.
Warning InvalidDiskCapacity 28s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 28s (x3 over 28s) kubelet Node ip-xxx-yy-187-131.us-west-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 28s (x3 over 28s) kubelet Node ip-xxx-yy-187-131.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 28s (x3 over 28s) kubelet Node ip-xxx-yy-187-131.us-west-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 28s kubelet Updated Node Allocatable limit across pods
Normal Starting 24s kube-proxy Starting kube-proxy.
Normal NodeReady 17s kubelet Node ip-xxx-yy-187-131.us-west-2.compute.internal status is now: NodeReady
What does your AWSNodeTemplate look like?
The only differences in the AWSNodeTemplate between spot and on-demand are: 1) the spot one has a script in userData that mounts the ephemeral disk, and 2) the on-demand one declares a blockDeviceMapping for mounting an EBS volume.
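A simplified sketch of that kind of userData mount step (not the exact script; the device name, mount point, and the MIME multipart wrapper Karpenter merges for the AL2 AMI family are all assumptions here):

```yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: karp-spot-template   # placeholder name
spec:
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    # Format the local NVMe instance store and mount it where the kubelet keeps
    # ephemeral data, so the node reports the ~220Gi seen in the describe above.
    # /dev/nvme1n1 is an assumption -- the device name varies by instance type.
    mkfs.xfs /dev/nvme1n1
    mkdir -p /var/lib/kubelet
    mount /dev/nvme1n1 /var/lib/kubelet
    --BOUNDARY--
```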
I removed the ephemeral-storage: 5Gi from my deployment spec and now Karpenter is only allocating a single r6id.xlarge instance initially. So is there anything I need to specify to tell Karpenter I'm using the ephemeral disk for ephemeral storage?
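(The request in question looked roughly like this in the container spec — an illustrative fragment, not the full deployment:)

```yaml
# Illustrative fragment only -- not the full deployment spec
resources:
  requests:
    ephemeral-storage: 5Gi   # the request that was removed
```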
> the spot one has a script in userData that mounts the ephemeral disk
Are you mounting an ephemeral-storage disk that Karpenter is unaware of? What's most likely happening here is that Karpenter assumes your ephemeral-storage capacity is 20Gi by default if you don't specify blockDeviceMappings in your AWSNodeTemplate. That means this is what it assumes when it's scheduling, which is why it's breaking your workloads up into separate nodes.
Once the node comes up, it sees that there's actually ~220Gi of storage, so it's able to consolidate down all the nodes it just launched.
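To spell that out, declaring the capacity via blockDeviceMappings looks roughly like this (size and device name are placeholders, not a recommendation):

```yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: karp-spot-template   # placeholder name
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda  # root device of the EKS-optimized AL2 AMI
      ebs:
        volumeSize: 200Gi    # this is the ephemeral-storage the scheduler will assume
        volumeType: gp3
        deleteOnTermination: true
```

The trade-off discussed further down is that this provisions an EBS volume of one fixed size for every instance size, rather than using the local NVMe.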
Yes, we're mounting the NVMe disks that come with d instances. Seems like Karpenter should support this.
> Yes, we're mounting the NVMe disks that come with d instances. Seems like Karpenter should support this
Agreed, we're tracking this issue here: #2723. It's a bit complex because we either have to support instance type overrides or some sort of annotation-based mechanism (like CAS has) for understanding what your ephemeral-storage will actually be, or we have to do the mounting for you.
@bwagner5 was actually doing some work in the AL2 AMI to automatically RAID0 the instance store volumes and mount them by default, which would mean that, once that AMI was released and widely adopted, Karpenter could at minimum assume the volume size for those instances.
Do you have any thoughts or expectations on how you would like to see Karpenter handle this case?
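(For context, "RAID0 the instance store volumes" amounts to something like the following on a node with multiple local NVMe disks — a generic sketch, not the actual AMI change:)

```bash
# Generic sketch: stripe two local NVMe instance-store disks into one array,
# then format and mount it for kubelet ephemeral storage.
# Device names and mount point are assumptions and vary by instance type.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
mkfs.xfs /dev/md0
mkdir -p /var/lib/kubelet
mount /dev/md0 /var/lib/kubelet
```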
Unfortunately, this is the second time this 20Gi default has bitten me :-( If you look at the log I posted, there's zero indication that more instances are needed because a pod's ephemeral-storage request didn't fit into the default 20Gi.
Part of the disconnect here is that I thought Karpenter looked up this metadata in AWS; I had to go through the code again to realize Karpenter does not look up the instance storage for the specified types. Can it not look that up in some metadata service?
I don't have a strong opinion on how this should be solved right now; I need to read through all the links related to #2723 more carefully. The AMI approach sounds promising.
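(For what it's worth, the instance-store size does seem to be available via the EC2 DescribeInstanceTypes API rather than the instance metadata service; an illustrative query:)

```bash
# Returns InstanceStorageInfo, including TotalSizeInGB, for the given type
aws ec2 describe-instance-types \
  --instance-types m6id.xlarge \
  --query 'InstanceTypes[].InstanceStorageInfo'
```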
Naively though, why not just let me specify a mapping of instance type / size to ephemeral storage in the Provisioner? People end up doing all kinds of funky things with these disks, so an optional mapping config where I can specify xlarge = 237, 2xlarge = 474, etc. would at least be better than what I have now: not using d instances and having to fit a single EBS vol size to all instance sizes.
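Purely to illustrate the idea (this is a hypothetical field — nothing like it exists in Karpenter today):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: karp-spot
spec:
  # ...existing requirements, providerRef, etc...
  instanceStorageOverrides:   # hypothetical field, for illustration only
    xlarge: 237Gi
    2xlarge: 474Gi
```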
Renaming this issue because, beyond #2723, I think the logs need to report the ephemeral disk as part of the instance selection during scale-up, especially when the 20Gi default has been applied. If you look at the logs I posted earlier:
2023-03-31T18:20:38.857Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.866Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.876Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.887Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.897Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.908Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
2023-03-31T18:20:38.918Z INFO controller.provisioner launching machine with 3 pods requesting {"cpu":"526m","ephemeral-storage":"15Gi","memory":"1125Mi","pods":"7","vpc.amazonaws.com/pod-eni":"1"} from types c6id.8xlarge, m6id.2xlarge, r6id.xlarge, r6id.2xlarge, r6id.16xlarge and 19 other(s) {"commit": "7131be2-dirty", "provisioner": "karp-spot"}
There's nothing to indicate that the instances have the default 20Gi ephemeral storage limit imposed. Of course, if #2723 gets fixed, maybe this issue just goes away.
@thelabdude We're now logging the selected instance type capacity after the capacity is launched (#3695). Hopefully this helps track down issues in the future around ephemeral-storage constraints.
I think I'm going to close this at this point since we'll track the instance storage ask in #2723. Feel free to re-open if you have additional thoughts or comments.
Version
Karpenter Version: v0.27.1
Kubernetes Version: v1.23.16-eks-48e63af
Expected Behavior
Karpenter is initially over-allocating instances (a poor fit for the pending pods) for a deployment that prefers spot instances; the behavior for on-demand seems correct (or at least a better initial fit). Consolidation gets the fit right, but this leads to unnecessary pod evictions very soon after the pods start. My sense is that something is off with Karpenter's initial calculations for spot here, but I'm not sure.
Actual Behavior
Doing some basic comparisons of behavior between spot and on-demand using a simple deployment. With spot, I'm seeing Karpenter spin up way too many instances and then consolidate them down to a right-sized instance in a second pass (see logs). When using on-demand, Karpenter seems to do a better fit initially. The main difference between my spot and on-demand configurations is that I'm using d instances for spot (e.g. r6id) but non-d instances for on-demand (EBS-only instance types like r6i). My logic here is that since Karpenter currently forces me to have a one-size-fits-all EBS vol for all instance sizes (https://github.com/aws/karpenter/issues/2723), I'll take the price hit with d instance spots vs. EBS-only spots. That's probably irrelevant to this issue but wanted to mention it just in case.

When I initially submit the deployment that prefers spot, I see these instances being started by Karpenter:
After about 2 minutes, Karpenter consolidates down to a single node (see logs below showing this activity):
Steps to Reproduce the Problem
Here's my simple test deployment:
Note that I mutate the pod using OPA to add:
And a toleration for a taint my AWSNodeTemplate adds to the spot nodes.
Resource Specs and Logs
Here's the provisioner for spot:
Here's the relevant log section showing the activity when the deployment is added to the cluster: