Closed: ilkinmammadzada closed this issue 3 months ago.
@ilkinmammadzada Would you mind providing your nodeclass configuration as well as your deployment configuration?
I added EC2NodeClass details as well.
@ilkinmammadzada Can you share the NodeClaim associated with one of these requests? That should have all of the instance types that we are trying to launch with, which should help track down whether the spot capacity we are trying to launch is simply all more expensive than our cheapest on-demand capacity.
We have this filtering function which automatically gets rid of any spot capacity that is more expensive than the cheapest on-demand instance type. The rationale being: why would you go get a spot instance type if an on-demand instance type is cheaper and more available for the pod capacity that you need right now.
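The filtering described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not Karpenter's actual code or API; the `Offering` type and function name are made up for this example.

```python
# Hypothetical sketch of the spot-price filter described above: drop any
# spot offering priced above the cheapest on-demand offering, since
# on-demand would then be both cheaper and more available.
from dataclasses import dataclass

@dataclass
class Offering:
    instance_type: str
    capacity_type: str  # "spot" or "on-demand"
    price: float        # USD per hour

def filter_expensive_spot(offerings):
    on_demand_prices = [o.price for o in offerings
                        if o.capacity_type == "on-demand"]
    if not on_demand_prices:
        return offerings  # no on-demand baseline to compare against
    cheapest_on_demand = min(on_demand_prices)
    # Keep every on-demand offering, plus only the spot offerings that are
    # strictly cheaper than the cheapest on-demand price.
    return [o for o in offerings
            if o.capacity_type == "on-demand" or o.price < cheapest_on_demand]
```

Under this logic, a large spot instance that costs more than the cheapest on-demand option is removed from consideration entirely, which is consistent with the behavior reported in this issue.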
@jonathan-innis yeah, sure. They were expired and deleted. I will share once new on-demand EC2s appear.
I'm also curious about the WARN log line (at least 5 instance types are recommended when flexible to spot but requesting on-demand, the current provisioning request only has 1 instance type options). I can see that Karpenter had 28 different instance types in the previous log line. Probably Karpenter filtered them out, as @jonathan-innis mentioned.
It happened again; you can check the example NodeClaim below. Probably unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi).
You may notice a region difference (us-west-2 vs. us-east-1). Their configurations are identical.
- apiVersion: karpenter.sh/v1beta1
  kind: NodeClaim
  metadata:
    annotations:
      karpenter.k8s.aws/tagged: "true"
    creationTimestamp: "2024-02-29T17:03:48Z"
    finalizers:
    - karpenter.sh/termination
    generateName: default-useast1bprod-
    generation: 1
    labels:
      karpenter.k8s.aws/instance-category: m
      karpenter.k8s.aws/instance-cpu: "128"
      karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
      karpenter.k8s.aws/instance-family: m6i
      karpenter.k8s.aws/instance-generation: "6"
      karpenter.k8s.aws/instance-hypervisor: nitro
      karpenter.k8s.aws/instance-memory: "524288"
      karpenter.k8s.aws/instance-network-bandwidth: "50000"
      karpenter.k8s.aws/instance-size: 32xlarge
      karpenter.sh/capacity-type: on-demand
      karpenter.sh/nodepool: default-useast1bprod
      kubernetes.io/arch: amd64
      kubernetes.io/os: linux
      node.kubernetes.io/instance-type: m6i.32xlarge
      topology.kubernetes.io/region: us-east-1
      topology.kubernetes.io/zone: us-east-1b
      xxx.com/pool: default
    name: default-useast1bprod-xxxx
    ownerReferences:
    - apiVersion: karpenter.sh/v1beta1
      blockOwnerDeletion: true
      kind: NodePool
      name: default-useast1bprod
      uid: 70af2e0a-50cb-4b1e-9828-5cdace4a3735
    resourceVersion: "14879954814"
    uid: 0551e82c-d740-4d32-95a9-1e98e646426e
  spec:
    kubelet:
      maxPods: 200
      systemReserved:
        cpu: "2"
        memory: 4G
    nodeClassRef:
      name: default-useast1bprod
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - i4i.32xlarge
      - i4i.metal
      - m6i.32xlarge
      - m6i.metal
      - m6id.32xlarge
      - m6id.metal
      - m6idn.32xlarge
      - m6idn.metal
      - m6in.32xlarge
      - m6in.metal
      - r6i.32xlarge
      - r6i.metal
      - r6id.32xlarge
      - r6id.metal
      - r6idn.32xlarge
      - r6idn.metal
      - r6in.32xlarge
      - r6in.metal
      - x2idn.32xlarge
      - x2idn.metal
      - x2iedn.32xlarge
      - x2iedn.metal
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - us-east-1b
    - key: karpenter.sh/capacity-type
      operator: In
      values:
      - on-demand
      - spot
    - key: karpenter.sh/nodepool
      operator: In
      values:
      - default-useast1bprod
    - key: xxx.com/pool
      operator: In
      values:
      - default
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values:
      - "31"
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values:
      - c5
      - c5d
      - c5n
      - c6i
      - c6id
      - c6in
      - c7i
      - i4i
      - m5
      - m5d
      - m5dn
      - m5n
      - m5zn
      - m6i
      - m6id
      - m6idn
      - m6in
      - m7i
      - r5
      - r5b
      - r5dn
      - r5n
      - r6i
      - r6id
      - r6idn
      - r6in
      - r7i
      - x2idn
      - x2iedn
      - x2iezn
    resources:
      requests:
        cpu: 125200m
        ephemeral-storage: 37Gi
        memory: 265898Mi
        pods: "20"
    startupTaints:
    - effect: NoSchedule
      key: key
  status:
    allocatable:
      cpu: 125610m
      ephemeral-storage: 89Gi
      memory: 484033846Ki
      pods: "200"
      vpc.amazonaws.com/pod-eni: "107"
    capacity:
      cpu: "128"
      ephemeral-storage: 100Gi
      memory: 484966Mi
      pods: "200"
      vpc.amazonaws.com/pod-eni: "107"
    conditions:
    - lastTransitionTime: "2024-03-01T09:03:48Z"
      severity: Warning
      status: "True"
      type: Expired
    - lastTransitionTime: "2024-02-29T17:07:57Z"
      status: "True"
      type: Initialized
    - lastTransitionTime: "2024-02-29T17:03:52Z"
      status: "True"
      type: Launched
    - lastTransitionTime: "2024-02-29T17:07:57Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-02-29T17:07:37Z"
      status: "True"
      type: Registered
    imageID: ami-xxxx
    nodeName: xxxxx
    providerID: aws:///us-east-1b/i-xxxxx
I just checked the CreateFleet log to confirm which EC2 types were filtered. Apparently Karpenter added only one EC2 type.
And it's really weird that I can't find a "spot" attempt before the "on-demand" one (checking logs between the NodeClaim creation time and the NodeClaim launch time).
I'm seeing an error right before the on-demand launch: eventName: DescribeLaunchTemplates -> errorCode: Client.InvalidLaunchTemplateName.NotFoundException, errorMessage: At least one of the launch templates specified in the request does not exist.
IIRC I saw the same error in the previous case as well. Could it be a clue to a potential bug?
I will try to check when it happens again.
{
  "CreateFleetRequest": {
    "TargetCapacitySpecification": {
      "DefaultTargetCapacityType": "on-demand",
      "TotalTargetCapacity": 4
    },
    "Type": "instant",
    "OnDemandOptions": {
      "AllocationStrategy": "lowest-price"
    },
    "LaunchTemplateConfigs": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "karpenter.k8s.aws/xxx",
        "Version": "$Latest"
      },
      "Overrides": {
        "ImageId": "ami-xxx",
        "AvailabilityZone": "us-east-1b",
        "tag": 1,
        "SubnetId": "subnet-xxx",
        "InstanceType": "m6i.32xlarge"
      },
      "tag": 1
    },
    "TagSpecification": []
  }
}
Probably unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi)
There is an eventual consistency issue with how we compute ephemeral storage. We are tracking it in a GitHub issue - https://github.com/aws/karpenter-provider-aws/issues/5756
@ilkinmammadzada Are there multiple NodePools and NodeClasses in this cluster?
There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following "m, i, r, x" families and "32xl, metal" sizes:
The NodePool you have provided does not match the NodeClaim that was provisioned. Can you provide the default-useast1bprod NodePool?
Probably unrelated to this issue, but you can see that the ephemeral-storage information is also wrong (ephemeral-storage: 89Gi); I would expect 225Gi (volumeSize: 225Gi)
There is an eventual consistency issue with how we compute ephemeral storage. We are tracking it in a GitHub issue - #5756
I think "eventual consistency" doesn't apply here, because Karpenter needs to know the correct ephemeral-storage size of the host before launching. Otherwise it will not launch any instance if your unschedulable pod requests 150Gi of ephemeral storage.
You may notice a region difference (us-west-2 vs. us-east-1). Their configurations are identical.
@engedaam as I mentioned above, they are identical; the only difference is the region.
You may notice a region difference (us-west-2 vs. us-east-1). Their configurations are identical.
There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following "m, i, r, x" families and "32xl, metal" sizes:
Can you also share your deployment? Since Karpenter is responding to the pending pods, it would be good to know which pods Karpenter was trying to provision this NodeClaim for.
FWIW, we had ~700 unschedulable pods during that time, but I found all of them and confirmed that none of them has any specific requirement (like instance-type, instance-category, etc.).
@ilkinmammadzada I want to take a step back here to understand a bit better why you want spot over on-demand here, and why not just constrain the NodePool to only support spot instance types. My assumption, based on the information that you reported, is that we are scheduling too many pods to a single node, which is causing the only capacity that's available to be on-demand.
I'm assuming that this is a concern to you because it would be cheaper to have two smaller spot nodes that schedule all of the capacity vs. having a single on-demand node to schedule all of the capacity.
@jonathan-innis we included on-demand to prevent an outage due to spot instance capacity problems. We were thinking that Karpenter would launch on-demand only if there is no spot instance available. ("If your Karpenter NodePool allows both Spot and On-Demand capacity, Karpenter will fallback to provision On-Demand capacity if there is no Spot capacity available." - https://karpenter.sh/docs/faq/#what-if-there-is-no-spot-capacity-will-karpenter-use-on-demand)
Yes, you are right, our concern is only cost. We think it was possible to launch some spot EC2s to respond to those unschedulable pods.
There are several NodePools, but they are mutually exclusive (xxx.com/pool). I would like to understand why Karpenter wanted to launch only the following "m, i, r, x" families and "32xl, metal" sizes:
From the nodeClaim that you shared earlier, the workload needs -
allocatable: cpu: 125610m ephemeral-storage: 89Gi memory: 484033846Ki pods: "200" vpc.amazonaws.com/pod-eni: "107"
Upon filtering by the instance families provided in the NodePool and the available offerings that satisfy these constraints, we are left with the "m, i, r, x" families in "32xl, metal" sizes. Karpenter further filters these instance types to remove metal if more appropriate instance types are available.
Did you also see any logs mentioning anything about "UnfulfillableCapacity"? Such instance types would also be dropped.
Upon filtering by the instance families provided in the NodePool and the available offerings that satisfy these constraints, we are left with the "m, i, r, x" families in "32xl, metal" sizes.
Looking at what happened here, I think you may be hitting an issue with Karpenter packing you so tightly that we are using such big instance types that there is no spot availability anymore for these instance types. If your NodePool is flexible to both spot and on-demand, we don't know in our scheduling algorithm that putting you onto on-demand only instance types isn't what you wanted here. The typical way that we tell users to configure fallback is through "weight" since that best communicates to the scheduler that it should stop continuing to pack pods on a node, since it would then fall outside of the capacity type that you want.
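The weighted-fallback layout described above might look like the following. This is an illustrative sketch, not the reporter's actual configuration: the NodePool names and weight values are made up, and the other NodePool fields (nodeClassRef, limits, disruption, etc.) are omitted for brevity.

```yaml
# Illustrative only: a higher-weight spot-only NodePool is tried first;
# Karpenter falls back to the lower-weight on-demand NodePool only when
# the spot NodePool cannot satisfy the pending pods.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-spot        # hypothetical name
spec:
  weight: 100               # higher weight = tried first
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-on-demand   # hypothetical name
spec:
  weight: 10                # lower weight = fallback
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
```

Splitting capacity types this way tells the scheduler explicitly where the spot/on-demand boundary is, instead of letting a single flexible NodePool pack pods until only on-demand instance types remain.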
Another option here (if you didn't want to strictly break up NodePools into different weights) would be to use the new minValues field that we shipped recently with our requirements block. You could add this field to your node.kubernetes.io/instance-type and make it something like 30 or 40 so that Karpenter never overly constrains your instance types when it is scheduling out your pods.
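A minValues requirement like the one suggested above might be sketched as follows. This is an illustrative fragment only; check the Karpenter documentation for your version's exact schema before using it.

```yaml
# Illustrative only: require that at least 30 instance-type options remain
# after scheduling constraints are applied, so a provisioning request is
# never narrowed down to a single instance type.
requirements:
- key: node.kubernetes.io/instance-type
  operator: Exists
  minValues: 30
```

If fewer than 30 instance types would satisfy the request, the scheduling decision fails rather than producing an over-constrained launch like the single-type CreateFleet call seen earlier in this thread.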
From the nodeClaim that you shared earlier, the workload needs -
allocatable: cpu: 125610m ephemeral-storage: 89Gi memory: 484033846Ki pods: "200" vpc.amazonaws.com/pod-eni: "107"
I don't think it exactly means our workload needed those resources. IIRC, a launched NodeClaim's allocatable shows how many resources the recently launched NodeClaim will have (from Karpenter's view).
Did you also see any logs mentioning anything about "UnfulfillableCapacity"? Such instance types would also be dropped.
Is it okay to check karpenter_nodeclaims_terminated{reason="insufficient_capacity"} metric or would checking logs be more accurate? Thank you!
The typical way that we tell users to configure fallback is through "weight" since that best communicates to the scheduler that it should stop continuing to pack pods on a node, since it would then fall outside of the capacity type that you want.
Yeah, I was also thinking about that, but haven't tried it yet, because I'm seeing that the default-usXXXprod NodePool fails quite often when provisioning a new EC2 instance (https://github.com/aws/karpenter-provider-aws/issues/5756). I suspected "spot-default" would fail and "on-demand-default" would succeed, and we would end up having more on-demand than now. Worth trying, though; I will.
Another option here (if you didn't want to strictly break up NodePools into different weights) would be to use the new minValues field that we shipped recently with our requirements block. You could add this field to your node.kubernetes.io/instance-type and make it something like 30 or 40 so that Karpenter never overly constrains your instance types when it is scheduling out your pods.
Hmm, we started to upgrade Karpenter but haven't experimented with the new feature yet. Will try this one as well.
Thank you!
I suspected "spot-default" would fail and "on-demand-default" would succeed, and we would end up having more on-demand than now.
I don't think this would be the behavior here. It would actually give us more information to use in the scheduler if you used weighting. We would try the spot NodePool and, if we got InsufficientCapacity on the launch, we would go and try to reschedule the pods again, retrying the spot NodePool all over again but this time with smaller instance types, since we know that we couldn't get the bigger ones. We would only fall back to on-demand once we had no more options for spot on that first NodePool.
If you are using a single NodePool with fallback, the behavior is a bit different and a little more unexpected. In that case, you get the behavior that you are seeing, where we keep packing pods and constraining down without considering the fact that we may remove instance types in a way that leaves us launching only on-demand.
@jonathan-innis thanks a lot for the great explanation! I will try this week
Is it okay to check karpenter_nodeclaims_terminated{reason="insufficient_capacity"} metric or would checking logs be more accurate?
I think the logs are probably the best way. Can you confirm that you saw InsufficientCapacity errors around this time? We should log the offering that we are removing at the DEBUG level. If you are just running INFO you won't see it, but if you have DEBUG, you should look for a log like "removing offering from offerings".
After discussing a bit more offline, I want to summarize the issues that we think we are hitting here:
1. We are hashing resource.Quantity values incorrectly. This includes the volumeSize field in the blockDeviceMappings as well as the value fields in the kubeReserved and systemReserved portions of the KubeletConfiguration. The practical impact of this is that NodeClasses that are similar but not the same may get returned from the instance type cache that we use to save memory for the number of instance type structs that can be stored in memory at any given time. This was fixed in #5816 and will be back-ported in patch releases all the way back to the v0.32.x minor version.
2. The instanceStorePolicy caching issue (#5756) may also have been a culprit here, since that may also have been returning inconsistent information across NodePools/NodeClasses.
Thanks @jonathan-innis! We upgraded to the new version 0.35.2 and it fixed the problem.
Description
Observed Behavior:
Karpenter launches on-demand instances even though the spot market has enough EC2 capacity to meet our requirements.
message: {"level":"INFO","time":"2024-02-27T17:43:27.399Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17d6c05","nodepool":"default-uswest2bprod","nodeclaim":"default-uswest2bprod-xbxdw","requests":{"cpu":"120","ephemeral-storage":"45Gi","memory":"194860Mi","pods":"22"},"instance-types":"c6i.32xlarge, c6i.metal, c6id.32xlarge, c6id.metal, c6in.32xlarge and 23 other(s)"}
message: {"level":"WARN","time":"2024-02-27T17:43:30.235Z","logger":"controller.nodeclaim.lifecycle","message":"at least 5 instance types are recommended when flexible to spot but requesting on-demand, the current provisioning request only has 1 instance type options","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw"}
message: {"level":"INFO","time":"2024-02-27T17:43:31.886Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw","provider-id":"aws:///us-west-2b/i-xxxx","instance-type":"c6i.32xlarge","zone":"us-west-2b","capacity-type":"on-demand","allocatable":{"cpu":"125610m","ephemeral-storage":"206336Mi","memory":"235731254Ki","pods":"200","vpc.amazonaws.com/pod-eni":"107"}}
message: {"level":"INFO","time":"2024-02-27T17:47:05.891Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17d6c05","nodeclaim":"default-uswest2bprod-xbxdw","provider-id":"aws:///us-west-2b/i-xxxx","node":"yyyy"}
Expected Behavior:
I would expect on-demand to be launched only if there are no spot instances matching the given requirements. (Karpenter prioritizes Spot offerings if the NodePool allows both Spot and On-Demand instances.)
Reproduction Steps (Please include YAML):
Versions:
Chart Version: v0.34
Kubernetes Version (kubectl version): v1.23.17

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment