felix-zhe-huang closed this issue 2 years ago
what about https://github.com/kubernetes-sigs/descheduler which already implements some of this?
I've been using descheduler for this, but note that you can only use the HighNodeUtilization strategy. If you enable the other strategies, you're in for a great time: Karpenter illegally binds pods to nodes with taints they don't tolerate (the unready taints, for example), descheduler terminates those pods, Karpenter spins up new pods due to #1044, descheduler evicts those pods again, and so on.
Also note that this is really sub-optimal compared to what Cluster Autoscaler (CAS) does. With CAS I can set a threshold of 70-80% and it will really condense the cluster. CAS gets scheduling mostly correct because it simulates kube-scheduler. With descheduler, however, you need really low thresholds, because it will just evict everything on a node once you hit that threshold, hoping it will reschedule elsewhere. So with aggressive limits, it will just terminate pods over and over again.
I think the CAS approach is the correct one. I need to really condense the cluster, and the only way to do that is to simulate what will happen after evictions. At the very least, evicting pods that will simply reschedule back is meaningless.
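The simulate-before-evicting idea can be sketched as a first-fit dry run. This is a minimal illustration under simplifying assumptions, not Karpenter's or CAS's actual algorithm: resources are collapsed to a single CPU quantity and scheduling constraints are ignored.

```python
def pods_fit_elsewhere(node_pods, other_nodes_free):
    """Dry-run a first-fit-decreasing placement: return True only if every
    pod on the candidate node fits into the spare capacity of other nodes,
    so evicting them would actually empty the node instead of churning pods.
    """
    free = sorted(other_nodes_free, reverse=True)
    for pod in sorted(node_pods, reverse=True):
        for i, capacity in enumerate(free):
            if pod <= capacity:
                free[i] -= pod  # place the pod into this node's headroom
                break
        else:
            return False  # this pod would just reschedule back; eviction is pointless
    return True
```

For example, a node running pods requesting 2 and 1 CPUs can be drained when other nodes have 3 and 1 CPUs spare, but a single 4-CPU pod cannot be split across two nodes with 3 CPUs spare each.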
I also think changing instance types is overkill. If I have one remaining node with some CPUs to spare, that is fine. At the least, I would like a strategy that tries to put pods on existing nodes first, then checks whether another instance type fits the remaining node (the one with the lowest utilisation).
A fancier thing to do, which I think comes much later, is to look at the overall memory vs CPU balance in the cluster. E.g. if the cluster shifts from generally CPU-bound to memory-bound, it would be nice if Karpenter could adjust for that. But then we hit a knapsack-like problem that can get a bit tricky to work out.
Currently Karpenter scales down empty nodes automatically, but it does not actively move pods around to create empty nodes. We are hesitant to adopt Karpenter because of this missing major feature for ensuring we run right-sized instances at all times. Can someone let me know what the progress is here? We need this feature badly.
I'd like Karpenter to terminate (or request termination of) a node when it has a low density of pods and there is another node that could take its pods (#1491).
Karpenter looks exciting, but for large-scale K8s cluster deployments, this is pretty much a prerequisite.
Is there any discussion or design document about the possible approaches that can be taken for bin packing of the existing workload?
We're currently laying the foundation by implementing the remaining scheduling spec (affinity/antiaffinity). After that, we plan to make rapid progress on defrag design. I expect we'll start small (e.g. 1 node at a time, simple compactions) and get more sophisticated over time. This is pending design, though.
In the short term, you can combine PodDisruptionBudgets and ttlSecondsUntilExpired to achieve soft defrag.
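Concretely, the suggested workaround might look like the following sketch. The TTL value, PDB threshold, and `app: my-app` label are illustrative assumptions, not recommendations:

```yaml
# Expire nodes after 7 days so workloads get repacked onto fresh capacity,
# while a PodDisruptionBudget limits how many replicas churn at once.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 604800  # 7 days
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb  # hypothetical name
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: my-app  # hypothetical label
```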
In the end my requirement is the same as the others in this thread, but one of the reasons for the requirement, which I have not yet seen captured, is that minimizing (to some reasonable limit) the number of hosts can impact the cost of anything billed per-host (e.g., DataDog APM).
Hi, this is really important, needs to have the same functionality as cluster-autoscaler. This is preventing me from switching to Karpenter.
We also need this feature; our cluster's cost increased when we migrated to Karpenter due to the increased node count.
In our scenario, we use Karpenter in an EMR on EKS cluster, which creates CRs (job batches) on the EKS cluster, and those CRs create pods; we cannot simply add a PodDisruptionBudget for those workloads.
We also need this feature. In our scenario, we would like to terminate under-utilized nodes by actively moving pods around to create empty nodes. Any idea about its release date?
In the short term, you can combine PodDisruptionBudgets and ttlSecondsUntilExpired to achieve soft defrag.
Am I understanding correctly: this could in theory terminate pods that had recently been spun up on an existing node?
Are there any workarounds right now to move affected pods to another node so there is no interruption? Or would this only be achieved with the tracked feature?
We've laid down most of the groundwork for this feature with in-flight nodes, not binding pods, topology, affinity, etc. This is one of our highest priorities right now, and we'll be sure to make early builds available to y'all. If you're interested in discussing the design, feel free to drop by https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md
As written, the feature request wouldn't consider an approach where it becomes cost-effective to run smaller nodes. I'm imagining a case where the spot market does not provide large instances, but there is spot capacity for smaller ones.
Does this feature request need to consider any of:
Some of those might be designed for but not initially implemented.
- load skewing (as defined in "Energy-aware server provisioning and load dispatching for connection-intensive internet services") to pack the workload into the smallest number of nodes. Load skewing is similar to the HighNodeUtilization strategy of the Kubernetes descheduler, but explicitly considers marking compute nodes for draining when a cluster scale-in is desired.
One method we're looking at for consolidation is removing nodes when the node's pods can run on other nodes in the cluster. Unless I misunderstand, this is effectively "load skewing", as we are reducing the number of nodes to concentrate the workloads on the remaining nodes.
- rate limits on evictions (to manage the impact of shifting many small existing Pods onto a new large node if that becomes the most appropriate option, or vice versa)
Kubernetes has a concept of a rate limit on pod evictions and it's controlled by the Pod Disruption Budget that selects the pods. Our consolidation mechanisms will adhere to any PDBs that are defined.
- interaction with pre-emptible / background jobs
- let's say you have a workload that wants lots of RAM, say 3TiB of it per replica. But you also have another use for the remaining CPU / RAM in the instance. You want to manage some easily preemptible Pods to eat up the available spare resource, but you also don't want those background jobs to influence the overall cluster scaling in any way.
We don't have support for this currently. I believe that CAS has a similar feature through its expendable-pods-priority-cutoff flag. Can you write a separate feature request issue for this?
- node fit and interaction with scheduling extension mechanisms - see https://github.com/kubernetes/kubernetes/issues/48657 for discussion of how these might look.
I believe that this will be a problem for any autoscaler. Using different schedulers which follow different rules or extensions mechanisms that change the scheduling rules inevitably will lead to issues.
For load skewing, before you remove a node from the cluster, you gracefully drain the workloads that are running on it. For example, a scheduler / descheduler could use the eviction API to trigger the removal of a Pod, wait for that to complete, and then avoid placing a new Pod onto the node that is earmarked for draining.
Our consolidation mechanisms will adhere to any PDBs that are defined.
PDB is a useful mechanism and definitely worth accounting for. For nodes that run very large Pods, maybe also consider the impact on API Priority and Fairness. Imagine if draining a small number of large nodes that do work for one namespace led to higher Pod creation latency for all namespaces.
For load skewing, before you remove a node from the cluster, you gracefully drain the workloads that are running on it. For example, a scheduler / descheduler could use the eviction API to trigger the removal of a Pod, wait for that to complete, and then avoid placing a new Pod onto the node that is earmarked for draining.
Consolidation will work that way. We'll use the Evict API the same way we currently use it for node expiration.
The implementation is subject to change, and this is not intended for any production use, but if anyone is interested in trying out consolidation, you should be able to edit and use the following:
export CLUSTER_NAME="<INSERT_CLUSTER_NAME>"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text)"
export COMMIT="63e6d43d0c6b30b260c63dc01afcb58baabc8020"
# install the snapshot
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version v0-${COMMIT} --namespace karpenter \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
--set clusterName=${CLUSTER_NAME} \
--set clusterEndpoint=${CLUSTER_ENDPOINT} \
--set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
--wait
# update the provisioner CRD to have the consolidation parameter
kubectl replace -f https://raw.githubusercontent.com/aws/karpenter/${COMMIT}/charts/karpenter/crds/karpenter.sh_provisioners.yaml
To enable consolidation for a provisioner, replace the ttlSecondsAfterEmpty parameter with a consolidation section:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
If you test this and want to roll back to a previous version, ensure that you remove the consolidation parameter from your provisioner before uninstalling.
@tzneal Many thanks for the work on this! I wanted to give it a try in a development environment but I have had issues with the karpenter certificate. The controller reports it does not trust the issuer of the cert when accessing the webhook. I never had this issue with the official chart. Did you face this same issue in any of your tests?
Here the error logs from the webhook container:
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.021Z ERROR webhook.DefaultingWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "98e9ea47-e4e6-49ac-8580-9ca9dfa0fbb9", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "52.788µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.021Z ERROR webhook.ValidationWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "4e4bc2bd-0bf2-4b73-801a-e8f558c47996", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "50.594µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.097Z ERROR webhook.ValidationWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "c43895bd-7e88-45d0-80ba-dd8596ed9e1f", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "39.71µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.098Z ERROR webhook.DefaultingWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "023d4057-bfeb-4b50-8458-b675b7fd369e", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "27.876µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.297Z ERROR webhook.DefaultingWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "dbecc636-6c9c-435d-9060-9fac1791d575", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "188.82357ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.297Z ERROR webhook.ValidationWebhook Reconcile error {"commit": "47afd62", "knative.dev/traceid": "0c0d2c22-bff1-4f2e-919e-1121add42bbf", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "189.487858ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.provisioners.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
Example log from the webhook showing the controller hitting those certificate issues during the TLS handshake:
karpenter-58b5fbdbbd-knd4w webhook 2022/07/14 09:16:03 http: TLS handshake error from 10.69.15.149:36166: remote error: tls: bad certificate
@offzale Thanks for reporting this, the issue is unrelated to consolidation but I'm working on a set of clean upgrade instructions and will update the post here.
@offzale For now, restarting the karpenter deployment a few times should resolve the issue (kubectl rollout restart deployment -n karpenter karpenter). There's a webhook that doesn't appear to get the correct certificate attached consistently at startup, so restarting a few times will usually resolve the issue. We're looking into a better fix.
I restarted the pod after a couple of minutes and that did the trick indeed. Thanks!
I will leave it running and keep an eye on it to gather some feedback. What I have noticed so far is an increase in CPU usage of about 640%, compared to the resources it normally takes.
Also, I believe it would be handy to have some sort of threshold configuration, e.g. I want Karpenter to consider that the nodes cannot take any further load at 85% resource allocation, to leave some room for cron jobs' runs. Otherwise, I could imagine the cluster constantly scaling up and down every time a few cron jobs run at once. But this could be a future improvement of course :)
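The threshold idea above could look something like this minimal sketch. It is hypothetical (no such knob exists in this Karpenter snapshot) and collapses resources to a single CPU quantity:

```python
def has_headroom(allocatable_cpu, requested_cpu, threshold=0.85):
    """Treat a node as 'full' once requests cross a configurable fraction of
    allocatable capacity, deliberately leaving headroom for bursty cron jobs."""
    return requested_cpu < threshold * allocatable_cpu
```

For example, an 8-CPU node with 6.5 CPUs requested still has headroom at an 85% threshold (6.8 CPUs), while 7 CPUs requested does not.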
Thanks, can you provide some information on the number of pods/nodes in your cluster overall?
I am testing it in a small 5-6 node cluster. It is great that Karpenter has been getting all pods running on the fewest nodes possible, taking advantage of all available resources in the cluster. However, I faced the issue of Karpenter leaving no room at all on the nodes, so the cron jobs' runs make the cluster scale every other minute. The cluster has been scaling a node up/down every ~90 seconds on average for the last day. I am working around this issue by having the cluster-overprovisioner reserve some resources at the moment.
Other than that, it is working really well so far.
Hi,
I'm trying to test it as well, and got the following error while trying to add a Provisioner:
Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": Post "https://karpenter.karpenter.svc:443/?timeout=10s": x509: certificate signed by unknown authority
I restarted the pods multiple times without any success.
Unfortunately, it appears to be random at the moment which mutating webhook configuration gets updated, so you may need to restart the deployment several times with kubectl rollout restart deployment -n karpenter karpenter, letting it start running between restarts.
Could it be something else? I tried deleting the karpenter pods 20 times.
@tzneal can you check the certificate issue? I have restarted the karpenter pods more than several times, but it's not working.
I posted a new snapshot this morning that should have resolved this issue. Please let me know if you still run into problems.
The new commit fixed the issue. Thanks, I'm starting to check it as well.
Enabling the consolidation doesn't work for me.
I'm using Argo CD, and here is the diff when I replace ttlSecondsAfterEmpty with consolidation:
But I get an error:
2022-07-22T06:09:18.502Z DEBUG webhook AdmissionReview patch={ type: , body: } {"commit": "062a029", "knative.dev/kind": "karpenter.sh/v1alpha5, Kind=Provisioner", "knative.dev/namespace": "", "knative.dev/name": "mobile-devops-dev-spot", "knative.dev/operation": "UPDATE", "knative.dev/resource": "karpenter.sh/v1alpha5, Resource=provisioners", "knative.dev/subresource": "", "knative.dev/userinfo": "{system:serviceaccount:argocd:argocd-application-controller df7d59f7-82fb-4d0e-b0c2-41b6d13fe6bd [system:serviceaccounts system:serviceaccounts:argocd system:authenticated] map[]}", "admissionreview/uid": "4b08ed49-9964-4ec4-8fe8-ddad11f781b8", "admissionreview/allowed": false, "admissionreview/result": "&Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}
I verified and I'm using the latest CRD that contains:
$ kubectl get crd provisioners.karpenter.sh -o json | jq '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.consolidation'
{
  "description": "Consolidation are the consolidation parameters",
  "properties": {
    "enabled": {
      "description": "Enabled enables consolidation if it has been set",
      "type": "boolean"
    }
  },
  "type": "object"
}
$ kubectl get crd provisioners.karpenter.sh -o json | jq '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.ttlSecondsAfterEmpty'
{
"description": "TTLSecondsAfterEmpty is the number of seconds the controller will wait before attempting to delete a node, measured from when the node is detected to be empty. A Node is considered to be empty when it does not have pods scheduled to it, excluding daemonsets. \n Termination due to no utilization is disabled if this field is not set.",
"format": "int64",
"type": "integer"
}
@liorfranko That sort of looks like the webhook has not been updated. Is there a chance you're running the old webhook image?
I'm running this image of webhook: https://github.com/aws/karpenter/blob/571b507deb9e8fad8b4d7189ba8cdc1bf095d465/charts/karpenter/values.yaml#L126
That shouldn't be the one that the helm chart above installed. The two images for the snapshot should be:
public.ecr.aws/karpenter/controller:571b507deb9e8fad8b4d7189ba8cdc1bf095d465
public.ecr.aws/karpenter/webhook:571b507deb9e8fad8b4d7189ba8cdc1bf095d465
Thanks, it's working now. I'll run it next week on a nice group of nodes.
@liorfranko Great! Let us know how it works for you.
public.ecr.aws/karpenter/controller:571b507deb9e8fad8b4d7189ba8cdc1bf095d465 public.ecr.aws/karpenter/webhook:571b507deb9e8fad8b4d7189ba8cdc1bf095d465
Running the previous helm command produced a YAML file with these images, which caused the webhook to fail. After swapping out the images, it worked fine.
$ ag image:
karpenter.yaml
305: image: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31
344: image: public.ecr.aws/karpenter/webhook:v0.13.2@sha256:e10488262a58173911d2b17d6ef1385979e33334807efd8783e040aa241dd239
error
Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}
@dennisme I can't reproduce this:
$ export COMMIT="63e6d43d0c6b30b260c63dc01afcb58baabc8020"
$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version v0-${COMMIT} --namespace karpenter \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
--set clusterName=${CLUSTER_NAME} \
--set clusterEndpoint=${CLUSTER_ENDPOINT} \
--set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
--wait
Release "karpenter" has been upgraded. Happy Helming!
NAME: karpenter
LAST DEPLOYED: Mon Jul 25 15:37:46 2022
NAMESPACE: karpenter
STATUS: deployed
REVISION: 55
TEST SUITE: None
$ k get deployment -n karpenter -o yaml | grep image
image: public.ecr.aws/karpenter/controller:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:b66a0943cb07f2fcbd9dc072bd90e5dc3fd83896a95aedf3b41d994172d1f96b
imagePullPolicy: IfNotPresent
image: public.ecr.aws/karpenter/webhook:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:208a715b5774e10d1f70f48ef931c19541ef3c2d31c52f04e648af53fd767692
imagePullPolicy: IfNotPresent
@tzneal Yep, it was a local helm version issue and the OCI registry vs https://charts.karpenter.sh/. My output is consistent with yours. Thanks for the reply.
However, I faced the issue of Karpenter leaving no room at all on the nodes, so the cron jobs' runs make the cluster scale every other minute. The cluster has been scaling a node up/down every ~90 seconds on average for the last day. I am working around this issue by having the cluster-overprovisioner reserve some resources at the moment.
@offzale: I think you want to increase your provisioner.spec.ttlSecondsAfterEmpty to longer than your cron job period. This will keep the idle nodes around from the 'last' cron job run.
Alternatively, maybe shutting them down and recreating them is actually the right thing to do? This depends on the time intervals involved, cost of idle resources, desired 'cold' responsiveness, and instance shutdown/bootup overhead. Point being that I don't think there's a general crystal-ball strategy here that we can use for everyone... Unfortunately, I think you will need to tune it based on what you know about your predictable-future-workload and your desired response delay vs cost tradeoffs.
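For instance, with a cron job that fires every 15 minutes, an empty-node TTL longer than the period would keep the idle node alive between runs. Values here are illustrative, not a recommendation:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 1200  # 20 min, longer than the 15-min cron period
```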
I had a couple of things I needed to solve before testing Karpenter, and once I solved them, here is the first comparison:
I'm testing it by replacing an ASG with ~60 m5n.8xlarge instances. I’m running 27 deployments and 1 daemonset, a total of ~430 very diverse pods. Each deployment has anti-affinity so that each pod will not be deployed with other pods from the same deployment on the same node.
The total CPU request of all the pods is ~1800 cores, and the total memory request is 6.8TB. On the ASG, the allocatable CPU was 1950 cores (150 cores unallocated), and the allocatable memory was 7.93TB (1.13TB unallocated).
With Karpenter, the allocatable CPU is 2400 cores (510 cores unallocated), and the allocatable memory is 13.8TB (6.18TB unallocated).
Here is the diversity of the nodes with Karpenter:
2 r6a.8xlarge
2 r5.2xlarge
1 c4.8xlarge
2 r5ad.4xlarge
3 r5ad.8xlarge
6 r5ad.4xlarge
6 r5ad.8xlarge
1 r5ad.4xlarge
1 r5ad.8xlarge
1 r5ad.4xlarge
2 r5ad.8xlarge
1 c6i.12xlarge
2 r5ad.4xlarge
1 r5ad.8xlarge
1 r5ad.4xlarge
4 r5ad.8xlarge
1 r5ad.4xlarge
3 r5ad.8xlarge
1 c6i.12xlarge
1 r5ad.4xlarge
3 r5ad.8xlarge
1 r5ad.4xlarge
1 r5n.4xlarge
1 r5.2xlarge
1 r5dn.4xlarge
1 r5.2xlarge
2 c6i.12xlarge
1 r6a.12xlarge
2 c6i.12xlarge
2 r6a.8xlarge
1 r5.2xlarge
5 c6i.12xlarge
1 r6a.8xlarge
1 r5.2xlarge
1 r5n.4xlarge
3 c6i.12xlarge
1 r6a.8xlarge
2 c6i.12xlarge
1 r5.2xlarge
1 r5n.4xlarge
2 r5.2xlarge
1 r5n.4xlarge
1 r6a.8xlarge
1 r5n.4xlarge
1 c6i.12xlarge
2 r5.2xlarge
1 c5.2xlarge
2 c6i.12xlarge
Thanks for the info @liorfranko. What does your provisioner look like? Karpenter implements two forms of consolidation. The first is where it will delete a node if the pods on that node can run elsewhere. Due to the anti-affinity rules on your pods, it sounds like this isn't possible.
The second is where it will replace a node with a cheaper node if possible. This should be happening in your case unless the provisioner is overly constrained to larger types only. Since you've got a few 2xlarge types there, that doesn't appear to be the case either.
That looks to be 85 nodes that Karpenter has launched. Do your workloads have preferred anti-affinities or node selectors?
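The two forms described above can be sketched as a simple decision per candidate node. This is a simplification of what Karpenter actually evaluates: one CPU dimension, an aggregate fit instead of a real per-node placement check, and a single hypothetical "cheapest fitting node" size:

```python
def consolidation_action(node_pod_requests, other_nodes_free, cheapest_fit_cpu):
    """Pick between the two consolidation forms for one candidate node:
    'delete' if its pods fit on existing spare capacity, 'replace' if a
    cheaper smaller node could host them, otherwise leave the node alone.
    Aggregate sums stand in for a real per-node scheduling simulation."""
    needed = sum(node_pod_requests)
    if needed <= sum(other_nodes_free):
        return "delete"   # form 1: remove the node, pods run elsewhere
    if needed <= cheapest_fit_cpu:
        return "replace"  # form 2: launch a cheaper node and retire this one
    return None
```

Anti-affinity rules like the ones described would make the "delete" branch unreachable in practice, leaving only the "replace" form.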
Here is the provisioner spec:
spec:
  consolidation:
    enabled: true
  labels:
    intent: apps
    nodegroup-name: delivery-network-consumers-spot
    project: mobile-delivery-network-consumers
  providerRef:
    name: delivery-network-consumers-spot
  requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1b
    - us-east-1d
    - us-east-1e
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - nano
    - micro
    - small
    - large
    - 16xlarge
    - 18xlarge
    - 24xlarge
    - 32xlarge
    - 48xlarge
    - metal
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
    - t3
    - t3a
    - im4gn
    - is4gen
    - i4i
    - i3
    - i3en
    - d2
    - d3
    - d3en
    - h1
    - c4
    - r4
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
Here is an example of possible consolidation:
kubectl describe nodes ip-10-206-7-199.ec2.internal
Name: ip-10-206-7-199.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=r5.2xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
intent=apps
karpenter.k8s.aws/instance-cpu=8
karpenter.k8s.aws/instance-family=r5
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-memory=65536
karpenter.k8s.aws/instance-pods=58
karpenter.k8s.aws/instance-size=2xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/provisioner-name=delivery-network-consumers-spot
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-206-7-199.ec2.internalec2ssa.info
kubernetes.io/os=linux
node.kubernetes.io/instance-type=r5.2xlarge
nodegroup-name=delivery-network-consumers-spot
project=mobile-delivery-network-consumers
topology.ebs.csi.aws.com/zone=us-east-1b
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1b
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-08f8f5e78bc941095"}
node.alpha.kubernetes.io/ttl: 15
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 08 Aug 2022 20:24:27 +0300
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-206-7-199.ec2.internal
AcquireTime: <unset>
RenewTime: Mon, 08 Aug 2022 20:41:06 +0300
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 08 Aug 2022 20:38:36 +0300 Mon, 08 Aug 2022 20:25:06 +0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 08 Aug 2022 20:38:36 +0300 Mon, 08 Aug 2022 20:25:06 +0300 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 08 Aug 2022 20:38:36 +0300 Mon, 08 Aug 2022 20:25:06 +0300 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 08 Aug 2022 20:38:36 +0300 Mon, 08 Aug 2022 20:25:36 +0300 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.206.7.199
ExternalIP: 54.197.31.63
Hostname: ip-10-206-7-199.ec2.internal
InternalDNS: ip-10-206-7-199.ec2.internal
InternalDNS: ip-10-206-7-199.ec2ssa.info
ExternalDNS: ec2-54-197-31-63.compute-1.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65047656Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 7910m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 64030824Ki
pods: 58
System Info:
Machine ID: ec2d577aa3820e3e2f33858018d3bd99
System UUID: ec2d577a-a382-0e3e-2f33-858018d3bd99
Boot ID: 8ce73586-0ff6-4c6b-bbef-01172b11230c
Kernel Version: 5.4.204-113.362.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.4.13
Kubelet Version: v1.20.15-eks-99076b2
Kube-Proxy Version: v1.20.15-eks-99076b2
ProviderID: aws:///us-east-1b/i-08f8f5e78bc941095
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
delivery-apps kafka-backup-mrmjz 500m (6%) 1 (12%) 600Mi (0%) 1Gi (1%) 16m
delivery-apps taskschd-consumer-78569c5c67-fcvqp 2200m (27%) 3200m (40%) 3572Mi (5%) 3572Mi (5%) 12m
istio-system istio-cni-node-kwz56 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16m
kube-system aws-node-gwp4n 10m (0%) 0 (0%) 0 (0%) 0 (0%) 16m
kube-system aws-node-termination-handler-7k5qb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16m
kube-system ebs-csi-node-fqdvr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16m
kube-system kube-proxy-nqk9w 100m (1%) 0 (0%) 0 (0%) 0 (0%) 16m
logging filebeat-8czsb 500m (6%) 2 (25%) 1Gi (1%) 1Gi (1%) 16m
monitoring kube-prometheus-stack-prometheus-node-exporter-qtlfj 50m (0%) 0 (0%) 100Mi (0%) 100Mi (0%) 16m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3360m (42%) 6200m (78%)
memory 5296Mi (8%) 5720Mi (9%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 16m kubelet Starting kubelet.
Warning InvalidDiskCapacity 16m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 16m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 16m (x3 over 16m) kubelet Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 16m (x3 over 16m) kubelet Node ip-10-206-7-199.ec2.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 16m (x3 over 16m) kubelet Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientPID
Normal Starting 15m kube-proxy Starting kube-proxy.
Normal NodeReady 15m kubelet Node ip-10-206-7-199.ec2.internal status is now: NodeReady
The pod taskschd-consumer-78569c5c67-fcvqp is the only application pod on that node; all the rest are daemonsets. It could be moved to ip-10-206-30-103.ec2.internal:
kubectl describe nodes ip-10-206-30-103.ec2.internal
Name: ip-10-206-30-103.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=c6i.12xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
intent=apps
karpenter.k8s.aws/instance-cpu=48
karpenter.k8s.aws/instance-family=c6i
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-memory=98304
karpenter.k8s.aws/instance-pods=234
karpenter.k8s.aws/instance-size=12xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/provisioner-name=delivery-network-consumers-spot
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-206-30-103.ec2.internalec2ssa.info
kubernetes.io/os=linux
node.kubernetes.io/instance-type=c6i.12xlarge
nodegroup-name=delivery-network-consumers-spot
project=mobile-delivery-network-consumers
topology.ebs.csi.aws.com/zone=us-east-1b
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1b
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-04e505e8cc763973f"}
node.alpha.kubernetes.io/ttl: 15
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 08 Aug 2022 13:26:01 +0300
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-206-30-103.ec2.internal
AcquireTime: <unset>
RenewTime: Mon, 08 Aug 2022 20:41:58 +0300
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
Ready True Mon, 08 Aug 2022 20:38:21 +0300 Mon, 08 Aug 2022 13:27:10 +0300 KubeletReady kubelet is posting ready status
MemoryPressure False Mon, 08 Aug 2022 20:38:21 +0300 Mon, 08 Aug 2022 13:26:40 +0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 08 Aug 2022 20:38:21 +0300 Mon, 08 Aug 2022 13:26:40 +0300 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 08 Aug 2022 20:38:21 +0300 Mon, 08 Aug 2022 13:26:40 +0300 KubeletHasSufficientPID kubelet has sufficient PID available
Addresses:
InternalIP: 10.206.30.103
ExternalIP: 54.226.6.184
Hostname: ip-10-206-30-103.ec2.internal
InternalDNS: ip-10-206-30-103.ec2.internal
InternalDNS: ip-10-206-30-103.ec2ssa.info
ExternalDNS: ec2-54-226-6-184.compute-1.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 48
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 97323012Ki
pods: 234
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 47810m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 94323716Ki
pods: 234
System Info:
Machine ID: ec237b6d4fea6dcf056164a0fb4aad15
System UUID: ec237b6d-4fea-6dcf-0561-64a0fb4aad15
Boot ID: 7136840a-db82-4574-8313-2b39acce9907
Kernel Version: 5.4.204-113.362.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.4.13
Kubelet Version: v1.20.15-eks-99076b2
Kube-Proxy Version: v1.20.15-eks-99076b2
ProviderID: aws:///us-east-1b/i-04e505e8cc763973f
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
delivery-apps capping-consumer-6c77c7fbd8-z8j9w 6 (12%) 10 (20%) 11Gi (12%) 11Gi (12%) 7h15m
delivery-apps device-install-consumer-7f78685654-k25f6 6 (12%) 8 (16%) 15860Mi (17%) 15860Mi (17%) 7h16m
delivery-apps kafka-backup-s6l46 500m (1%) 1 (2%) 600Mi (0%) 1Gi (1%) 7h15m
delivery-apps track-ad-consumer-688c575f45-2gvbn 7 (14%) 10 (20%) 56820Mi (61%) 56820Mi (61%) 7h15m
istio-system istio-cni-node-g9m2x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7h15m
kube-system aws-node-8fbx2 10m (0%) 0 (0%) 0 (0%) 0 (0%) 7h15m
kube-system aws-node-termination-handler-qspnn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7h15m
kube-system ebs-csi-node-cjnrf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7h15m
kube-system kube-proxy-bmnxm 100m (0%) 0 (0%) 0 (0%) 0 (0%) 7h15m
logging filebeat-2nq4s 500m (1%) 2 (4%) 1Gi (1%) 1Gi (1%) 3h10m
monitoring kube-prometheus-stack-prometheus-node-exporter-rlx8c 50m (0%) 0 (0%) 100Mi (0%) 100Mi (0%) 7h15m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 20160m (42%) 31 (64%)
memory 85668Mi (93%) 86092Mi (93%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
What does the spec.affinity look like for taskschd-consumer-78569c5c67-fcvqp?
I think it's related to several PDBs that were configured with minAvailable: 100%. Let me change it and I'll get back to you.
It had almost no effect: the total number of cores decreased by only 20 and the memory by 200GB.
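For context, a PDB like the ones described would look roughly like this (the name and selector below are hypothetical; use policy/v1 on clusters v1.21+ instead of policy/v1beta1). With minAvailable: 100%, no replica may ever be voluntarily evicted, which blocks consolidation entirely:

```yaml
# Hypothetical PDB illustrating the problem: minAvailable of 100% means
# every voluntary eviction (drain, descheduling, consolidation) is denied.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: taskschd-consumer-pdb
spec:
  minAvailable: 100%
  selector:
    matchLabels:
      app: taskschd-consumer
```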
I think the problem is related to the chosen instance types. I see many c6i.12xlarge nodes where the CPU allocation is only half full, but the memory is fully utilized:
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 48
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 97323012Ki
pods: 234
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 22160m (46%) 34 (71%)
memory 87716Mi (95%) 88140Mi (95%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
On the other hand, I see many r6a.8xlarge nodes where the CPU allocation is full, and the memory is only 35% utilized.
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 32
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 258598332Ki
pods: 234
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 29160m (91%) 40 (125%)
memory 88716Mi (35%) 89140Mi (35%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Both of the above can be replaced with an m5.8xlarge each.
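A rough sanity check of that claim, using the request totals from the two node descriptions above (this ignores kubelet/system reservations, so real allocatable capacity is somewhat lower than raw capacity):

```python
# Hypothetical fit check: do the workloads on each node fit within
# an m5.8xlarge's raw capacity (32 vCPU, 128 GiB)?
M5_8XLARGE = {"cpu_m": 32 * 1000, "mem_mi": 128 * 1024}

# Total CPU/memory requests taken from the node descriptions above
node_requests = {
    "c6i.12xlarge": {"cpu_m": 22160, "mem_mi": 87716},
    "r6a.8xlarge": {"cpu_m": 29160, "mem_mi": 88716},
}

fits = {
    name: req["cpu_m"] <= M5_8XLARGE["cpu_m"]
    and req["mem_mi"] <= M5_8XLARGE["mem_mi"]
    for name, req in node_requests.items()
}
print(fits)  # {'c6i.12xlarge': True, 'r6a.8xlarge': True}
```

Both sets of requests fit, though the r6a.8xlarge workload is close to the CPU limit once allocatable overhead is subtracted.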
I looked at your provisioner and it looks like these are spot nodes. We currently don't replace spot nodes with smaller spot nodes. The reasoning for this is that we don't have a way of knowing if the node we would replace it with is as available or more available than the node that is being replaced. By restricting the instance size we could potentially be moving you from a node that you're likely to keep for a while to a node that will be terminated in a short period of time.
If these were on-demand nodes, then we would replace them with smaller instance types.
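For anyone following along: since Karpenter v0.15 (v1alpha5 API), consolidation is opt-in per Provisioner. The field names below match that API version and may differ in later releases; the provisioner name is the one from the node labels above:

```yaml
# Sketch of enabling consolidation on the provisioner (Karpenter v1alpha5).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: delivery-network-consumers-spot
spec:
  consolidation:
    enabled: true
  # Note: ttlSecondsAfterEmpty must be left unset when consolidation
  # is enabled; the two mechanisms are mutually exclusive.
```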
Thanks @tzneal
Do you know when it will be supported? Or could you at least let me choose to enable it?
And is the reason for choosing the c6i.12xlarge over the m5.8xlarge the probability of interruptions?
For spot, we use the capacity-optimized-prioritized strategy when we launch the node. The strategies are documented here but essentially it makes a trade-off of a slightly more expensive node in exchange for less chance of interruptions.
Node size is also not always related to the cost, I just checked the spot pricing for us-east-1 at https://aws.amazon.com/ec2/spot/pricing/ and the r6a.8xlarge was actually cheaper than m5.8xlarge.
m5.8xlarge $0.3664 per Hour
r6a.8xlarge $0.3512 per Hour
c6i.12xlarge $0.4811 per Hour
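One way to compare these beyond raw hourly cost is price per vCPU, using the prices quoted above and the vCPU counts from the node capacities shown earlier; by that measure the larger c6i.12xlarge is actually the cheapest:

```python
# Spot prices quoted above (USD/hour, us-east-1) paired with vCPU counts.
spot = {
    "m5.8xlarge": (0.3664, 32),
    "r6a.8xlarge": (0.3512, 32),
    "c6i.12xlarge": (0.4811, 48),
}

per_vcpu = {name: price / vcpus for name, (price, vcpus) in spot.items()}
for name in sorted(per_vcpu, key=per_vcpu.get):
    print(f"{name}: ${per_vcpu[name]:.5f} per vCPU-hour")
```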
Thanks @tzneal for all the information.
So far everything works well. I'll monitor it for a couple more days and let you know. Do you have an estimate of when the current commit will be released?
Tell us about your request As a cluster admin, I want Karpenter to consolidate application workloads by moving pods onto fewer worker nodes and scaling down the cluster, so that I can improve the cluster's resource utilization rate.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? In an under-utilized cluster, application pods are spread across worker nodes with excess amounts of resources. This wasteful situation can be improved by carefully packing the pods onto a smaller number of right-sized worker nodes. The current version of Karpenter does not support rearranging pods to continuously improve cluster utilization. Workload consolidation is the important missing component needed to complete the cluster scaling lifecycle management loop.
This workload consolidation feature is nontrivial because of the following coupling problems.
The above problems are deeply coupled, so that solving one affects the others. Together they form a variant of the bin-packing problem, which is NP-complete. A practical solution will implement a quick heuristic algorithm that exploits the special structure of the problem for specific use cases and user preferences. Therefore, thorough discussion with customers is important.
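To make the heuristic angle concrete, here is a minimal first-fit-decreasing sketch of one-dimensional bin packing. The pod sizes and node capacity are made up, and a real consolidator must additionally respect affinity rules, PDBs, and multi-dimensional (CPU plus memory) fit:

```python
# First-fit-decreasing: a classic bin-packing heuristic. Sort items
# largest-first, place each on the first node with room, and provision
# a new node only when nothing fits.
def first_fit_decreasing(requests, node_capacity):
    free = []    # remaining capacity of each node
    packed = []  # pods placed on each node (parallel to `free`)
    for size in sorted(requests, reverse=True):
        for i, remaining in enumerate(free):
            if size <= remaining:
                free[i] -= size
                packed[i].append(size)
                break
        else:  # no existing node has room; add a new node
            free.append(node_capacity - size)
            packed.append([size])
    return packed

# CPU requests in millicores, packed onto 8-vCPU (8000m) nodes
result = first_fit_decreasing([6000, 3000, 3000, 2000, 1000], 8000)
print(result)  # [[6000, 2000], [3000, 3000, 1000]] -- two nodes suffice
```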
Are you currently working around this issue? Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.
Additional context Currently the workload consolidation feature is in the design phase. We should gather input from customers about their objectives and preferences.
Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)