aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter workload consolidation/defragmentation #1091

Closed felix-zhe-huang closed 2 years ago

felix-zhe-huang commented 2 years ago

Tell us about your request As a cluster admin, I want Karpenter to consolidate the application workloads by moving pods onto fewer worker nodes and scaling down the cluster, so that I can improve the cluster resource utilization rate.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? In an under-utilized cluster, application pods are spread across worker nodes with excess resources. This wasteful situation can be improved by carefully packing the pods onto a smaller number of right-sized worker nodes. The current version of Karpenter does not support rearranging pods to continuously improve cluster utilization. The workload consolidation feature is the important missing component needed to complete the cluster scaling lifecycle management loop.

This workload consolidation feature is nontrivial because of the following coupling problems.

The above problems are deeply coupled, so a solution to one affects the others. Taken together, the problem is a variant of bin packing, which is NP-complete. A practical solution will implement a quick heuristic algorithm that exploits the special structure of the problem for specific use cases and user preferences. Therefore, thorough discussion with customers is important.

Are you currently working around this issue? Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.

Additional context Currently the workload consolidation feature is in the design phase. We should gather input from customers about their objectives and preferences.

matti commented 2 years ago

what about https://github.com/kubernetes-sigs/descheduler which already implements some of this?

olemarkus commented 2 years ago

I've been using descheduler for this. But note that you can only use the HighNodeUtilization strategy. If you enable the other strategies, you're in for a great time of Karpenter illegally binding pods to nodes with taints they don't tolerate (the unready taints, for example), descheduler terminating those pods, Karpenter spinning up new pods due to #1044, descheduler again evicting those pods ...

Also note that this is really sub-optimal compared to what Cluster Autoscaler (CAS) does. With CAS I can set a threshold of 70-80% and it will really condense the cluster. CAS gets scheduling mostly correct because it simulates kube-scheduler. With descheduler, however, you need really low thresholds, because it will just evict everything on the node once you hit that threshold, hoping it will reschedule elsewhere. So with aggressive limits, it will just terminate pods over and over and over again.
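
For anyone going the descheduler route in the meantime, here is a rough sketch of the HighNodeUtilization policy being described (the thresholds are illustrative only, and the exact schema may differ between descheduler versions, so treat this as an assumption to verify against the descheduler docs):

# DeschedulerPolicy sketch: HighNodeUtilization evicts pods from
# under-utilized nodes so they can be packed onto busier ones.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes below ALL of these values count as under-utilized,
        # and their pods become eviction candidates.
        thresholds:
          cpu: 20
          memory: 20
          pods: 20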

I think the CAS approach is the correct one. I need to really condense the cluster, and the only way to do that is to simulate what will happen after evictions. At least evicting pods that will just reschedule is meaningless.

I also think changing any instance types is overkill. If I have 1 remaining node with some CPUs to spare, that is fine. At minimum I would like a strategy that tries to put pods on existing nodes first, and then checks whether another instance type fits the remaining node (the one with the lowest utilisation).

A fancier thing to do, which I think comes much later, is to look at the overall memory vs CPU balance in the cluster. E.g. if the cluster shifts from generally CPU-bound to memory-bound, it would be nice if Karpenter could adjust for that. But then we hit the knapsack-like problem, which can get a bit tricky to work out.

Anto450 commented 2 years ago

Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.

We are hesitant to use Karpenter because of this missing major feature, which is needed to ensure we run right-sized instances at all times. Can someone let me know the progress here? We need this feature badly.

stevehipwell commented 2 years ago

I'd like Karpenter to terminate (or request termination of) a node when it has a low density of pods and there is another node which could take the node's pods (#1491).

imagekitio commented 2 years ago

Karpenter looks exciting, but for large-scale K8s cluster deployments, this is pretty much a prerequisite.

Is there any discussion or design document about the possible approaches that can be taken for bin packing of the existing workload?

ellistarn commented 2 years ago

We're currently laying the foundation by implementing the remaining scheduling spec (affinity/antiaffinity). After that, we plan to make rapid progress on defrag design. I expect we'll start small (e.g. 1 node at a time, simple compactions) and get more sophisticated over time. This is pending design, though.

In the short term, you can combine PodDisruptionBudgets and ttlSecondsUntilExpired to achieve soft defrag.
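
For example, a minimal sketch of that combination (names and values are placeholders, not recommendations): the Provisioner expires nodes on a schedule, which forces periodic repacking, while a PodDisruptionBudget keeps the resulting drains from disrupting too many replicas at once.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes older than this are drained and replaced, giving Karpenter a
  # periodic chance to repack pods onto fewer / better-sized nodes.
  ttlSecondsUntilExpired: 604800   # 7 days; tune to your tolerance for churn
---
apiVersion: policy/v1              # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                 # hypothetical name
spec:
  maxUnavailable: 1                # at most one replica disrupted at a time
  selector:
    matchLabels:
      app: my-app                  # hypothetical label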

jcogilvie commented 2 years ago

In the end my requirement is the same as the others in this thread, but one of the reasons for the requirement, which I have not yet seen captured, is that minimizing (to some reasonable limit) the number of hosts can impact the cost of anything billed per-host (e.g., DataDog APM).

dragosrosculete commented 2 years ago

Hi, this is really important; it needs to have the same functionality as cluster-autoscaler. This is preventing me from switching to Karpenter.

ryan4yin commented 2 years ago

We also need this feature; our cluster's cost increased when we migrated to Karpenter due to the increased node count.

In our scenario, we use Karpenter in an EMR on EKS cluster, where EMR creates CRs (jobbatch) on the EKS cluster and those CRs create pods, so we cannot simply add PodDisruptionBudgets for those workloads.

mandeepgoyat commented 2 years ago

We also need this feature. In our scenario, we would like to terminate under-utilized nodes by actively moving pods around to create empty nodes.

Any idea about its release date ?

BrewedCoffee commented 2 years ago

In the short term, you can combine poddisruptionbudgets and ttlsecondsuntilexpired to achieve soft defrag.

Am I understanding correctly: this could in theory terminate pods that had recently been spun up on an existing node?

Are there any workarounds right now to move affected pods to another node so there is no interruption? Or would this only be achieved with the tracked feature?

ellistarn commented 2 years ago

We've laid down most of the groundwork for this feature with in-flight nodes, not binding pods, topology, affinity, etc. This is one of our highest priorities right now, and we'll be sure to make early builds available to y'all. If you're interested in discussing the design, feel free to drop by https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md

sftim commented 2 years ago

As written, the feature request wouldn't consider an approach where it becomes cost effective to run smaller nodes. I'm imagining a case where the spot market does not provide large instances but there is spot capacity for smaller ones.

Does this feature request need to consider any of:

Some of those might be designed for but not initially implemented.

tzneal commented 2 years ago

One method we're looking at using for consolidation is by removing nodes when the node's pods can run on other nodes in the cluster. Unless I misunderstand, this is effectively "load skewing" as we are reducing the number of nodes to concentrate the workloads on the remaining nodes.

  • rate limits on evictions (to manage the impact of shifting many small existing Pods onto a new large node if that becomes the most appropriate option, or vice versa)

Kubernetes has a concept of a rate limit on pod evictions and it's controlled by the Pod Disruption Budget that selects the pods. Our consolidation mechanisms will adhere to any PDBs that are defined.

  • interaction with pre-emptible / background jobs
    • let's say you have a workload that wants lots of RAM, say 3TiB of it per replica. But you also have another use for the remaining CPU / RAM in the instance. You want to manage some easily preemptible Pods to eat up the available spare resource, but you also don't want those background jobs to influence the overall cluster scaling in any way.

We don't have support for this currently. I believe that CAS has a similar feature through its expendable-pods-priority-cutoff; can you write a separate feature request issue for this?

I believe that this will be a problem for any autoscaler. Using different schedulers which follow different rules or extensions mechanisms that change the scheduling rules inevitably will lead to issues.

sftim commented 2 years ago

For load skewing, before you remove a node from the cluster, you gracefully drain the workloads that are running on it. For example, a scheduler / descheduler could use the eviction API to trigger the removal of a Pod, wait for that to complete, and then avoid placing a new Pod onto the node that is earmarked for draining.
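
As a concrete illustration of that sequence, the same flow can be done by hand today (the node name is a placeholder); kubectl drain uses the eviction API under the hood and honors PodDisruptionBudgets:

# Mark the node unschedulable so no new pods land on it
kubectl cordon ip-10-0-0-1.ec2.internal

# Gracefully evict its pods via the eviction API (respects PDBs).
# Add --delete-emptydir-data (or --delete-local-data on older kubectl)
# if the pods use emptyDir volumes.
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets

# Once the node is empty it can be terminated / scaled down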

sftim commented 2 years ago

Our consolidation mechanisms will adhere to any PDBs that are defined.

PDB is a useful mechanism and definitely worth accounting for. For nodes that run very large Pods, maybe also consider the impact on API Priority and Fairness. Imagine if draining a small number of large nodes that do work for one namespace led to higher Pod creation latency for all namespaces.

tzneal commented 2 years ago

For load skewing, before you remove a node from the cluster, you gracefully drain the workloads that are running on it. For example, a scheduler / descheduler could use the eviction API to trigger the removal of a Pod, wait for that to complete, and then avoid placing a new Pod onto the node that is earmarked for draining.

Consolidation will work that way. We'll use the eviction API the same way we currently use it for node expiration.

tzneal commented 2 years ago

The implementation is subject to change and this is not intended for any production use, but if anyone is interested in trying out consolidation, you should be able to edit and use the following:

export CLUSTER_NAME="<INSERT_CLUSTER_NAME>"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text)"

export COMMIT="63e6d43d0c6b30b260c63dc01afcb58baabc8020"

# install the snapshot

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version v0-${COMMIT} --namespace karpenter \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set clusterName=${CLUSTER_NAME} \
  --set clusterEndpoint=${CLUSTER_ENDPOINT} \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait

# update the provisioner CRD to have the consolidation parameter

kubectl replace -f https://raw.githubusercontent.com/aws/karpenter/${COMMIT}/charts/karpenter/crds/karpenter.sh_provisioners.yaml

To enable consolidation for a provisioner, replace the ttlSecondsAfterEmpty parameter with a consolidation section:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true

If you test this and want to roll back to a previous version, ensure that you remove the consolidation parameter from your provisioner before uninstalling.
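
One way to do that removal without hand-editing YAML, assuming the provisioner is named default (adjust to your own name):

kubectl patch provisioner default --type=json \
  -p='[{"op": "remove", "path": "/spec/consolidation"}]'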

offzale commented 2 years ago

@tzneal Many thanks for the work on this! I wanted to give it a try in a development environment but I have had issues with the karpenter certificate. The controller reports it does not trust the issuer of the cert when accessing the webhook. I never had this issue with the official chart. Did you face this same issue in any of your tests?

Here the error logs from the webhook container:

karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.021Z     ERROR   webhook.DefaultingWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "98e9ea47-e4e6-49ac-8580-9ca9dfa0fbb9", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "52.788µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.021Z     ERROR   webhook.ValidationWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "4e4bc2bd-0bf2-4b73-801a-e8f558c47996", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "50.594µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.097Z     ERROR   webhook.ValidationWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "c43895bd-7e88-45d0-80ba-dd8596ed9e1f", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "39.71µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.098Z     ERROR   webhook.DefaultingWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "023d4057-bfeb-4b50-8458-b675b7fd369e", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "27.876µs", "error": "secret \"karpenter-cert\" is missing \"ca-cert.pem\" key"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.297Z     ERROR   webhook.DefaultingWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "dbecc636-6c9c-435d-9060-9fac1791d575", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "188.82357ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
karpenter-58b5fbdbbd-knd4w webhook 2022-07-14T09:09:24.297Z     ERROR   webhook.ValidationWebhook       Reconcile error {"commit": "47afd62", "knative.dev/traceid": "0c0d2c22-bff1-4f2e-919e-1121add42bbf", "knative.dev/key": "validation.webhook.provisioners.karpenter.sh", "duration": "189.487858ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.provisioners.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}

Example log from webhook reporting the controller was facing those certificate issues on the TLS handshake:

karpenter-58b5fbdbbd-knd4w webhook 2022/07/14 09:16:03 http: TLS handshake error from 10.69.15.149:36166: remote error: tls: bad certificate

tzneal commented 2 years ago

@offzale Thanks for reporting this, the issue is unrelated to consolidation but I'm working on a set of clean upgrade instructions and will update the post here.

tzneal commented 2 years ago

@offzale For now, restarting the karpenter deployment a few times should resolve the issue (kubectl rollout restart deployment -n karpenter karpenter). There's a webhook that doesn't appear to get the correct certificate attached consistently at startup, so restarting a few times will usually resolve the issue. We're looking into a better fix.

offzale commented 2 years ago

I restarted the pod after a couple of minutes and that did the trick indeed. Thanks!

I will leave it running and keep an eye on it to gather some feedback. What I have noticed so far is an increase in CPU usage of about +640% compared to the resources it normally takes.

Also, I believe it would be handy to have some sort of threshold configuration, e.g. I want Karpenter to consider that a node cannot take any further load at 85% resource allocation, to leave some room for cronjob runs. Otherwise, I could imagine the cluster constantly scaling up and down every time a few cronjobs run at once. But this could be a future improvement of course :)

tzneal commented 2 years ago

Thanks, can you provide some information on the number of pods/nodes in your cluster overall?

offzale commented 2 years ago

Thanks, can you provide some information on the number of pods/nodes in your cluster overall?

I am testing it in a small 5-6 node cluster. It is great that Karpenter has been getting all pods running on the fewest nodes possible now, taking advantage of all available resources in the cluster. However, I faced the issue of Karpenter leaving no room at all on the nodes, so the cronjob runs make the cluster scale every other minute. The cluster has been scaling a node up/down every ~90 seconds on average for the last day. I am working around this issue by having the cluster-overprovisioner reserve some resources at the moment.

Other than that, it is working really well so far.

liorfranko commented 2 years ago

Hi,

I'm trying to test it as well, and got the following error while trying to add a Provisioner:

Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": Post "https://karpenter.karpenter.svc:443/?timeout=10s": x509: certificate signed by unknown authority

I restarted the pods multiple times without any success.

tzneal commented 2 years ago

Unfortunately it appears to be random at the moment which mutating webhook configuration gets updated, so you may need to restart the deployment several times with kubectl rollout restart deployment -n karpenter karpenter, letting it start up between restarts.

liorfranko commented 2 years ago

Could it be something else? I tried deleting the karpenter pods 20 times.

liorfranko commented 2 years ago

@tzneal can you check the certificate issue? I have restarted the karpenter pods more than several times, but it's not working.

tzneal commented 2 years ago

@tzneal can you check the certificate issue?

I have restarted the karpenter pods more than several times, but it's not working.

I posted a new snapshot this morning that should have resolved this issue. Please let me know if you still run into problems.

liorfranko commented 2 years ago

@tzneal can you check the certificate issue? I have restarted the karpenter pods more than several times, but it's not working.

I posted a new snapshot this morning that should have resolved this issue. Please let me know if you still run into problems.

The new commit fixed the issue. Thanks, I'm starting to check it as well.

liorfranko commented 2 years ago

Enabling consolidation doesn't work for me. I'm using Argo CD, and when I replace ttlSecondsAfterEmpty with consolidation (screenshot omitted) I get an error:

2022-07-22T06:09:18.502Z    DEBUG   webhook AdmissionReview patch={ type: , body:  }    {"commit": "062a029", "knative.dev/kind": "karpenter.sh/v1alpha5, Kind=Provisioner", "knative.dev/namespace": "", "knative.dev/name": "mobile-devops-dev-spot", "knative.dev/operation": "UPDATE", "knative.dev/resource": "karpenter.sh/v1alpha5, Resource=provisioners", "knative.dev/subresource": "", "knative.dev/userinfo": "{system:serviceaccount:argocd:argocd-application-controller df7d59f7-82fb-4d0e-b0c2-41b6d13fe6bd [system:serviceaccounts system:serviceaccounts:argocd system:authenticated] map[]}", "admissionreview/uid": "4b08ed49-9964-4ec4-8fe8-ddad11f781b8", "admissionreview/allowed": false, "admissionreview/result": "&Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}

I verified and I'm using the latest CRD that contains:

$ kubectl get crd provisioners.karpenter.sh -o json | jq '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.consolidation'
{
  "description": "Consolidation are the consolidation parameters",
  "properties": {
    "enabled": {
      "description": "Enabled enables consolidation if it has been set",
      "type": "boolean"
    }
  },
  "type": "object"
}
 $ kubectl get crd provisioners.karpenter.sh -o json | jq '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.ttlSecondsAfterEmpty'
{
  "description": "TTLSecondsAfterEmpty is the number of seconds the controller will wait before attempting to delete a node, measured from when the node is detected to be empty. A Node is considered to be empty when it does not have pods scheduled to it, excluding daemonsets. \n Termination due to no utilization is disabled if this field is not set.",
  "format": "int64",
  "type": "integer"
}

tzneal commented 2 years ago

@liorfranko That sort of looks like the webhook has not been updated. Is there a chance you're running the old webhook image?

liorfranko commented 2 years ago

I'm running this image of webhook: https://github.com/aws/karpenter/blob/571b507deb9e8fad8b4d7189ba8cdc1bf095d465/charts/karpenter/values.yaml#L126

tzneal commented 2 years ago

That shouldn't be the one that the helm chart above installed. The two images for the snapshot should be:

public.ecr.aws/karpenter/controller:571b507deb9e8fad8b4d7189ba8cdc1bf095d465
public.ecr.aws/karpenter/webhook:571b507deb9e8fad8b4d7189ba8cdc1bf095d465

liorfranko commented 2 years ago

Thanks, it's working now. I'll run it next week on a nice group of nodes.

tzneal commented 2 years ago

@liorfranko Great! Let us know how it works for you.

dennisme commented 2 years ago

public.ecr.aws/karpenter/controller:571b507deb9e8fad8b4d7189ba8cdc1bf095d465 public.ecr.aws/karpenter/webhook:571b507deb9e8fad8b4d7189ba8cdc1bf095d465

Running the previous helm command produces a yaml file with these images that cause the webhook to fail. After swapping out the images it worked fine.

$ ag image:
karpenter.yaml
305:          image: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31
344:          image: public.ecr.aws/karpenter/webhook:v0.13.2@sha256:e10488262a58173911d2b17d6ef1385979e33334807efd8783e040aa241dd239

error

Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}

tzneal commented 2 years ago

@dennisme I can't reproduce this:


$ export COMMIT="63e6d43d0c6b30b260c63dc01afcb58baabc8020"
$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version v0-${COMMIT} --namespace karpenter \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set clusterName=${CLUSTER_NAME} \
  --set clusterEndpoint=${CLUSTER_ENDPOINT} \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait
Release "karpenter" has been upgraded. Happy Helming!
NAME: karpenter
LAST DEPLOYED: Mon Jul 25 15:37:46 2022
NAMESPACE: karpenter
STATUS: deployed
REVISION: 55
TEST SUITE: None

$ k get deployment -n karpenter -o yaml | grep image
          image: public.ecr.aws/karpenter/controller:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:b66a0943cb07f2fcbd9dc072bd90e5dc3fd83896a95aedf3b41d994172d1f96b
          imagePullPolicy: IfNotPresent
          image: public.ecr.aws/karpenter/webhook:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:208a715b5774e10d1f70f48ef931c19541ef3c2d31c52f04e648af53fd767692
          imagePullPolicy: IfNotPresent

dennisme commented 2 years ago

@tzneal yeppp, it was a local Helm version issue and the OCI registry vs https://charts.karpenter.sh/. My output is consistent with yours. Thanks for the reply.
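
For anyone who hits the same mismatch: installing from an oci:// registry needs a reasonably recent Helm (OCI support only became generally available in Helm 3.8), so it is worth checking the client version first:

# Confirm the local Helm client version before installing from oci://
helm version --short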

anguslees commented 2 years ago

However, I faced the issue of Karpenter leaving no room at all on the nodes, so the cronjob runs make the cluster scale every other minute. The cluster has been scaling a node up/down every ~90 seconds on average for the last day. I am working around this issue by having the cluster-overprovisioner reserve some resources at the moment.

@offzale: I think you want to increase your provisioner.spec.ttlSecondsAfterEmpty to longer than your cron job period. This will keep the idle nodes around from the 'last' cron job run.

Alternatively, maybe shutting them down and recreating them is actually the right thing to do? This depends on the time intervals involved, the cost of idle resources, the desired 'cold' responsiveness, and instance shutdown/boot-up overhead. The point being that I don't think there's a general crystal-ball strategy here that we can use for everyone... Unfortunately, I think you will need to tune it based on what you know about your predictable future workload and your desired response-delay vs cost trade-off.
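
As a sketch of that tuning knob (the value is illustrative, assuming cron jobs that fire roughly every 15 minutes):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Keep an empty node around longer than the gap between cron runs,
  # so the next run reuses it instead of triggering a fresh scale-up.
  ttlSecondsAfterEmpty: 1200   # 20 minutes > a 15-minute cron interval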

liorfranko commented 2 years ago

I had a couple of things I needed to solve before testing Karpenter, and once I solved them, here is the first comparison:

I'm testing it by replacing an ASG with ~60 m5n.8xlarge instances. I’m running 27 deployments and 1 daemonset, a total of ~430 very diverse pods. Each deployment has anti-affinity so that each pod will not be deployed with other pods from the same deployment on the same node.
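
For reference, the per-deployment anti-affinity being described looks roughly like this in each pod template (the label and topology key are assumptions about this setup):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: taskschd-consumer           # hypothetical: the deployment's own label
        topologyKey: kubernetes.io/hostname  # no two replicas on the same node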

The total CPU request of all the pods is ~1800, the total Memory request is 6.8TB. On the ASG, the allocatable CPU was 1950 (150 unallocated cores), and the allocatable Memory 7.93TB (1.13TB unallocated Memory)

With Karpenter the allocatable CPU is 2400 (510 unallocated cores), and the allocatable Memory 13.8TB (6.18TB unallocated Memory)

Here is the diversity of the nodes with Karpenter:

   2 r6a.8xlarge
   2 r5.2xlarge
   1 c4.8xlarge
   2 r5ad.4xlarge
   3 r5ad.8xlarge
   6 r5ad.4xlarge
   6 r5ad.8xlarge
   1 r5ad.4xlarge
   1 r5ad.8xlarge
   1 r5ad.4xlarge
   2 r5ad.8xlarge
   1 c6i.12xlarge
   2 r5ad.4xlarge
   1 r5ad.8xlarge
   1 r5ad.4xlarge
   4 r5ad.8xlarge
   1 r5ad.4xlarge
   3 r5ad.8xlarge
   1 c6i.12xlarge
   1 r5ad.4xlarge
   3 r5ad.8xlarge
   1 r5ad.4xlarge
   1 r5n.4xlarge
   1 r5.2xlarge
   1 r5dn.4xlarge
   1 r5.2xlarge
   2 c6i.12xlarge
   1 r6a.12xlarge
   2 c6i.12xlarge
   2 r6a.8xlarge
   1 r5.2xlarge
   5 c6i.12xlarge
   1 r6a.8xlarge
   1 r5.2xlarge
   1 r5n.4xlarge
   3 c6i.12xlarge
   1 r6a.8xlarge
   2 c6i.12xlarge
   1 r5.2xlarge
   1 r5n.4xlarge
   2 r5.2xlarge
   1 r5n.4xlarge
   1 r6a.8xlarge
   1 r5n.4xlarge
   1 c6i.12xlarge
   2 r5.2xlarge
   1 c5.2xlarge
   2 c6i.12xlarge

tzneal commented 2 years ago

Thanks for the info @liorfranko. What does your provisioner look like? Karpenter implements two forms of consolidation. The first is where it will delete a node if the pods on that node can run elsewhere. Due to the anti-affinity rules on your pods, it sounds like this isn't possible.

The second is where it will replace a node with a cheaper node if possible. This should be happening in your case unless the provisioner is overly constrained to larger types only. Since you've got a few 2xlarge types there, that doesn't appear to be the case either.

That looks to be 85 nodes that Karpenter has launched. Do your workloads have preferred anti-affinities or node selectors?

liorfranko commented 2 years ago

Here is the provisioner spec:

spec:
  consolidation:
    enabled: true
  labels:
    intent: apps
    nodegroup-name: delivery-network-consumers-spot
    project: mobile-delivery-network-consumers
  providerRef:
    name: delivery-network-consumers-spot
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        - us-east-1b
        - us-east-1d
        - us-east-1e
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small
        - large
        - 16xlarge
        - 18xlarge
        - 24xlarge
        - 32xlarge
        - 48xlarge
        - metal
    - key: karpenter.k8s.aws/instance-family
      operator: NotIn
      values:
        - t3
        - t3a
        - im4gn
        - is4gen
        - i4i
        - i3
        - i3en
        - d2
        - d3
        - d3en
        - h1
        - c4
        - r4
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64

Here is an example of possible consolidation:

kubectl describe nodes ip-10-206-7-199.ec2.internal

Name:               ip-10-206-7-199.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=r5.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    intent=apps
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-family=r5
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=65536
                    karpenter.k8s.aws/instance-pods=58
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/provisioner-name=delivery-network-consumers-spot
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-206-7-199.ec2.internalec2ssa.info
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=r5.2xlarge
                    nodegroup-name=delivery-network-consumers-spot
                    project=mobile-delivery-network-consumers
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-08f8f5e78bc941095"}
                    node.alpha.kubernetes.io/ttl: 15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 08 Aug 2022 20:24:27 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-206-7-199.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 08 Aug 2022 20:41:06 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:36 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.206.7.199
  ExternalIP:   54.197.31.63
  Hostname:     ip-10-206-7-199.ec2.internal
  InternalDNS:  ip-10-206-7-199.ec2.internal
  InternalDNS:  ip-10-206-7-199.ec2ssa.info
  ExternalDNS:  ec2-54-197-31-63.compute-1.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      65047656Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7910m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      64030824Ki
  pods:                        58
System Info:
  Machine ID:                 ec2d577aa3820e3e2f33858018d3bd99
  System UUID:                ec2d577a-a382-0e3e-2f33-858018d3bd99
  Boot ID:                    8ce73586-0ff6-4c6b-bbef-01172b11230c
  Kernel Version:             5.4.204-113.362.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.13
  Kubelet Version:            v1.20.15-eks-99076b2
  Kube-Proxy Version:         v1.20.15-eks-99076b2
ProviderID:                   aws:///us-east-1b/i-08f8f5e78bc941095
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                   ----                                                    ------------  ----------   ---------------  -------------  ---
  delivery-apps               kafka-backup-mrmjz                                      500m (6%)     1 (12%)      600Mi (0%)       1Gi (1%)       16m
  delivery-apps               taskschd-consumer-78569c5c67-fcvqp                      2200m (27%)   3200m (40%)  3572Mi (5%)      3572Mi (5%)    12m
  istio-system                istio-cni-node-kwz56                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 aws-node-gwp4n                                          10m (0%)      0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 aws-node-termination-handler-7k5qb                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 ebs-csi-node-fqdvr                                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 kube-proxy-nqk9w                                        100m (1%)     0 (0%)       0 (0%)           0 (0%)         16m
  logging                     filebeat-8czsb                                          500m (6%)     2 (25%)      1Gi (1%)         1Gi (1%)       16m
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-qtlfj    50m (0%)      0 (0%)       100Mi (0%)       100Mi (0%)     16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         3360m (42%)  6200m (78%)
  memory                      5296Mi (8%)  5720Mi (9%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:
  Type     Reason                   Age                From        Message
  ----     ------                   ----               ----        -------
  Normal   Starting                 16m                kubelet     Starting kubelet.
  Warning  InvalidDiskCapacity      16m                kubelet     invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  16m                kubelet     Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientPID
  Normal   Starting                 15m                kube-proxy  Starting kube-proxy.
  Normal   NodeReady                15m                kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeReady

The pod taskschd-consumer-78569c5c67-fcvqp is the only application pod on that node; all the rest are daemonsets. It can be moved to ip-10-206-30-103.ec2.internal:

kubectl describe nodes ip-10-206-30-103.ec2.internal
Name:               ip-10-206-30-103.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c6i.12xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    intent=apps
                    karpenter.k8s.aws/instance-cpu=48
                    karpenter.k8s.aws/instance-family=c6i
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=98304
                    karpenter.k8s.aws/instance-pods=234
                    karpenter.k8s.aws/instance-size=12xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/provisioner-name=delivery-network-consumers-spot
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-206-30-103.ec2.internalec2ssa.info
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c6i.12xlarge
                    nodegroup-name=delivery-network-consumers-spot
                    project=mobile-delivery-network-consumers
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-04e505e8cc763973f"}
                    node.alpha.kubernetes.io/ttl: 15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 08 Aug 2022 13:26:01 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-206-30-103.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 08 Aug 2022 20:41:58 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  Ready            True    Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:27:10 +0300   KubeletReady                 kubelet is posting ready status
  MemoryPressure   False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
Addresses:
  InternalIP:   10.206.30.103
  ExternalIP:   54.226.6.184
  Hostname:     ip-10-206-30-103.ec2.internal
  InternalDNS:  ip-10-206-30-103.ec2.internal
  InternalDNS:  ip-10-206-30-103.ec2ssa.info
  ExternalDNS:  ec2-54-226-6-184.compute-1.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         48
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      97323012Ki
  pods:                        234
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         47810m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      94323716Ki
  pods:                        234
System Info:
  Machine ID:                 ec237b6d4fea6dcf056164a0fb4aad15
  System UUID:                ec237b6d-4fea-6dcf-0561-64a0fb4aad15
  Boot ID:                    7136840a-db82-4574-8313-2b39acce9907
  Kernel Version:             5.4.204-113.362.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.13
  Kubelet Version:            v1.20.15-eks-99076b2
  Kube-Proxy Version:         v1.20.15-eks-99076b2
ProviderID:                   aws:///us-east-1b/i-04e505e8cc763973f
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  delivery-apps               capping-consumer-6c77c7fbd8-z8j9w                       6 (12%)       10 (20%)    11Gi (12%)       11Gi (12%)     7h15m
  delivery-apps               device-install-consumer-7f78685654-k25f6                6 (12%)       8 (16%)     15860Mi (17%)    15860Mi (17%)  7h16m
  delivery-apps               kafka-backup-s6l46                                      500m (1%)     1 (2%)      600Mi (0%)       1Gi (1%)       7h15m
  delivery-apps               track-ad-consumer-688c575f45-2gvbn                      7 (14%)       10 (20%)    56820Mi (61%)    56820Mi (61%)  7h15m
  istio-system                istio-cni-node-g9m2x                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 aws-node-8fbx2                                          10m (0%)      0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 aws-node-termination-handler-qspnn                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 ebs-csi-node-cjnrf                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 kube-proxy-bmnxm                                        100m (0%)     0 (0%)      0 (0%)           0 (0%)         7h15m
  logging                     filebeat-2nq4s                                          500m (1%)     2 (4%)      1Gi (1%)         1Gi (1%)       3h10m
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-rlx8c    50m (0%)      0 (0%)      100Mi (0%)       100Mi (0%)     7h15m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         20160m (42%)   31 (64%)
  memory                      85668Mi (93%)  86092Mi (93%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:                       <none>

tzneal commented 2 years ago

What does the spec.affinity look like for taskschd-consumer-78569c5c67-fcvqp?

liorfranko commented 2 years ago

I think it's related to several PDBs that were configured with minAvailable: 100%. Let me change it and I'll get back to you.
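
For context, a PDB like the following (hypothetical name and selector) permits zero voluntary disruptions, so consolidation cannot evict any of the pods it selects:

apiVersion: policy/v1          # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: consumer-pdb           # hypothetical
spec:
  minAvailable: 100%           # every replica must stay up => no evictions allowed
  selector:
    matchLabels:
      app: consumer            # hypothetical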

liorfranko commented 2 years ago

It had almost no effect; the total number of cores decreased by 20 and the memory by 200GB.

I think that the problem is related to the chosen instances. I see many c6i.12xlarge nodes where the CPU allocation is half full, but the memory is fully utilized:

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         48
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      97323012Ki
  pods:                        234
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         22160m (46%)   34 (71%)
  memory                      87716Mi (95%)  88140Mi (95%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

On the other hand, I see many r6a.8xlarge nodes where the CPU allocation is full, and the memory is only 35% utilized.

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         32
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      258598332Ki
  pods:                        234
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         29160m (91%)   40 (125%)
  memory                      88716Mi (35%)  89140Mi (35%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Both of the above can be replaced with an m5.8xlarge each.

tzneal commented 2 years ago

I looked at your provisioner and it looks like these are spot nodes. We currently don't replace spot nodes with smaller spot nodes. The reasoning for this is that we don't have a way of knowing if the node we would replace it with is as available or more available than the node that is being replaced. By restricting the instance size we could potentially be moving you from a node that you're likely to keep for a while to a node that will be terminated in a short period of time.

If these were on-demand nodes, then we would replace them with smaller instance types.

liorfranko commented 2 years ago

Thanks @tzneal

Do you know when that will be supported? Or could you at least let me choose to enable it?

And is the reason for choosing the c6i.12xl over the m5.8xl the probability of interruptions?

tzneal commented 2 years ago

For spot, we use the capacity-optimized-prioritized strategy when we launch the node. The strategies are documented here, but essentially it makes a trade-off of a slightly more expensive node in exchange for a lower chance of interruption.

Node size is also not always related to the cost, I just checked the spot pricing for us-east-1 at https://aws.amazon.com/ec2/spot/pricing/ and the r6a.8xlarge was actually cheaper than m5.8xlarge.

m5.8xlarge      $0.3664 per Hour
r6a.8xlarge     $0.3512 per Hour
c6i.12xlarge    $0.4811 per Hour
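
If you want to check current spot prices yourself, a query along these lines works with the AWS CLI (the instance types, region, and start time are placeholders):

aws ec2 describe-spot-price-history \
  --region us-east-1 \
  --instance-types m5.8xlarge r6a.8xlarge c6i.12xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time 2022-08-09T00:00:00Z \
  --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice]' \
  --output table
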
liorfranko commented 2 years ago

Thanks @tzneal for all the information.

So up until now, everything works well. I'll monitor it for a couple more days and let you know. Do you have an estimate of when the current commit will be released?