aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Support m -> n consolidation actions #5944

Open michaelswierszcz opened 6 months ago

michaelswierszcz commented 6 months ago

Description

Observed Behavior: Karpenter does not seem to make very intelligent choices about which instance types to provision. It appears to simply select the cheapest option without understanding cluster-wide imbalances. In my case, Karpenter selects many c7i.24xlarge, occasionally choosing c7i.48xlarge. It shows no interest in the r7i's and m7i's, presumably because they are more expensive. As my cluster scales, the imbalance between cluster CPU utilization (~70%) and memory utilization (95%) grows larger. This means I'm wasting almost 30% of my cluster's CPU, which is actually the more expensive component of an EC2 instance compared to its memory.

Expected Behavior:

Karpenter scales intelligently, using all available instance families to maintain a healthy CPU-to-memory utilization ratio.

For example, it might realize it should scale 3 c7i's down to 2 r7i's, saving money.

Reproduction Steps (Please include YAML):

```yaml
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-48s
spec:
  providerRef:
    name: eks129
  consolidation:
    enabled: true
  weight: 9
  labels:
    provisioner: m7is
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["c7i", "r7i", "m7i",]
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: ["24xlarge", "48xlarge",]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1f",]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
  kubeletConfiguration:
    maxPods: 250
  limits:
    resources:
      cpu: "17280"

Versions:

tzneal commented 6 months ago

For the example given, I checked, and the on-demand price for 3x c7i.24xlarge is cheaper than 2x r7i.24xlarge. Can you provide some more specifics on the resources requested (e.g., the output of kubectl describe node <node-name>)?

michaelswierszcz commented 6 months ago

I'm seeing:

3x $4.284/hour (c7i.24xlarge) = $12.852/hour -> https://instances.vantage.sh/aws/ec2/c7i.24xlarge
2x $6.350/hour (r7i.24xlarge) = $12.700/hour -> https://instances.vantage.sh/aws/ec2/r7i.24xlarge
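
A quick sketch of that arithmetic (the per-hour prices are taken from the vantage.sh links above):

```go
package main

import "fmt"

func main() {
	// On-demand prices from the vantage.sh links above.
	const c7iPrice = 4.284 // $/hour for c7i.24xlarge
	const r7iPrice = 6.350 // $/hour for r7i.24xlarge
	fmt.Printf("3x c7i.24xlarge = $%.3f/hour\n", 3*c7iPrice) // $12.852/hour
	fmt.Printf("2x r7i.24xlarge = $%.3f/hour\n", 2*r7iPrice) // $12.700/hour
}
```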

That wasn't the best example because the numbers end up very close, but the main point is that Karpenter doesn't seem to have a cluster-wide perspective on instance-family imbalances.

With my provisioner YAML, I end up with 50+ c7i's, 0 r7i's, and 0 m7i's. My CPU utilization is around 68% and my memory utilization is 97%. I expect Karpenter to recognize this lack of balance and instead achieve cluster utilization of around 90% CPU and 90% memory.

As a workaround, I have to add a new provisioner with a higher weight (10), a small spec.limits.resources.cpu, and only one instance family (r7i or m7i). This workaround is manual and requires constant tuning, since I have to guess how many m7i's or r7i's I need to achieve the 90% CPU / 90% memory balance.

Without the workaround, my bill is significantly higher than it needs to be :(

tzneal commented 6 months ago

I see your numbers now; not sure what I was looking at previously.

> With my provisioner YAML, I end up with 50+ c7i's, 0 r7i's, and 0 m7i's. My CPU utilization is around 68% and my memory utilization is 97%. I expect Karpenter to recognize this lack of balance and instead achieve cluster utilization of around 90% CPU and 90% memory.

Karpenter is optimizing for cluster cost, not trying to maximize the percentage of allocated CPU & memory in the cluster. To do this, when it consolidates it will either:

a) replace 1-N nodes with a single node, if that node is cheaper and all workloads will still run on the existing nodes plus the replacement, or
b) delete 1-N nodes, if all workloads will still run on the remaining nodes.

Bin packing is an NP-hard problem, and there are certainly cases where Karpenter will miss ways to potentially reduce cost (e.g., replacing 3 nodes with 2).
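
To make the two actions concrete, here is a minimal sketch of that decision in Go. The types and helpers (Node, consolidate, podsFitElsewhere) are hypothetical stand-ins, not Karpenter's actual API; the real logic lives in kubernetes-sigs/karpenter's disruption package.

```go
package main

import "fmt"

// Node is a simplified stand-in for a consolidation candidate.
type Node struct {
	Name        string
	HourlyPrice float64
}

// consolidate sketches the two single-node actions described above: delete
// the candidates if their pods fit on the remaining nodes (action b), or
// replace them with one strictly cheaper node (action a). Neither branch
// can express an m -> n replacement such as 3 -> 2.
func consolidate(candidates []Node, podsFitElsewhere bool, replacement *Node) string {
	if podsFitElsewhere {
		return "delete candidates" // action (b)
	}
	var currentCost float64
	for _, n := range candidates {
		currentCost += n.HourlyPrice
	}
	if replacement != nil && replacement.HourlyPrice < currentCost {
		return "replace candidates with one " + replacement.Name // action (a)
	}
	return "do nothing"
}

func main() {
	threeC7i := []Node{{"c7i-1", 4.284}, {"c7i-2", 4.284}, {"c7i-3", 4.284}}
	// No single replacement beats $12.852/hour here, so the 3 -> 2 saving
	// (2x r7i.24xlarge at $12.70/hour) is out of reach for this algorithm.
	fmt.Println(consolidate(threeC7i, false, nil)) // "do nothing"
}
```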

Do you see instances in your cluster where Karpenter could reduce cost by replacing 1-N nodes with a single node, or deleting a node?

michaelswierszcz commented 6 months ago

Ah, so I see.

Karpenter can only handle 1-N -> 0-1 consolidations.

It cannot currently handle m -> n reductions such as 3 -> 2 (or, more generally, N -> N-1).

I think the latter would solve my CPU vs. memory utilization problem, which would in turn bring down the costs in my cluster significantly. I think this is a pretty important feature if AWS/Karpenter is serious about helping customers reduce waste and keep costs down. I could probably save 30% on my EC2 bill if Karpenter could handle at least the 3 -> 2 node reduction case.

And for your final question, no. Karpenter is doing a good job consolidating 1-N to 0-1 nodes. 👍

dlmather commented 6 months ago

Quick follow-up question that's been a bit unclear to me: in the consideration for "replace 1-N nodes with a single node", does that kick in if the replacement costs exactly the same in terms of stated price? E.g., will Karpenter try to replace two m5.4xlarge with a single m5.8xlarge if it is otherwise able to provision both? In our case, since we don't right-size our daemonsets, 8xlarges are more efficient for us; in practice, though, we seem to see Karpenter not "combining up" to larger nodes, even though fewer nodes is better for us.

EDIT: Based on this pointer a teammate shared with me, https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/helpers.go#L154C1-L155C1, it appears Karpenter may not try to pick bigger nodes just to consolidate down the number of instances.
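
That would be consistent with replacement candidates being filtered on price. Below is a minimal sketch of such a filter, under the assumption that only strictly cheaper instance types survive as candidates; it is a simplified illustration, not the actual code behind that link.

```go
package main

import "fmt"

// InstanceOption is a hypothetical, simplified candidate instance type.
type InstanceOption struct {
	Name  string
	Price float64 // on-demand $/hour
}

// filterByPrice keeps only instance types strictly cheaper than the cost of
// the nodes being consolidated. Under this rule an equal-price replacement
// is discarded, so two m5.4xlarge are never combined into one m5.8xlarge.
func filterByPrice(options []InstanceOption, maxPrice float64) []InstanceOption {
	var kept []InstanceOption
	for _, o := range options {
		if o.Price < maxPrice {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	currentCost := 2 * 0.768 // two m5.4xlarge at ~$0.768/hour on-demand
	options := []InstanceOption{{Name: "m5.8xlarge", Price: 1.536}}
	fmt.Println(filterByPrice(options, currentCost)) // [] — same price, filtered out
}
```

If that is indeed the behavior, same-price "combine-ups" would need some tie-breaking rule (e.g., preferring fewer nodes at equal cost) to ever happen.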

michaelswierszcz commented 6 months ago

I haven't seen that behavior either, @dlmather. I agree that it would be beneficial.