aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Discover Instance Type Capacity Memory Overhead Instead of `vmMemoryOverheadPercent` #5161

Closed. jonathan-innis closed this issue 1 week ago

jonathan-innis commented 11 months ago

Description

Tell us about your request

We could consider a few options to discover the expected capacity overhead for a given instance type:

  1. We could store the instance type capacity in memory once a node of that type has been launched, and use that observed value for subsequent launches rather than basing our calculations on a heuristic.
  2. We could launch instance types, measure their capacity, and record the diff between the EC2-reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so we always have accurate measurements of instance type overhead (sketched below).
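
For illustration, a generated overhead table for option 2 could look roughly like the sketch below; the file name, instance types, and values are placeholders, not real measurements:

```go
// Hypothetical generated file (e.g. zz_generated_overhead.go), produced at
// release time by launching instances and diffing the EC2-reported memory
// against what kubelet actually reports.
package instancetype

// measuredOverheadMiB maps instance type to measured memory overhead in MiB.
// The values below are placeholders, not real measurements.
var measuredOverheadMiB = map[string]int64{
	"m5.large":   662,
	"m5.xlarge":  938,
	"c6g.medium": 362,
}

// OverheadMiB returns the measured overhead when known, otherwise an estimate
// from the vmMemoryOverheadPercent heuristic.
func OverheadMiB(instanceType string, ec2MemoryMiB int64, vmMemoryOverheadPercent float64) int64 {
	if v, ok := measuredOverheadMiB[instanceType]; ok {
		return v
	}
	return int64(float64(ec2MemoryMiB) * vmMemoryOverheadPercent)
}
```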

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet.

Are you currently working around this issue?

We currently use a heuristic `vmMemoryOverheadPercent` value that is tunable by users and passed through `karpenter-global-settings`.
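
For context, the heuristic roughly amounts to discounting the EC2-reported memory by a flat percentage; a minimal sketch, not Karpenter's actual source:

```go
package main

import "fmt"

// usableMemoryMiB discounts the EC2-reported memory by vmMemoryOverheadPercent,
// approximating what kubelet is expected to report as capacity.
func usableMemoryMiB(ec2MemoryMiB int64, vmMemoryOverheadPercent float64) int64 {
	return int64(float64(ec2MemoryMiB) * (1.0 - vmMemoryOverheadPercent))
}

func main() {
	// e.g. an instance reporting 8192 MiB with the default 7.5% heuristic
	fmt.Println(usableMemoryMiB(8192, 0.075)) // 7577
}
```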


jonathan-innis commented 11 months ago

This is a "transferred" version of the issue in kubernetes-sigs/karpenter: https://github.com/kubernetes-sigs/karpenter/issues/716

sosheskaz commented 6 months ago

Could a simple mitigation be to add a static setting, e.g. `vmMemoryOverheadOffset`, which would reserve a static amount of memory?

When running compute that is highly heterogeneous but relatively large, this would simplify managing the setting, because a sufficiently large static overhead could outweigh the percentage and allow for less waste.
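
One possible semantic, sketched below under the assumption of a hypothetical `vmMemoryOverheadOffset` setting (not an existing Karpenter option): take the larger of the percentage-based estimate and the fixed offset, so the percentage can be set low and the offset covers the worst case on large nodes.

```go
package overhead

// effectiveOverheadMiB returns the memory overhead to subtract for an instance
// type: the larger of the percentage-based estimate and a fixed offset.
// vmMemoryOverheadOffsetMiB is hypothetical; only vmMemoryOverheadPercent
// exists in Karpenter today.
func effectiveOverheadMiB(ec2MemoryMiB int64, vmMemoryOverheadPercent float64, vmMemoryOverheadOffsetMiB int64) int64 {
	byPercent := int64(float64(ec2MemoryMiB) * vmMemoryOverheadPercent)
	if vmMemoryOverheadOffsetMiB > byPercent {
		return vmMemoryOverheadOffsetMiB
	}
	return byPercent
}
```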

BEvgeniyS commented 4 months ago

The vmMemoryOverheadPercent value is clearly not one-size-fits-all, so making it per-nodepool would be a great start.

I prefer approach #1 because it doesn't tie Karpenter to cloud provider updates and can work even when new instance types appear before an update.

Here's how I imagine this could be implemented:

a) Convert the setting to something like "initvmMemoryOverheadPercent" and make it per-nodepool.
b) As soon as we discover or launch a single node for any given nodepool/instanceType pair, dynamically discover the node's allocatable and store it in memory, or optionally in some form of permanent storage; if Karpenter restarts, it can rediscover this information. All systemReserved/kubeReserved and Bottlerocket's control/admin containers subtract from the allocatable and should already be reflected in the allocatable exposed by kubelet. Karpenter is already able to calculate daemonset overhead, so that can be factored in.

The remaining issue is that if the configured value is too high, Karpenter won't be able to launch the node even though the capacity is there. Approach #2 could then be used as an initial hint for known types.

ADD: I just dug into cluster-autoscaler, wondering why it doesn't seem to have this issue. I did some testing: I tried to schedule a pod with a memory request equal to the total instance memory minus the DaemonSet overhead, and scale up from zero.

It turns out the issue is there: CA tries to launch an instance for each matching ASG, the pod stays pending, and CA then kills the instances.

So the issue was just hidden and much less likely to be encountered, because:

a) cluster-autoscaler is pretty smart in that it caches the "true allocatable" as soon as it sees the first node from an ASG for a nodegroup, and doesn't retry if the pod doesn't fit.
b) CA relies on every node in a nodegroup having the same resources, so once it sees one node it knows the configuration for the whole nodegroup (not something Karpenter can rely on, hence the need to discover and cache "nodepool/instanceType" pairs).
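
To illustrate the nodepool/instanceType discovery and fallback described above, a minimal sketch; the type and function names are hypothetical, not Karpenter's actual code:

```go
package overhead

import "sync"

// pairKey identifies a cached measurement per NodePool/instance-type pair.
type pairKey struct {
	NodePool     string
	InstanceType string
}

// allocatableCache remembers the kubelet-reported memory allocatable for each
// pair once a node has registered. Hypothetical type, for illustration only.
type allocatableCache struct {
	mu         sync.RWMutex
	discovered map[pairKey]int64 // memory allocatable in MiB
}

func newAllocatableCache() *allocatableCache {
	return &allocatableCache{discovered: map[pairKey]int64{}}
}

// onNodeRegistered records what kubelet actually reported, so later launches
// for the same pair use the real value instead of the heuristic.
func (c *allocatableCache) onNodeRegistered(k pairKey, kubeletAllocatableMiB int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.discovered[k] = kubeletAllocatableMiB
}

// memoryMiB returns the discovered allocatable for a pair if known, otherwise
// an estimate derived from the per-NodePool initvmMemoryOverheadPercent setting.
func (c *allocatableCache) memoryMiB(k pairKey, ec2MemoryMiB int64, initOverheadPercent float64) int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.discovered[k]; ok {
		return v
	}
	return int64(float64(ec2MemoryMiB) * (1.0 - initOverheadPercent))
}
```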

balraj-aqfer commented 2 months ago

@jonathan-innis Are there any updates on this issue?

youwalther65 commented 2 months ago

I agree that CA has an easier job, but even CA includes some delta to accommodate slight changes between instance types in the same or similar NodePool/ASG (something only visible in main.go and not yet described in the CA FAQ).

Keying on a combination of NodePool and instance type might be somewhat safe, because a NodePool is tied to an EC2NodeClass. But the underlying AMI version of a NodePool can change over time, and the AMI and/or custom user data can tweak kubeReserved and systemReserved in different ways, leading to new allocatable values. So we should think about re-evaluating the cached value.

Another point worth mentioning is how DaemonSet overhead is factored in: some DaemonSets apply to only a subset of nodes via taints/tolerations, so the overhead can differ between nodes.

BEvgeniyS commented 2 months ago

> So we should think about re-evaluating the cached value.

The cached value shouldn't be static; it can be updated during node registration, and the cache can be flushed if the nodepool hash changes. DaemonSet overhead is already factored in, so I don't think we need to change anything there.

jukie commented 2 months ago

I'd like to start an RFC on this. Does this sound like a reasonable start?

~If it hasn't been mentioned or discussed elsewhere, could a feature request be opened with the EC2 team and SDK maintainers to provide this information in the DescribeInstanceTypes API?~

youwalther65 commented 2 months ago

@jukie I generally agree with your proposal, but I have to disagree with the idea of providing this information in the EC2 DescribeInstanceTypes API. The overhead depends on many factors, such as AMI type and version, kernel version, and the modules used, so it doesn't make sense to put a single "reasonable" overhead value into the API. I would even go one step further and key the overhead cache not only on instance type but on a combination of at least instance type and AMI version.

jukie commented 2 months ago

Thanks for explaining that, I hadn't given it enough thought! A cache mapping instance type + AMI version to overhead makes sense.

jukie commented 1 month ago

Caching per-AMI is proving to be difficult as it would introduce some coupling. Specifically, at https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cloudprovider/cloudprovider.go#L286-L288 I'm not sure where to store, or how to compare, an instance type that maps to a particular AMI version.

I've opened #7004, which uses the nodeClass hash as the key so the latest version is always used. However, I don't think AMI changes trigger a hash change, so I'm wondering what would make the most sense to use instead.

jukie commented 1 month ago

@youwalther65 would it be better to use a cache key of instance type name + a hash of the AMI names in the NodeClass status?

That might be the safest route, but it would mean the first node launched after an AMI change falls back to `vmMemoryOverheadPercent`.

Edit: it would actually be every node launched until the controller requeues (12 hours), so I'll adjust again (done).
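
A minimal sketch of such a key, assuming the AMI names are read from the EC2NodeClass status and passed in as a plain slice; the helper is illustrative, not a proposal for exact types:

```go
package overhead

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// cacheKey combines the instance type with a digest of the AMI names currently
// resolved in the EC2NodeClass status, so a cached overhead entry is naturally
// invalidated when the AMIs change.
func cacheKey(instanceType string, amiNames []string) string {
	names := append([]string(nil), amiNames...) // copy before sorting
	sort.Strings(names)
	digest := sha256.Sum256([]byte(strings.Join(names, ",")))
	return fmt.Sprintf("%s/%x", instanceType, digest[:8])
}
```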

youwalther65 commented 1 month ago

@jonathan-innis What do you think about the idea from @jukie ?

jukie commented 1 month ago

Would it make sense to bring this up as a topic during the working group meeting?