Closed: jonathan-innis closed this issue 1 week ago
This is a "transferred" version of the issue in kubernetes-sigs/karpenter: https://github.com/kubernetes-sigs/karpenter/issues/716
Could a simple mitigation be to add a static setting, e.g. vmMemoryOverheadOffset, which would take a static amount of memory?
When running compute that is highly heterogeneous but relatively large, this would simplify managing the setting, because a sufficiently large static overhead would outweigh the percentage and allow for less waste.
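To make the waste argument concrete, here is a minimal sketch in Go. The vmMemoryOverheadOffset setting is hypothetical (it does not exist today), and 0.075 is assumed as the vmMemoryOverheadPercent value; the point is only that a flat percentage reserves far more on large instances than a static offset would.

```go
// Rough comparison (hypothetical vmMemoryOverheadOffset; 0.075 assumed as the
// vmMemoryOverheadPercent value): the percentage heuristic reserves far more
// memory on large instances than a static offset would.
package main

import "fmt"

func main() {
	const gib = int64(1) << 30
	vmMemoryOverheadPercent := 0.075
	vmMemoryOverheadOffset := 1 * gib // hypothetical static setting

	for _, memGiB := range []int64{8, 256} {
		capacity := memGiB * gib
		pctOverhead := int64(float64(capacity) * vmMemoryOverheadPercent)
		fmt.Printf("%4d GiB instance: percent overhead = %5d MiB, static offset = %5d MiB\n",
			memGiB, pctOverhead>>20, vmMemoryOverheadOffset>>20)
	}
}
```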
The vmMemoryOverheadPercent value is clearly not one-size-fits-all, so making it per-nodepool would be a great start.
I prefer approach #
Here's how I imagine this could be implemented (rough sketch below):
a) Convert it to something like "initvmMemoryOverheadPercent" and make it a per-nodepool setting.
b) As soon as we discover or launch a single node of any given nodepool/instanceType pair, the node allocatable should be dynamically discovered and stored in memory, or optionally in some form of permanent storage. If Karpenter restarts, it can rediscover this information. All systemReserved/kubeReserved and Bottlerocket's control/admin containers subtract from the allocatable and should already be reflected in the allocatable exposed by kubelet. Karpenter is already able to calculate DaemonSet overhead, so this can be factored in.
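For illustration only, a minimal in-memory sketch of (b). All names here are hypothetical, not part of Karpenter; a real implementation would hang off the instance-type provider and could optionally be backed by persistent storage.

```go
// Sketch of an in-memory cache of discovered allocatable memory keyed by
// nodepool/instance-type, falling back to a percentage-based estimate until a
// real node of that pair has been observed.
package overhead

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

type key struct {
	NodePool     string
	InstanceType string
}

type AllocatableCache struct {
	mu    sync.RWMutex
	items map[key]resource.Quantity
}

func NewAllocatableCache() *AllocatableCache {
	return &AllocatableCache{items: map[key]resource.Quantity{}}
}

// Record stores the allocatable memory reported by kubelet for a node we just
// discovered or launched.
func (c *AllocatableCache) Record(nodePool, instanceType string, allocatable resource.Quantity) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key{nodePool, instanceType}] = allocatable
}

// Memory returns the cached allocatable if this pair has been seen before,
// otherwise an estimate derived from the per-nodepool overhead percentage.
func (c *AllocatableCache) Memory(nodePool, instanceType string, ec2Memory resource.Quantity, overheadPercent float64) resource.Quantity {
	c.mu.RLock()
	cached, ok := c.items[key{nodePool, instanceType}]
	c.mu.RUnlock()
	if ok {
		return cached
	}
	estimate := float64(ec2Memory.Value()) * (1 - overheadPercent)
	return *resource.NewQuantity(int64(estimate), resource.BinarySI)
}
```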
The remaining issue is that if the value is too high, Karpenter won't be able to launch the node even though the capacity is there. #
ADD:
Just dug into cluster-autoscaler, wondering: how come there isn't such an issue there?
I've done some testing: I tried to schedule a pod with a memory request of total instance memory minus DS overhead, and scale from 0.
Turns out, the issue is there: CA tries to launch an instance for each matching ASG, the pod is still pending, and CA then kills the instance.
So the issue was just hidden and much less likely to be encountered, because:
a) cluster-autoscaler is pretty smart in that it caches the "true allocatable" as soon as it sees the first node from an ASG for the nodegroup, and doesn't retry if the pod doesn't fit.
b) CA relies on the resources being the same on every node in a nodegroup, so if it sees one node, it knows the config for the whole nodegroup (not something Karpenter can rely on, hence the need to discover and cache "nodepool/instanceType" pairs).
@jonathan-innis Any updates on this issue?
Agreed that CA has an easier job, but even CA includes some delta to accommodate slight differences between instance types in the same or similar NodePool/ASG etc. (something only visible in main.go and not yet described in the CA FAQ). Keying on a combination of NodePool and instance type might be somewhat safe, because a NodePool is tied to an EC2NodeClass. But the underlying AMI version of a NodePool might change over time, and the AMI and/or custom user data can tweak kubeReserved and systemReserved in different ways, leading to new allocatable values. So we should think about re-evaluating the cached value. Another point worth mentioning is how DaemonSet overhead is factored in: some DaemonSets apply only to a subset of nodes via taints/tolerations, so the overhead can differ per node.
The cached value shouldn't be static; it could get updated during node registration. The cache can be flushed if the nodepool hash changes. DS overhead is already factored in, so I don't think we need to change anything there.
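A standalone sketch of that idea follows. The names are hypothetical; the hash is assumed to come from the NodePool hash annotation that Karpenter already maintains for drift detection.

```go
// Sketch: allocatable values are recorded at node registration time, and all
// entries for a NodePool are flushed when its hash changes.
package overhead

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

type RegistrationCache struct {
	mu     sync.Mutex
	hashes map[string]string                       // nodePool -> last observed hash
	items  map[string]map[string]resource.Quantity // nodePool -> instanceType -> allocatable
}

func NewRegistrationCache() *RegistrationCache {
	return &RegistrationCache{
		hashes: map[string]string{},
		items:  map[string]map[string]resource.Quantity{},
	}
}

// OnNodeRegistered is a hypothetical hook called once a node has registered
// and reported its allocatable. A changed NodePool hash invalidates whatever
// was cached for that NodePool before the new value is stored.
func (c *RegistrationCache) OnNodeRegistered(nodePool, nodePoolHash, instanceType string, allocatable resource.Quantity) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.hashes[nodePool] != nodePoolHash {
		delete(c.items, nodePool) // NodePool changed (or first sighting): flush stale entries
		c.hashes[nodePool] = nodePoolHash
	}
	if c.items[nodePool] == nil {
		c.items[nodePool] = map[string]resource.Quantity{}
	}
	c.items[nodePool][instanceType] = allocatable
}
```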
I'd like to start an RFC on this. Does this sound like a reasonable start?
Keep vmMemoryOverheadPercent as a fallback, and cache the discovered overhead per instance type instead of relying on vmMemoryOverheadPercent calculations. ~If it hasn't been mentioned or discussed elsewhere, could a feature request be opened with the EC2 team and SDK maintainers to provide this information in the DescribeInstanceTypes API?~
@jukie I generally agree with your proposal, but I have to disagree with the idea of providing this information in the EC2 DescribeInstanceTypes API. The overhead is based on many factors like AMI type and version, kernel version and modules used, and many others, so it does not make sense to put a single "reasonable overhead" value into the API. I would even go one step further and implement the overhead cache not only based on instance type, but rather on a combination of at least instance type and AMI version.
Thanks for explaining that, I hadn't given it enough thought! What you've mentioned for a cache mapping of instance-type + ami version to overhead makes sense.
Caching per-AMI is proving to be difficult, as it would introduce some coupling. Specifically here: https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cloudprovider/cloudprovider.go#L286-L288. I'm not sure where to store and/or how to compare an instance type that maps to a particular AMI version.
I've opened #7004, which would use the nodeClass hash as a key, and the latest version would always be used. However, I don't think AMI changes trigger a hash change, so I'm wondering what makes the most sense to use instead.
@youwalther65 would it be better to use a cache key with instance type name + hash of the AMI names in NodeClass status?
That might be the safest route, but it would mean the first node that gets launched after an AMI change would fall back to vmMemoryOverheadPercent.
Edit: It would actually be every node until the controller requeues (12 hrs), so I will adjust again (done).
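Purely as an illustration of that key shape, a small sketch; the helper name is made up, and the AMI names are assumed to be available as plain strings from the EC2NodeClass status.

```go
// Sketch of a cache key built from the instance type name plus a hash of the
// AMI names resolved in the EC2NodeClass status, so that a change in resolved
// AMIs produces a new key and forces re-discovery.
package overhead

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

func overheadCacheKey(instanceType string, amiNames []string) string {
	names := append([]string(nil), amiNames...)
	sort.Strings(names) // make the hash independent of ordering in the status
	sum := sha256.Sum256([]byte(strings.Join(names, ",")))
	return instanceType + "/" + hex.EncodeToString(sum[:8])
}
```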
@jonathan-innis What do you think about the idea from @jukie?
Would it make sense to bring this up as a topic during the working group meeting?
Description
Tell us about your request
We could consider a few options to discover the expected capacity overhead for a given instance type:
- Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet (a rough sketch of this is included below).

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Are you currently working around this issue?
Using a heuristic vmMemoryOverheadPercent value right now that is tunable by users and passed through karpenter-global-settings.
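A rough sketch of the two quantities involved (helper names are made up): the measured overhead is the difference between the EC2-reported memory and what kubelet reports, while the current workaround estimates it as a flat percentage.

```go
// Sketch (hypothetical helpers): measured overhead vs. the percentage
// heuristic currently used via vmMemoryOverheadPercent.
package overhead

import "k8s.io/apimachinery/pkg/api/resource"

// measuredOverhead subtracts the kubelet-reported capacity from the
// EC2-reported memory for the instance type.
func measuredOverhead(ec2Memory, kubeletCapacity resource.Quantity) resource.Quantity {
	diff := ec2Memory.DeepCopy()
	diff.Sub(kubeletCapacity)
	return diff
}

// heuristicOverhead mirrors the vmMemoryOverheadPercent workaround: a fixed
// fraction of the EC2-reported memory.
func heuristicOverhead(ec2Memory resource.Quantity, overheadPercent float64) resource.Quantity {
	return *resource.NewQuantity(int64(float64(ec2Memory.Value())*overheadPercent), resource.BinarySI)
}
```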