ellistarn opened this issue 3 years ago
In the short term, we are planning to add these new instances to our existing list so the correct max_pods is determined by default. In the longer term, this is something that we and the EKS team have been trying to sort out.
- The latest CNI (with IP Prefix Delegation) supports significantly increased max pods per node.
This definitely helps and is something we support, but because the node cannot determine CNI type/version, we haven't been able to figure this out dynamically. Instead, when using this feature, customers will need to specify the max_pods manually and can determine the right number via the helper script EKS maintains.
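For reference, the helper script lives in the awslabs/amazon-eks-ami repository; a sketch of how it can be run (the path and flag names may differ between releases, and it needs AWS CLI credentials to describe instance types):
# Fetch the max-pods helper script (path current at time of writing).
curl -sO https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh

# Print the recommended max pods for an instance type with
# prefix delegation enabled.
./max-pods-calculator.sh \
  --instance-type m5.large \
  --cni-version 1.12.0-eksbuild.1 \
  --cni-prefix-delegation-enabled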
- Overlay networks like Calico and Cilium support significantly increased max_pods per node.
They do, but again, we don't have a way to figure that out on the node when determining max_pods, so we're left with needing customers to tell us what the right value is.
- Moving forward, I expect that the MaxPods limit will become bound by kubelet cpu/memory overhead instead of AWS VPC limitations.
You may be right here. Once we know we're no longer limited by ENI attachment, changing this behavior to key off of core count or memory size could make sense, but I'm wary of the effect it may have on existing customers and their expectations.
I'd love to see an option in Bottlerocket to assume that the CNI is not the limiting factor for max pods, and instead compute max pods based off of cpu/memory of the instance (e.g. max-pods-per-core: 10), which can cheaply be discovered at boot
This new setting is an interesting option; however, it seems to me that it would need to be calculated by the customer based on their choice of CNI plug-in, delegated prefix size, and instance type choice. That leaves us in a similar state to where we are today, but with a more abstracted setting to configure max_pods.
I'd like to keep this issue open and updated while we continue to explore our options for making this work more cleanly for our customers.
The main challenge of specifying max_pods is that it must be done in a launch template's user data. In Karpenter, we launch many different instance types, which means we'd be forced to create an equal number of launch templates. Using a relative value rather than an absolute value (e.g. max_pods -> max_pods_per_core) allows us to reason generally using these launch templates and reduces the number of templates we need to create.
What are your thoughts about building this as an option? You could even support multiple options for max pods like you do today, but just add another option.
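To make the contrast concrete, a hypothetical sketch of the two user data shapes (pods-per-core is the setting being proposed here, not an existing one, and the max-pods value is illustrative):
# Today: the absolute value is only correct for one instance type,
# so Karpenter would need a launch template per type.
cat > user-data-m5-large.toml <<'EOF'
[settings.kubernetes]
"max-pods" = 29
EOF

# Proposed: a relative value lets a single template cover all types.
cat > user-data-any-type.toml <<'EOF'
[settings.kubernetes]
"pods-per-core" = 10
EOF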
@ellistarn how come you want to use more than 110 pods on a node, when the official K8s guidance is a maximum of 110 pods per node?
I was going to open a new issue to suggest that the default logic should be to use 110, with a pattern for running the AWS VPC CNI in legacy non-prefix (IPv4 or IPv6) mode to set the limit for instances which can't support 110 pods, but thought it'd be best to continue the discussion here.
We also need to talk about setting the kube-reserved values; these should be directly linked to the node resources, so they need to be dynamically generated. The GKE docs are the best reference for this; the AL2 AMI already uses this for its CPU calculation.
I also opened https://github.com/awslabs/amazon-eks-ami/issues/782 for the AL2 AMI but haven't had any responses yet.
I'm open minded to this idea. @RiskyAdventure has been discussing with me the perf impact of context switching between so many processes, at least when under load. Historically, k8s-on-aws has offered much higher densities, so this would be backwards incompatible. We'd need to figure out a glide path, and probably want the same solution for EKSOptimizedAMI and Bottlerocket.
I agree, this should be a consistent default across all of the EKS ecosystem which should align with K8s best practice (the current AL2 EKS defaults will cause actual issues in relatively common scenarios).
Following up on this thread, as I'm picking up work for Karpenter to enable a podsPerCore feature, which is currently something that is supported directly by the kubelet. Any reason that Bottlerocket has chosen not to implement this feature and has instead opted to just expose the maxPods value for Kubernetes configuration?
Any reason that Bottlerocket has chosen not to implement this feature and has instead opted to just expose the maxPods value for Kubernetes configuration?
There's no specific reason; it's only that the implementation hasn't been picked up, not that there's been an intentional choice not to go the pods-per-core direction. The max-pods setting has been around since close to the beginning, and the kube-reserved settings are layered on top, so it's a bit of a tangle.
It sounds like the work here is:
- Add settings.kubernetes.pods-per-core so this can be specified in user data
- Take pods-per-core into account when calculating kube-reserved defaults

I'm not clear on what should be done with the max-pods default. If it's left unset then kubelet will just default to 110, potentially causing pods-per-core to be ignored, which seems like an unpleasant surprise.
An updated approach for calculating max-pods might be:
- If max-pods is specified, use that value
- If pods-per-core is specified, then max-pods = pods-per-core * num-cores
- If pods-per-core is unspecified, then max-pods uses the ENI-related defaults on aws-k8s-*

That would be straightforward and compatible with existing deployments, unless there's a strong case to be made for a pods-per-core default value.
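A minimal sketch of that precedence, with illustrative variable names standing in for the parsed settings:
# Resolve max-pods: explicit value, then pods-per-core * cores,
# then the existing ENI-derived default.
compute_max_pods() {
  local max_pods="$1" pods_per_core="$2" num_cores="$3" eni_default="$4"
  if [ -n "$max_pods" ]; then
    echo "$max_pods"
  elif [ -n "$pods_per_core" ]; then
    echo $(( pods_per_core * num_cores ))
  else
    echo "$eni_default"
  fi
}

compute_max_pods "" 10 "$(nproc)" 58   # unset max-pods -> pods-per-core path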
@ellistarn, @stevehipwell - what do you think?
@bcressey I think it's worth splitting out the actual behaviour requirements and then working back up from there to get the settings spec; I would list these as max-pods, kube-reserved memory and kube-reserved CPU. I think @bwagner5 might have some thoughts on this and has been testing some of the options.
The basic requirements for max pods are as follows; I would argue that a default of 110 would be best based on the Considerations for large clusters documentation.
For the implementation, pods per core and ENI modes could use the number as a constraint if it was set, meaning that my suggested default of 110 would be implicit so as to not interfere with any explicit input.
For kube-reserved memory, the legacy ENI behaviour is based on the ENI IP logic and is a function of pods rather than a function of system memory as found in other implementations; this wasn't updated for ENI prefix pod counts for AL2, as it appears to have been here. The GKE implementation (also used by AKS amongst others) would be my preferred logic, and this is what I manually specify when running EKS nodes. Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.
For kube-reserved CPU, everyone already uses the GKE logic, so it's also the legacy ENI logic. As above, Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.
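For concreteness, a sketch of the GKE reservation curves referenced above, per Google's published docs (integer MiB/millicore arithmetic, so treat results as approximate):
# Memory: 255 MiB + 25% of first 4 GiB + 20% of next 4 GiB
# + 10% of next 8 GiB + 6% of next 112 GiB + 2% above 128 GiB.
gke_reserved_memory_mib() {
  local mem=$1 r=255 step
  step=$(( mem < 4096 ? mem : 4096 ));     r=$(( r + step * 25 / 100 )); mem=$(( mem - step ))
  step=$(( mem < 4096 ? mem : 4096 ));     r=$(( r + step * 20 / 100 )); mem=$(( mem - step ))
  step=$(( mem < 8192 ? mem : 8192 ));     r=$(( r + step * 10 / 100 )); mem=$(( mem - step ))
  step=$(( mem < 114688 ? mem : 114688 )); r=$(( r + step * 6 / 100 ));  mem=$(( mem - step ))
  r=$(( r + mem * 2 / 100 ))
  echo "$r"
}

# CPU: 6% of first core, 1% of second, 0.5% of cores 3-4, 0.25% above 4.
gke_reserved_cpu_millicores() {
  local cores=$1 r=0
  (( cores >= 1 )) && r=$(( r + 60 ))
  (( cores >= 2 )) && r=$(( r + 10 ))
  (( cores >= 3 )) && r=$(( r + 5 * ( (cores < 4 ? cores : 4) - 2 ) ))
  (( cores >  4 )) && r=$(( r + (cores - 4) * 25 / 10 ))
  echo "$r"
}

gke_reserved_memory_mib 16384      # 16 GiB node -> 2917 MiB
gke_reserved_cpu_millicores 4      # 4 vCPUs -> 80m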
I think the following inputs and values would cover all of the use cases above.
- settings.kubernetes.max-pods-mode
  - default - Not AWS ENI
  - aws-eni - Use the current Pluto implementation; could be the default for AWS if nothing else is set for max pods
- settings.kubernetes.max-pods - 110 if not set and in default mode; if set in AWS ENI mode it should be the upper bound
- settings.kubernetes.pods-per-core - constrained by settings.kubernetes.max-pods if it is set
- settings.kubernetes.kube-reserved-memory-calc
  - max-pods - Current ENI behaviour
  - system-memory - GKE logic
- settings.kubernetes.kube-reserved-cpu-calc
  - system-cpu - Current behaviour

My end-user requirements for this would be to be able to set my nodes to have a max-pods of 110 and to have kube-reserved set according to the GKE logic from a single configuration for any size node.
[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"
[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi"
I think in general this all makes sense based on the use cases. In terms of the API surface, is there a reason that kube-reserved-memory-calc and kube-reserved-cpu-calc are sitting inside of settings.kubernetes rather than settings.kubernetes.kube-reserved, since it is scoped to this anyway? Otherwise I think your overall logic for max-pods-mode, max-pods and pods-per-core all makes sense based on the requirements.
[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi"
memory-calc-mode = "system-memory"
cpu-calc-mode = "system-cpu"
@jonathan-innis the API design was to illustrate my points with the expectation that the actual design would be different. I agree that if possible nesting the values would be better.
Sure, with that in mind, what you put sounds reasonable to me
Missed this thread, but this all sounds pretty reasonable to me as well. There is one more case for the VPC CNI "Custom Networking" configured via the ENIConfig CRD, where you want nodes to have a different subnet/SG than pods, which alters the ENI-limited calcs since you lose a usable ENI for pods.
I'd also hope that over time the ENI-limited max pods definition would become less relevant and the maximum amount of pods a node can handle would be based on resource limits, in which case 110 is probably a reasonable upper bound.
I have done some preliminary data gathering on the overhead difference for kube-reserved memory that is not based on ENI-limited max-pods here. I haven't run tests on these yet and I think we'd need to do quite a bit of testing to make sure this wouldn't surprise users (or allow an escape hatch).
I think we can definitely decouple these two features:
- Adding podsPerCore to settings.kubernetes
- Changing the kube-reserved calculation
In the case of podsPerCore, I would see this as the higher-priority (and easier) item to implement. This brings parity with kubelet by surfacing more configuration parameters that are already available for users who have manually configured their AMIs to handle themselves.
On top of that, I don't see a reason to add a max-pods-mode. Primarily because we can assume that the user is using ENI-limited max pods unless they have explicitly set the max-pods value to be 110 or some other value.
It feels like we can punt both the GKE calculation decision and the max-pods-mode decision in favor of just adding podsPerCore for now and just calculating the kube-reserved based off of pods, which would be
pods = min(max-pods, pods-per-core * vCPUs)
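A quick worked example of that bound for a hypothetical 8-vCPU node:
max_pods=110; pods_per_core=10; vcpus=8
pods=$(( pods_per_core * vcpus < max_pods ? pods_per_core * vcpus : max_pods ))
echo "$pods"   # 80: the per-core limit binds before the 110 ceiling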
@jonathan-innis, do you still think https://kubernetes.io/docs/concepts/containers/runtime-class/#pod-overhead may be a fruitful route?
I think from a kube-reserved perspective, pod-overhead may make sense through the use of RuntimeClass, but it does require additional configuration on the user side, including an additional Kyverno policy or mutating webhook for injecting the runtimeClassName into sets of pods.
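For anyone following along, a minimal RuntimeClass carrying Pod Overhead looks roughly like this (the name is hypothetical, the overhead values are illustrative, and the handler must match one configured in the node's container runtime):
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: overhead-example    # hypothetical name
handler: runc               # must exist in the runtime's config
overhead:
  podFixed:
    memory: "120Mi"         # per-pod accounting added at scheduling time
    cpu: "50m"
EOF
Pods then opt in via spec.runtimeClassName, which is the field the Kyverno policy or mutating webhook mentioned above would inject.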
The other consideration is that AL2 and Ubuntu still will not support the GKE calculation for kube-reserved, so it may make sense to migrate or surface additional configuration in Bottlerocket when this support exists over there or there is enough customer ask for it.
I don't want to lose the focus of the discussion, though, which is arguing that if there is configuration that is surfaced in Kubelet Configuration, Bottlerocket should also support it to make the experience as seamless as possible for users.
@jonathan-innis kube-reserved should be decoupled from max pods by default; it's only the AWS implementation which couples them together (incorrectly, in my opinion), so implementing the GKE kube-reserved calculation makes this a viable option.
On top of that, I don't see a reason to add a max-pods-mode. Primarily because we can assume that the user is using ENI-limited max pods unless they have explicitly set the max-pods value to be 110 or some other value.
I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.
It feels like we can punt both the GKE calculation decision and the max-pods-mode decision in favor of just adding podsPerCore for now and just calculating the kube-reserved based off of pods which would be
I don't think these two features are linked in this way; they have interactions but aren't mutually exclusive, as I described in https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-1209254902. I see podsPerCore as constrained by max pods if both are set, and if GKE mode is in use it has no impact on kube reserved.
I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.
I don't think this is a reason to have a max-pods-mode; this is just another thing that should be taken into consideration when discovering the max-pods on the cluster.
I don't think these two features are linked in this way
Can you explain how they aren't linked in this way? We are currently doing the calculation based on pods, so we should continue to do the calculation based on the min of max-pods and pods-per-core as described in this equation
pods = min(max-pods, pods-per-core * vCPUs)
I see podsPerCore as constrained by max pods if both are set, and if GKE mode is in use it has no impact on kube reserved.
Completely agree with this
I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.
I don't think this is a reason to have a max-pods-mode; this is just another thing that should be taken into consideration when discovering the max-pods on the cluster.
@jonathan-innis I'm pretty sure I've heard this reason given for why the way AL2 currently works cannot be changed; if you can automate the discovery of the VPC CNI and its specific mode (legacy, custom networking, IPv4 prefix, or IPv6) then I agree that this isn't needed.
I don't think these two features are linked in this way
Can you explain how they aren't linked in this way? We are currently doing the calculation based on pods, so we should continue to do the calculation based on the min of max-pods and pods-per-core as described in this equation
Because the GKE calculation doesn't use max pods in its algorithm, adding the capability to set a max pods per core isn't related to using the GKE calculation. So what I'm saying is that implementing podsPerCore doesn't address the use case for adding the GKE algorithm either way.
In my opinion the GKE algorithm should be the highest priority here, as it removes the requirement to couple max pods to kube reserved, which means when support for podsPerCore is added it can be tested in a way that only impacts one configuration component. Also, unless you know differently, the upstream kubelet configuration would generally assume the GKE algorithm (or something similar) was being used for kube reserved.
I am biased as it is the GKE kube reserved functionality that is currently blocking me as I don't want to have to figure out creating a custom configuration container to set kube reserved and then figure out how to get it to work with Karpenter.
Has there been any more progress on this? Specifically towards an API design similar to the one in my comment above, copied here.
[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"
[settings.kubernetes.kube-reserved]
"ephemeral-storage" = "1Gi"
I see this has been removed from the backlog, is there a reason why?
@jonathan-innis I don't suppose you could point me in the right direction to override the kube-reserved at runtime?
Sorry about the churn. We aren't using the backlog milestone anymore, so I removed it from there. This is still something that looks interesting. Just need a plan and someone that can take it forward.
@dmitmasy another one for your awareness and prioritization.
@stmcginnis is there an example of modifying the settings from within a bootstrap container which I could use as the basis for a temporary solution?
Not that I know of. /etc/kubernetes/kubelet/config would need to be updated, so it would need to happen after the template for that config file has been rendered during start up. It could be edited and kubelet restarted to pick up the change.
I think settings.kubernetes.max-pods is probably the safest approach for now. Which would mean precalculating what you want that value to be and setting it in the user data so the kubelet configuration uses that rather than the default.
@stmcginnis I hardcode max pods to 110 but want to customise my kube-reserved values. I was expecting to be able to do something like below in a bootstrap container?
apiclient set \
kubernetes.kube-reserved.cpu="${kube_reserved_cpu}m" \
kubernetes.kube-reserved.memory="${kube_reserved_memory}Mi"
As an aside, and I might not be looking hard enough, but I expected to find a container in the public ECR which would allow me to bootstrap Bottlerocket directly from user data.
You could set those values as well.
Typically these settings are provided at boot time in the user data, so there isn't a need to have a separate bootstrap container to make these apiclient calls. That is either provided in the "User Data" field in the AWS console, as a user-data.toml file on bare metal or with the aws ec2 CLI, or as YAML values in an eksctl config file.
@stmcginnis that's what I do for ASGs, as the defaults are wrong enough to brick nodes. But with Karpenter the user data needs to be generic, as the instance can be any size. This issue tracks the actual details, but I'm currently trying to POC dynamically setting these as I can do for AL2.
Yeah, in that case unfortunately you will have to go with the container route to call apiclient with the necessary settings. :/
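A rough sketch of what such a bootstrap container's entrypoint might do, reusing the apiclient keys from the snippet above (the sizing math is a placeholder, and whether the container sees the host's /proc and the API socket depends on how it's configured):
#!/usr/bin/env bash
set -euo pipefail

# Read node size from the kernel rather than calling EC2 APIs.
mem_mib=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
cpus=$(nproc)

# Placeholder calculations; substitute your preferred curve,
# e.g. the GKE-style functions sketched earlier in the thread.
kube_reserved_memory=$(( 255 + mem_mib / 10 ))
kube_reserved_cpu=$(( cpus * 10 ))

apiclient set \
  kubernetes.kube-reserved.cpu="${kube_reserved_cpu}m" \
  kubernetes.kube-reserved.memory="${kube_reserved_memory}Mi"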
I think #2010 is relevant here.
@stevehipwell did you find a way to dynamically set cpu / memory for kube-reserved?
@olfway I've been waiting for Karpenter to support alternative kube-reserved calculations. But AFAIK there is still no official ECR image with the correct entrypoint for this.
Sorry for the silly question, but just to make sure I understand correctly: does this issue mean that Bottlerocket with prefix delegation is not handling max-pods correctly? I have a situation in my EKS cluster where I tried to add spec.kubelet.maxPods=110 to the NodePool, e.g.:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: Bottlerocket
amiSelectorTerms:
- alias: bottlerocket@latest
kubelet:
maxPods: 110
blockDeviceMappings:
# Root device
- deviceName: /dev/xvda
ebs:
volumeSize: 4Gi
volumeType: gp3
encrypted: true
# Data device: Container resources such as images and logs
- deviceName: /dev/xvdb
ebs:
volumeSize: 20Gi
volumeType: gp3
encrypted: true
role: eks-management-karpenter-node-role
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: management
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: management
tags:
karpenter.sh/discovery: management
This way nodes are created with a huge amount of system-reserved memory (40%) and they get packed with Pods, which leads to Pod evictions.
@irizzant unless something has changed, the reserved values are tied to the arbitrary number of pods AWS says can run on a specific node type, and this uses the pre-prefix logic where the instance type's ENI count limits pod density. This also ignores the actual user-defined max pods if set.
This logic is incorrect for a number of reasons. Not respecting the user-defined max pods makes zero sense. Prefix mode makes the pod density limitations disappear. K8s officially has a 110 max pods per node recommendation, and as all nodes under prefix mode can support 110 pods dynamically, this should be the default.
Also, as the per-pod calculation was run on AL2 using Docker as the CRI a number of years ago, it should be recalculated for Bottlerocket.
My recommendation would be to always explicitly set max pods and kube reserved. I make the reserved values a function of max pods and have run tests using pods with various container counts to come up with a value.
@stevehipwell thanks for the reply but I'm confused. As far as I know Karpenter already assigns a kube reserved amount of resources based on the instance type, which should be ok in theory: https://github.com/aws/karpenter-provider-aws/blob/f0dbb2689871d6da851a89090fb738b5476393af/pkg/providers/instancetype/types.go#L369C1-L402C2
Also EKS AMI has its own logic: https://github.com/awslabs/amazon-eks-ami/blob/4b54ee95d42df8a2715add2a32f5150db097fde8/files/bootstrap.sh#L247-L251
When I set max pods = 110 I expect reserved memory to be around 1.4 Gi following the EKS AMI logic, which is a lot for small nodes.
Are you saying that when I add max pods = 110, Karpenter's logic is overridden by the EKS AMI's? Or is it Karpenter's logic that kicks in and ends up with a wrong calculation, and this is why I have to add custom values?
@irizzant I was specifically referring to the AMI logic, not the Karpenter logic. Last time I dug into the details Karpenter was mirroring this behaviour, but I've switched approach and haven't been following closely for a few months.
As Karpenter uses the fleet API I'm not sure it can set dynamic reserved values by instance type as it doesn't know the actual types in advance. I think any calculations it's doing are to mirror the AMI defaults.
When I'm back on a reasonable size screen I'll take a look at the logic in detail.
@stevehipwell thanks, I've checked myself and it looks like you're right about the AMI logic being the same as Karpenter's; it uses:
memory_to_reserve=$((11 * $max_num_pods + 255))
which again is too much for small nodes. So you're suggesting to add max pods = 110 and use a fixed value for kube reserved which is suitable for nodes over a certain size, right?
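Concretely, at the 110-pod ceiling that formula reserves:
echo $(( 11 * 110 + 255 ))   # 1465 MiB, i.e. over a third of a 4 GiB node's memory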
Also, I may be wrong but I don't remember experiencing issues with Pod density before upgrading to Karpenter v1.
Maybe this issue is related? https://github.com/aws/karpenter-provider-aws/issues/6890
I'm currently trying to test my above-mentioned EC2NodeClass with Karpenter v0.37 to check if anything is different.
@stevehipwell I confirm that this is not related to Karpenter v1; the previous version v0.37 gives the same result.
Given the absence of a way to set dynamic reserved values by instance type, I suppose the only way to proceed is to set kube reserved values manually and adjust the NodeClass to be big enough to tolerate the chosen values, right?
@stevehipwell thanks for the updates; as I wrote in https://github.com/awslabs/amazon-eks-ami/issues/1145#issuecomment-2393466340, I still don't understand how to pass dynamic values in Karpenter NodeClasses with bootstrap commands.
@irizzant I'm not sure if I fully understand your question, but to implement this via Karpenter you'd need a command that can do the dynamic work and then pass that in via user data.
@stevehipwell let me try to explain. The bootstrap commands doc shows you can use apiclient to interact with Bottlerocket.
Is it possible to query the instance data (cpu, memory, ...) using apiclient and then do a calculation based on these parameters?
If yes then I'd pass the calculation to Karpenter NodeClass using user data and we're good.
Otherwise can you please detail how to proceed?
@irizzant that is something that I think you'd need to speak to the maintainers about. I'm not sure what data the apiclient has access to.
What I'd like:
I ran into an issue when using https://github.com/awslabs/karpenter and Bottlerocket with the new c6i instance types (see: https://github.com/bottlerocket-os/bottlerocket/issues/1720). Karpenter simulates pod scheduling and provisions instances by discovering instance types from EC2 and binpacking pods onto the node. It uses a formula (14 * AttachableENICount) to compute the max pods value. It also binds the pods before the node comes online as an optimization. If Bottlerocket is unaware of a new instance type, it will default to MaxPods of 110, which is woefully short of the actual number of pods that can be scheduled using the AWS VPC CNI. This causes Karpenter's scheduling logic to disagree with the Node's MaxPods, and the Kubelet reports OutOfPods errors for the bound pods.
It's challenging for Bottlerocket to come up with a good default here (i.e. 110), since you have no idea whether the unknown instance type is large or small. Calling EC2 DescribeInstances on OS boot would introduce significant call volume during large scale-ups, as well as increase startup latency.
I see a decent path forward:
- I'd love to see an option in Bottlerocket to assume that the CNI is not the limiting factor for max pods, and instead compute max pods based off of cpu/memory of the instance (e.g. max-pods-per-core: 10), which can cheaply be discovered at boot

Any alternatives you've considered: Looking forward to discussing this path or any others you folks identify.