Decouple MaxPods from CNI Logic

ellistarn commented 3 years ago

What I'd like:

I ran into am issue when using https://github.com/awslabs/karpenter and Bottlerocket with the new c6i instance types (see: https://github.com/bottlerocket-os/bottlerocket/issues/1720). Karpenter simulates pod scheduling and provisions instances by discovering instance types from EC2 and binpacking pods onto the node. It uses a formula (14 * AttachableENICount) to compute the max pods value. It also binds the pods before the node comes online as an optimization. If Bottlerocket is unaware of a new instance type, it will default to MaxPods of 110, which is woefully short of the actual number of pods that can be scheduled using the AWS VPC CNI. This causes Karpener's scheduling logic to disagree with the Node's MaxPods, and the Kubelet reports OutOfPods errors for the bound pods.

It's challenging for Bottlerocket to come up with a good default here (i.e. 110), since you have no idea whether or not the unknown instance type is large or small. Calling EC2 Describe Instances on OS boot will introduce significant call volume from large scale ups, as well as increase startup latency.

I see a decent path forward:

The latest CNI (with IP Prefix Delegation) support significantly increased max pods per node.
Overlay networks like Calico and Cilium support significantly increased max pods per node.
Moving forward, I expect that the MaxPods limit will become bound by kubelet cpu/memory overhead instead of AWS VPC limitations.

I'd love to see an option in Bottlerocket to assume that the CNI is not the limiting factor for max pods, and instead compute max pods based off of cpu/memory of the instance (e.g. max-pods-per-core: 10), which can cheaply be discovered at boot

Any alternatives you've considered: Looking forward to discussing this path or any others you folks identify.

jhaynes commented 3 years ago

In the short term, we are planning to add these new instances to our existing list so the correct max_pods is determined by default. In the longer term, this is something that we and the EKS team have been trying to sort out.

The latest CNI (with IP Prefix Delegation) support significantly increased max pods per node.

This definitely helps and is something we support, but because the node cannot determine CNI type/version, we haven't been able to figure this out dynamically. Instead, when using this feature, customers will need to specify the max_pods manually and can determine the right number via the helper script EKS maintains.

Overlay networks like Calico and Cilium support significantly increased max_pods per node.

They do, but again, we don't have a way to figure that out on the node when determining max_pods so we're left with needing customers to tell us what the right value is.

Moving forward, I expect that the MaxPods limit will become bound by kubelet cpu/memory overhead instead of AWS VPC limitations.

You may be right here, but until we know we're no longer limited by ENI attachment, changing this behavior to key off of core count or memory size could make sense, but I'm wary of the effect it may have on existing customers and their expectations.

I'd love to see an option in Bottlerocket to assume that the CNI is not the limiting factor for max pods, and instead compute max pods based off of cpu/memory of the instance (e.g. max-pods-per-core: 10), which can cheaply be discovered at boot

This new setting is an interesting option, however, it seems to me that it would need to be calculated by the customer based on their choice of CNI plug-in, delegated prefix size, and instance type choice. Which leaves us in a similar state to where we are today but with a more abstracted setting to configure max_pods.

I'd like to keep this issue open and updated while we continue to explore our options for making this work more cleanly for our customers.

ellistarn commented 3 years ago

The main challenge of specifying max_pods is that it must be done in a launch template's user data. In Karpenter, we launch many different instance types, which means we'd be forced to create an equal number of launch templates. Using a relative value rather than an absolute value (e.g. max_pods -> max_pods_per_core) allows us to reason generally using these launch templates and reduce the number of templates we need to create.

What are your thoughts about building this as an option? You could even support multiple options for max pods like you do today, but just add another option.

Use MaxPods if set
Use MaxPodsPerCore if set
Use the instancetype->maxpod mapping
Fall back to 110 (or something reasonable, perhaps # 2?)

stevehipwell commented 2 years ago

@ellistarn how come you want to use more than 110 pods on a node, the official K8s guidance is a maximum of 110 pods per node?

I was going to open a new issue to suggest that the default logic should be to use 110, with a pattern to use for running the AWS VPC CNI in legacy non-prefix (IPv4 or IPv6) mode to set the limit for instances which can't support 110 pods. But thought it'd be best to continue the discussion here.

We also need to talk about setting the kube reserved values, these should be directly linked to the node resources so need to be dynamically generated. The GKE docs are the best reference for this, the AL2 AMI already uses this for it's CPU calculation.

I also opened https://github.com/awslabs/amazon-eks-ami/issues/782 for the AL2 AMI but haven't had any responses yet.

ellistarn commented 2 years ago

I'm open minded to this idea. @riskyadventure has been discussing with me the perf impact of context switching between so many processes, at least when under load. Historically, k8s-on-aws has offered much higher densities, so this would be backwards incompatible. We'd need to figure out a glide path, and probably want the same solution for EKSOptimizedAMI and Bottlerocket.

stevehipwell commented 2 years ago

I'm open minded to this idea. @RiskyAdventure has been discussing with me the perf impact of context switching between so many processes, at least when under load. Historically, k8s-on-aws has offered much higher densities, so this would be backwards incompatible. We'd need to figure out a glide path, and probably want the same solution for EKSOptimizedAMI and Bottlerocket.

I agree, this should be a consistent default across all of the EKS ecosystem which should align with K8s best practice (the current AL2 EKS defaults will cause actual issues in relative common scenarios).

jonathan-innis commented 2 years ago

Following up on this thread as I'm picking up work for Karpenter to enable a podsPerCore feature which is currently something that is supported directly by the kubelet. Any reason that bottlerocket has chosen not to implement this feature and has instead opted to just expose the maxPods value for Kubernetes configuration?

bcressey commented 2 years ago

Any reason that bottlerocket has chosen not to implement this feature and has instead opted to just expose the maxPods value for Kubernetes configuration?

There's no specific reason; it's only that the implementation hasn't been picked up, not that there's been an intentional choice not to go the pods-per-core direction. The max-pods setting has been around since close to the beginning, and the kube-reserved settings are layered on top, so it's a bit of a tangle.

It sounds like the work here is:

add settings.kubernetes.pods-per-core so this can be specified in user data
leave the default unset, unless there's a reasonable default value (is there?)
take pods-per-core into account when calculating kube-reserved defaults

I'm not clear on what should be done with the max-pods default. If it's left unset then kubelet will just default to 110, potentially causing pods-per-core to be ignored, which seems like an unpleasant surprise.

An updated approach for calculating max-pods might be:

if max-pods is specified, use that value
if pods-per-core is specified, then max-pods = pods-per-core * num-cores
if pods-per-core is unspecified, then max-pods uses the ENI-related defaults on aws-k8s-*

That would be straightforward and compatible with existing deployments, unless there's a strong case to be made for a pods-per-core default value.

@ellistarn , @stevehipwell - what do you think?

stevehipwell commented 2 years ago

@bcressey I think it's worth splitting out the actual behaviour requirements and then working back up from there to get the settings spec; I would list these as max-pods, kube-reserved memory and kube-reserved CPU. I think @bwagner5 might have some thoughts on this and has been testing some of the options.

Max Pods

The basic requirements for max pods are as follows, I would argue that a default of 110 would be best based on the Considerations for large clusters documentation.

A number of pods
A number of pods per core
ENI IP logic
ENI prefix logic

For the implementation, pods per core and ENI modes could use the number as a constraint if it was set meaning that my suggested default of 110 would be implicit so as to not interfere with any explicit input.

Kube Reserved Memory

The legacy ENI behaviour for kube-reserved memory is based on the ENI IP logic and is a function of pods rather than a function of system memory as found in other implementations; this wasn't updated for ENI prefix pod counts for AL2 as it appears to have been here. The GKE implementation (also used by AKS amongst others) would be my preferred logic and this is what I manually specify when running EKS nodes. Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.

GKE logic
ENI logic (legacy)
Fixed value

Kube Reserved CPU

Everyone already uses the GKE logic for this, so it's also the legacy ENI logic. As above Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.

GKE logic
Fixed value

Implementation

I think the following inputs and values would cover all of the use cases above.

settings.kubernetes.max-pods-mode
- default - Not AWS ENI
- aws-eni - Use the current Pluto implementation, could be default for AWS if nothing else set for max pods
settings.kubernetes.max-pods
- Number, implicitly 110 if not set and in default mode, if set in AWS ENI mode it should be the upper bounds
settings.kubernetes.pods-per-core
- Number, only in default mode, constrained by settings.kubernetes.max-pods if it is set
settings.kubernetes.kube-reserved-memory-calc
- max-pods - Current ENI behaviour
- system-memory - GKE logic
settings.kubernetes.kube-reserved-cpu-calc
- system-cpu - Current behaviour

My end-user requirements for this would be to be able to set my nodes to have a max-pods of 110 and to have kube-reserved set according to the GKE logic from a single configuration for any size node.

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"

[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi"

jonathan-innis commented 2 years ago

I think in general this all makes sense based on the use cases. In terms of the API surface, is there a reason that kube-reserved-memory-calc and kube-reserved-cpu-calc are sitting inside of the settings.kubernetes rather than settings.kubernetes.kube-reserved since it is scoped to this anyway. Otherwise I think your overall logic for max-pods-mode, max-pods and pods-per-core all makes sense based on the requirements

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110

[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi"
memory-calc-mode = "system-memory"
cpu-calc-mode = "system-cpu"

stevehipwell commented 2 years ago

@jonathan-innis the API design was to illustrate my points with the expectation that the actual design would be different. I agree that if possible nesting the values would be better.

jonathan-innis commented 2 years ago

Sure, with that in mind, what you put sounds reasonable to me

bwagner5 commented 2 years ago

Missed this thread, but this all sounds pretty reasonable to me as well. There is one more case for the VPC CNI "Custom Networking" configured via the ENIConfig CRD where you want nodes to have a different subnet/SG than pods which alters the ENI limited calcs since you lose a usable ENI for pods.

I'd also hope that overtime the ENI limited max pods definition would become less relevant and the maximum amount of pods a node can handle would be based on resource limits, in which case 110 is probably a reasonable upper bound.

I have done some preliminary data gathering on the overhead difference for kube-reserved memory that is not based on ENI limited max-pods here. I haven't run tests on these yet and I think we'd need to do quite a bit of testing to make sure this wouldn't surprise users (or allow an escape hatch).

jonathan-innis commented 2 years ago

I think we can definitely decouple these two features:

GKE kube-reserved calculation
Adding podsPerCore to settings.kubernetes

In the case of podsPerCore, I would see this as the higher priority (and easier) item to implement. This brings parity with kubelet by surfacing more configuration parameters that are already available for users who have manually configured their AMIs to handle themselves.

On top of that, I don't see a reason to add a max-pods-mode. Primarily because we can assume that the user is using ENI-limited max pods unless they have explicitly set the max-pods value to be 110 or some other value.

It feels like we can punt both the GKE calculation decision and the max-pods-mode decision in favor of just adding podsPerCore for now and just calculating the kube-reserved based off of pods which would be

pods = min(max-pods, pods-per-core * vCPUs)

ellistarn commented 2 years ago

@jonathan-innis , do you still think https://kubernetes.io/docs/concepts/containers/runtime-class/#pod-overhead may be a fruitful route?

jonathan-innis commented 2 years ago

I think from a kube-reserved perspective, pod-overhead may make sense through the use of RuntimeClass but it does require additional configuration on the user side, including an addition Kyverno policy or mutating webhook for injecting the runtimeClassName into sets of pods.

The other consideration is that AL2 and Ubuntu still will not support the GKE calculation for kube-reserved so it may make sense to migrate or surface additional configuration in Bottlerocket when this support exists over there or there is enough customer ask for it.

I don't want to lose the focus of the discussion, though, which is arguing that if there is configuration that is surfaced in Kubelet Configuration, Bottlerocket should also support it to make the experience as seamless as possible for users.

stevehipwell commented 2 years ago

@jonathan-innis kube-reserved should be decoupled from max pods by default, it's only the AWS implementation which couples them together (incorrectly in my opinion); so implementing the GKE kube-reserved calculation makes this a viable option.

On top of that, I don't see a reason to add a max-pods-mode. Primarily because we can assume that the user is using ENI-limited max pods unless they have explicitly set the max-pods value to be 110 or some other value.

I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.

It feels like we can punt both the GKE calculation decision and the max-pods-mode decision in favor of just adding podsPerCore for now and just calculating the kube-reserved based off of pods which would be

I don't think these two features are linked in this way, they have interactions but aren't mutually exclusive as I described in https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-1209254902. I see podsPerCore as constrained by max pods if both are set, and if GKE mode is in use it has no impact on kube reserved.

jonathan-innis commented 2 years ago

I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.

I don't think this is a reason to have a max-pods-mode, this is just another thing that should be taken into consideration when discovering the max-pods on the cluster.

I don't think these two features are linked in this way

Can you explain how they aren't linked in this way? We are currently doing the calculation based on pods so we should continue to do the calculation based on the max pods based on the min of the max-pods and pods-per core as described in this equation

pods = min(max-pods, pods-per-core * vCPUs)

I see podsPerCore as constrained by max pods if both are set, and if GKE mode is in use it has no impact on kube reserved.

Completely agree with this

stevehipwell commented 2 years ago

I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.

I don't think this is a reason to have a max-pods-mode, this is just another thing that should be taken into consideration when discovering the max-pods on the cluster.

@jonathan-innis I'm pretty sure I've heard this reason given for why the way AL2 currently works cannot be changed, if you can automate the discovery of the VPC CNI and it's specific mode (legacy, custom networking, IPv4 prefix, or IPv6) then I agree that this isn't needed.

I don't think these two features are linked in this way

Can you explain how they aren't linked in this way? We are currently doing the calculation based on pods so we should continue to do the calculation based on the max pods based on the min of the max-pods and pods-per core as described in this equation

Because the GKE calculation doesn't use max pods in it's algorithm, so adding the capability to set a max pods per core isn't related to using the GKE calculation. So what I'm saying is the implementing podsPerCore doesn't address the use case for adding the GKE algorithm either way.

In my opinion the GKE algorithm should be the highest priority here as it removes the requirement to couple max pods to kube reserved, which means when support for podsPerCore is added it can be tested in a way that only impacts one configuration component. Also unless you know differently the upstream kubelet configuration would generally assume the GKE algorithm (or something similar) was being used for kube reserved.

I am biased as it is the GKE kube reserved functionality that is currently blocking me as I don't want to have to figure out creating a custom configuration container to set kube reserved and then figure out how to get it to work with Karpenter.

stevehipwell commented 1 year ago

Has there been any more progress on this? Specifically towards an API design similar to in my comment above and copied here.

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"

[settings.kubernetes.kube-reserved]
"ephemeral-storage" = "1Gi"

stevehipwell commented 1 year ago

I see this has been removed from the backlog, is there a reason why?

@jonathan-innis I don't suppose you could point me in the right direction to override the kube-reserved at runtime?

stmcginnis commented 1 year ago

Sorry about the churn. We aren't using the backlog milestone anymore, so I removed it from there. This is still something that looks interesting. Just need a plan and someone that can take it forward.

stmcginnis commented 1 year ago

@dmitmasy another one for your awareness and prioritization.

stevehipwell commented 1 year ago

@stmcginnis is there an example of modifying the settings from within a bootstrap container which I could use as the basis for a temporary solution?

stmcginnis commented 1 year ago

Not that I know of. /etc/kubernetes/kubelet/config would need to be updated, so it would need to happen after the template for that config file has been rendered during start up. It could be edited and kubelet restarted to pick up the change.

I think settings.kubernetes.max-pods is probably the safest approach for now. Which would mean precalculating what you want that value to be and setting it in the user data so the kubelet configuration uses that rather than the default.

stevehipwell commented 1 year ago

@stmcginnis I hardcode max pods to 110 but want to customise my kube-reserved values. I was expecting to be able to do something like below in a bootstrap container?

apiclient set \
  kubernetes.kube-reserved.cpu="${kube_reserved_cpu}m" \
  kubernetes.kube-reserved.memory="${kube_reserved_memory}Mi"

As an aside and I might not be looking hard enough but I expected to find a container in the public ECR which would allow me to bootstrap Bottlerocket directly from userdata.

stmcginnis commented 1 year ago

You could set those values as well.

Typically these settings are provided at boot time in the user data, so there isn't a need to have a separate bootstrap container to make these apiclient calls. That is either provided in the "User Data" field in the AWS console, as a user-data.toml file in bare metal or aws ec2 cli, or as yaml values in an eksctl config file.

stevehipwell commented 1 year ago

@stmcginnis that's what I do for ASGs as the defaults are wrong enough to brick nodes. But with Karpenter the userdata needs to be generic as the instance can be any size. This issue tracks the actual details, but I'm currently trying to POC dynamically setting these as I can do for AL2.

stmcginnis commented 1 year ago

Yeah, in that case unfortunately you will have to go with the container route to call apiclient with the necessary settings. :/

stevehipwell commented 1 year ago

I think #2010 is relevant here.

olfway commented 11 months ago

@stevehipwell did you find a way to dynamically set cpu / memory for kube-reserved?

stevehipwell commented 11 months ago

@olfway I've been waiting for Karpenter to support alternative kube-reserved calculations. But AFAIK there is still no official ECR image with the correct entrypoint for this.

irizzant commented 1 month ago

Sorry for the silly question but just to make sure I uderstand correctly: does this issue mean that Bottlerocket with prefix delegation is not handling max-pods correctly? I have a situation in my EKS where I tried to add spec.kubelet.maxPods=110 to the NodePool, eg:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
  - alias: bottlerocket@latest
  kubelet:
    maxPods: 110
  blockDeviceMappings:
  # Root device
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 4Gi
      volumeType: gp3
      encrypted: true
  # Data device: Container resources such as images and logs
  - deviceName: /dev/xvdb
    ebs:
      volumeSize: 20Gi
      volumeType: gp3
      encrypted: true
  role: eks-management-karpenter-node-role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: management
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: management
  tags:
    karpenter.sh/discovery: management

This way nodes are created with a huge amount of system reserved memory (40%) and they get packed up with Pods, which leads to Pod evictions.

stevehipwell commented 1 month ago

@irizzant unless something has changed the reserved values are tied to the arbitrary number of pods AWS says can run on a specific node type, and this uses the pre prefix logic where the instance type ENI count limits pod density. This also ignores the actual user defined max pods if set

This logic is incorrect for a number of reasons. Not respecting the user defined max pods makes zero sense. Prefix mode makes the pod density limitations disappear. K8s officially has a 110 max pods per node recommendation, and as all nodes under prefix mode can support 110 pods dynamically this should be the default.

Also as the per pod calculation was run on AL2 using Docker as the CRI a number of years ago it should be recalculated for Bottlerocket.

My recommendation would be to always explicitly set max pods and kube reserved. I make the reserved a function of max pods and have run tests using pods with various container counts to come up with a value.

irizzant commented 1 month ago

@stevehipwell thanks for the reply but I'm confused. As far as I know Karpenter already assigns a kube reserved amount of resources based on the instance type, which should be ok in theory: https://github.com/aws/karpenter-provider-aws/blob/f0dbb2689871d6da851a89090fb738b5476393af/pkg/providers/instancetype/types.go#L369C1-L402C2

Also EKS AMI has its own logic: https://github.com/awslabs/amazon-eks-ami/blob/4b54ee95d42df8a2715add2a32f5150db097fde8/files/bootstrap.sh#L247-L251

When I say max pod = 110 I expect reserved memory to be around 1.2 Gi following EKS AMI logic, which is a lot for small nodes.

Are you saying that when I add max pod = 110 Karpenter logic is overriden by EKS AMI? Or is it Karpenter logic that kicks in and ends up with a wrong calculation and this is why I have to add custom values?

stevehipwell commented 1 month ago

@irizzant I was specifically referring to the AMI logic, not the Karpenter logic. Last time I dug into the details Karpenter was mirroring this behaviour, but I've switched approach and haven't been following closely for a few months.

As Karpenter uses the fleet API I'm not sure it can set dynamic reserved values by instance type as it doesn't know the actual types in advance. I think any calculations it's doing are to mirror the AMI defaults.

When I'm back on a reasonable size screen I'll take a look at the logic in detail.

irizzant commented 1 month ago

@stevehipwell thanks, I've checked myself and it looks like you're right about AMI logic being the same as Karpenter's, it uses: memory_to_reserve=$((11 * $max_num_pods + 255))

which again it's too much for small nodes. So you're suggesting to add max pods = 110 and use a fixed value for kube reserved which is suitable for nodes over a certain size right?

Also, I may be wrong but I don't remember experiencing issues with Pod density before upgrading to Karpenter v1.

Maybe this issue is related ? https://github.com/aws/karpenter-provider-aws/issues/6890

I'm currently trying to test my above mentioned EC2NodeClass with Karpenter v0.37 to check if anything is different

irizzant commented 1 month ago

@stevehipwell I confirm that this is not releated to Karpenter v1, the previous version v0.37 gives the same result.

Given the absence of a way to set dynamic reserved values by instance type I suppose the only way to proceed is to set kube reserved values manually and adjust NodeClass to be big enough to tolerate the chosen values right?

stevehipwell commented 3 weeks ago

I think 1.22.0 introduced bootstrap commands (via #4131) which should allow dynamic reserved calculations without requiring a custom OCI image; but there doesn't appear to be any documentation for this yet.

irizzant commented 3 weeks ago

@stevehipwell thanks for the updates, as I wrote in https://github.com/awslabs/amazon-eks-ami/issues/1145#issuecomment-2393466340 I still don't understand how to pass dynamic values in Karpenter NodeClasses with bootstrap commands

stevehipwell commented 3 weeks ago

@irizzant I'm not sure if I fully understand your question, but to implement this via Karpenter you'd need a command that can do the dynamic work and then pass that in via user data.

irizzant commented 3 weeks ago

@stevehipwell let me try to explain. The bootstrap commands doc shows you can use apiclient to interact with Bottlerocket.

Is it possible to query the instance data (cpu, memory, ... ) using apiclient and then do a calculation based on these parameters?

If yes then I'd pass the calculation to Karpenter NodeClass using user data and we're good.

Otherwise can you please detail how to proceed?

stevehipwell commented 3 weeks ago

@irizzant that is something that I think you'd need to speak to the maintainers about. I'm not sure what data the apiclient has access to.

bottlerocket-os / bottlerocket