aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] Increased pod density on smaller instance types #138

Closed · tabern closed this issue 3 years ago

tabern commented 5 years ago

Tell us about your request: All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods-per-node limits.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Today, the maximum number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI from use by pods. The VPC CNI should support at least the Kubernetes recommended pods-per-node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.

Are you currently working around this issue? Using larger instance types, or adding more nodes that aren't fully utilized to the cluster.

Additional context: Take the m5.2xlarge, for example, which has 8 vCPUs. Based on the Kubernetes recommended limit of min(110, 10*#cores) pods per node, this instance type should support 80 pods. However, when using custom networking today, it supports only 44 pods.
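
For reference, the ENI-based ceiling comes from the standard VPC CNI calculation, max pods = (number of ENIs) × (IPv4 addresses per ENI − 1) + 2. Using the published EC2 limits for m5.2xlarge (4 ENIs, 15 IPv4 addresses per ENI), the numbers above work out as follows:

Default networking:   4 × (15 − 1) + 2 = 58 pods
Custom networking:    3 × (15 − 1) + 2 = 44 pods  (primary ENI not used for pods)
Kubernetes guidance:  min(110, 10 × 8) = 80 pods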

Edit: this feature has been released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

mikestef9 commented 3 years ago

No concerns with that; it's a totally valid use case and is supported. The scheduler handles it. Prefixes are associated with "regular" EC2 ENIs, while pods requiring security groups are attached to separate branch ENIs.

zswanson commented 3 years ago

That's fantastic thank you, this helps a lot.

mikestef9 commented 3 years ago

Update: managed node groups now use a server-side version of this formula to automatically set the right max pods value, as long as you have upgraded to VPC CNI version 1.9. This helps both with prefix assignment use cases and with CNI custom networking, where you previously needed to manually set a lower max pods value. The max pods value will be set on any newly created managed node groups, and on node groups updated to a newer AMI version.

wolverian commented 3 years ago

> Update: managed node groups now use a server-side version of this formula to automatically set the right max pods value, as long as you have upgraded to VPC CNI version 1.9. This helps both with prefix assignment use cases and with CNI custom networking, where you previously needed to manually set a lower max pods value. The max pods value will be set on any newly created managed node groups, and on node groups updated to a newer AMI version.

That's awesome, thank you! 👍

Additional question: do we still need to enable this feature via the CNI add-on environment variables, as this page describes? https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html

mikestef9 commented 3 years ago

Yes, you still need to enable it. We plan to enable it by default in a future release. Additionally, we've begun work on #1333 so that in the future you can enable it directly through the EKS add-ons API.
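
For anyone following along, the enablement those docs describe boils down to setting two environment variables on the aws-node DaemonSet; a minimal sketch, using the values from the linked page:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1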

sstoyanovucsd commented 3 years ago

@mikestef9, is it possible to optionally take sig-scalability's defined thresholds into account and limit the max pods per node on a managed nodegroup to min(110, 10*#cores)?

Reference: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

mikestef9 commented 3 years ago

What problem are you trying to solve by having us change to that formula? Do you think 110 is too high for an instance type like m5.large? This feature is targeted at exactly those users of instance types like m5.large, where the previous limit of 29 was too low.

The max pods formula for MNG now is

<= 30 vCPUs: min(110, max IPs based on CNI settings)
>  30 vCPUs: min(250, max IPs based on CNI settings)

This is based on internal testing done by our scalability team. However, it's impossible to simulate all possible combinations of real-world workloads. As a best practice, you should be setting resource requests/limits on your pods. The point is that IP addresses are no longer the limiting factor for pods per node when using prefix assignment.
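
To see what value this works out to for a specific instance type and CNI configuration, the amazon-eks-ami repository ships a max-pods-calculator.sh helper; a rough usage sketch (flag names as documented in the EKS user guide, instance type and CNI version here are placeholders):

curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0-eksbuild.1 --cni-prefix-delegation-enabled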

sstoyanovucsd commented 3 years ago

I understand that this feature solves the issue of too few pods being allowed on a node that can potentially handle more. Depending on the type of workloads, the opposite may also be needed, i.e. setting max pods on the node to less than what the IP/ENI limit would impose. Setting maximums like the 110 and 250 is a good start, but it would be much better if it were a nodegroup setting that one could use to self-restrict nodes to a lower number.

We do set requests/limits per pod, but running at high pod densities leaves few resources to be shared by burstable workloads. For example, some Java apps need the extra resource buffer to scale up as opposed to out. When there are too many of these on a single node, memory pressure causes pods to get evicted from the node. While this is normal behavior, the startup time of such pods is not the best, so we'd rather prevent such occurrences as much as possible.

mikestef9 commented 3 years ago

Understood, that makes sense. Today, you can override the max pods setting when using managed node groups, but it requires extra effort. You need to use a launch template, specify the EKS AMI ID as the "custom" image ID in the LT, then manually add the bootstrap script in user data, like:

#!/bin/bash
set -ex
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args "--max-pods=25"

I think it's a valid feature request to expose max pods directly through the MNG API, can you open a separate containers roadmap issue with that request?

Side note - this will be much easier with native Bottlerocket support in managed node groups #950, which is coming soon. You'll simply need to add the following to the launch template user data (no need to set the image ID in the LT):

[settings.kubernetes]
max-pods = 25

sstoyanovucsd commented 3 years ago

Request submitted: https://github.com/aws/containers-roadmap/issues/1492

Thanks!

stevehipwell commented 3 years ago

@sstoyanovucsd terraform-aws-modules/terraform-aws-eks has a working pattern that doesn't require a custom AMI and terraform-aws-modules/terraform-aws-eks#1433 shows how to optimise this as well as set other bootstrap.sh options.

mikestef9 commented 3 years ago

The blog post diving into this feature in more detail is out:

https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

gpothier commented 3 years ago

Thanks @mikestef9! Quick question: how do I troubleshoot the Managed Node Group not updating the max pods per node configuration? I have the 1.9.0 CNI plugin (through the add-on), I added the ENABLE_PREFIX_DELEGATION and WARM_PREFIX_TARGET values to the aws-node DaemonSet, and I deleted and recreated the MNG, but my max pods per node is still 17 (on t3.medium instances).

gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.0-eksbuild.1
amazon-k8s-cni:v1.9.0-eksbuild.1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep ENABLE_PREFIX_DELEGATION
      ENABLE_PREFIX_DELEGATION:            true
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep WARM_PREFIX_TARGET
      WARM_PREFIX_TARGET:                  1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe node |grep pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
stevehipwell commented 3 years ago

@gpothier have you updated the kubelet args to override the defaults?

gpothier commented 3 years ago

@stevehipwell No I haven't, but according to the blog post @mikestef9 linked, the MNG should take care of that:

> As part of this launch, we’ve updated EKS managed node groups to automatically calculate and set the recommended max pod value based on instance type and VPC CNI configuration values, as long as you are using at least VPC CNI version 1.9

Or did I misunderstand something?

stevehipwell commented 3 years ago

@gpothier sorry I hadn't read the blog post, I'll leave this one to @mikestef9.

stevehipwell commented 3 years ago

@mikestef9 what happens when we're using custom networking and ENI prefixes with the official AMI? We manually set USE_MAX_PODS=false in the env and add --max-pods to KUBELET_EXTRA_ARGS to both be picked up by bootstrap.sh.
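
In user-data terms, the setup described above looks roughly like the following (a sketch of the approach in this comment, not an official recommendation; my-cluster and the pod count are placeholders):

#!/bin/bash
set -ex
# Skip the AMI's built-in max-pods lookup and pass an explicit value to kubelet instead.
export USE_MAX_PODS=false
export KUBELET_EXTRA_ARGS="--max-pods=110"
/etc/eks/bootstrap.sh my-cluster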

stevehipwell commented 3 years ago

@mikestef9 could you also confirm that the other EKS ecosystem components work correctly with ENABLE_PREFIX_DELEGATION set? I'm specifically thinking of the aws-load-balancer-controller, but it'd be good to know that NTH and the CSI drivers have also been tested and work correctly.

thanhma commented 3 years ago

@stevehipwell I tested AWS Load Balancer Controller v2.2 on an ENABLE_PREFIX_DELEGATION-enabled cluster and haven't seen any problems yet.

mikestef9 commented 3 years ago

Support for prefix delegation was added in v2.2.2 of the LB controller:

https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.2.2

@gpothier are you specifying an image id in a launch template used with the managed node group?

@stevehipwell all of the VPC CNI settings that may affect max pods are taken into account, including custom networking

gpothier commented 3 years ago

@mikestef9 I didn't create the launch template explicitly, so I didn't specify an image id myself, but the launch template does exist and its image id is ami-0bb07d9c8d6ca41e8. The cluster and node group were created by terraform, using the terraform-aws-eks module.

mikestef9 commented 3 years ago

I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.
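
A quick way to check is to look at what the node group and its launch template reference; a sketch using the AWS CLI (cluster, node group, and launch template IDs are placeholders):

# Does the node group point at a user-supplied launch template?
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.launchTemplate'
# If so, does that launch template pin an AMI via ImageId?
aws ec2 describe-launch-template-versions --launch-template-id lt-0123456789abcdef0 \
  --query 'LaunchTemplateVersions[].LaunchTemplateData.ImageId'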

stevehipwell commented 3 years ago

> I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.

Thanks @mikestef9 this is actually the answer I needed to my above question.

gpothier commented 3 years ago

Thanks a lot @mikestef9. As far as I can tell, the launch templates were created by the MNG, not by terraform. The node_groups submodule of the terraform-aws-eks module has the create_launch_template option set to false by default (and I do not override it). And I checked that there is no mention of the node groups' launch templates in the terraform state (the ones that appear here are used by the NAT gateways):

gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ terraform-1.0.3 state list |grep launch_template
aws_launch_template.nat_gateway_template[0]
aws_launch_template.nat_gateway_template[1]
gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ 

Also, in the AWS console, the node groups' launch templates appear to have been created by the MNG: the Created by field says "arn:aws:sts::015328124252:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS".

gpothier commented 3 years ago

Hi @mikestef9, do you think you could give me a pointer on how to troubleshoot the Managed Node Group not updating the max pods per node configuration? As far as I can tell I meet all the requirements, and in particular the launch template is the one created by the MNG, so I don't have control over it; I'm a bit at a loss.

mikestef9 commented 3 years ago

Do you have multiple instance types specified in the managed node group? If so, MNG uses the minimum value calculated across all instance types. So if you have a non-Nitro instance type like m4.2xlarge, for example, the node group will use 58 as the max pods value.
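
For context on where 58 comes from: m4 instances aren't Nitro-based, so prefix delegation doesn't apply to them and the ENI-based formula governs. With m4.2xlarge's limits (4 ENIs, 15 IPv4 addresses per ENI):

4 × (15 − 1) + 2 = 58 pods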

gpothier commented 3 years ago

Thanks @mikestef9, that was it! Although all the existing instances were indeed Nitro (t3.medium), the allowed instance types included non-Nitro ones. I recreated the MNG allowing only t3.medium and t3.small instances, and the pod limit is now 110.

This raises a question though: shouldn't the max pods per node property be set independently for each node, according to the node's capacity?

mikestef9 commented 3 years ago

Glad to hear it. Managed node groups must specify the max pods value as part of the launch template that we create behind the scenes for each node group. That launch template is associated with an autoscaling group that we also create. The autoscaling group gets assigned the list of desired instance types, but there is no way to know ahead of time which instance type the ASG will spin up. So to be safe, we pick the lowest value of all instance types in the list.

lwimmer commented 3 years ago

Wouldn't it be much better to determine the max pod value during bootstrapping of the node (i.e. in the bootstrap.sh)?

That way it would work with different node types, because each node type could get the appropriate max pods value.

mikestef9 commented 3 years ago

The recommended value of max pods is a function of the instance type and of the version/configuration of the VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (kubectl get ds aws-node) and retrieve the CNI version/settings, because calls from the node will not be authenticated until the aws-auth config map is updated first.

lwimmer commented 3 years ago

> The recommended value of max pods is a function of the instance type and of the version/configuration of the VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (kubectl get ds aws-node) and retrieve the CNI version/settings, because calls from the node will not be authenticated until the aws-auth config map is updated first.

I see. Thank you for the explanation.

stevehipwell commented 3 years ago

@mikestef9 it looks like the AMI bootstrap hasn't been updated to work correctly with this change, and if used on a small instance it could cause resource issues for the kubelet.

awslabs/amazon-eks-ami#782
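
For context, the EKS optimized AMI scales kubelet's memory reservation with the max pods value (roughly 255 MiB + 11 MiB per pod), so raising max pods on a small instance reserves noticeably more memory. For example, using the numbers from this thread:

255 + 11 × 17  =  442 MiB reserved at the old t3.medium limit of 17 pods
255 + 11 × 110 = 1465 MiB reserved at a 110-pod limit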

stevehipwell commented 3 years ago

@mikestef9 related to my comment above, why has EKS decided to go above the K8s large clusters guide recommendation of a maximum of 110 pods per node?