tabern closed this issue 3 years ago
No concerns with that, totally valid use case and is supported. The scheduler handles that. Prefixes are associated with "regular" EC2 ENIs, while pods requiring security groups are attached to separate branch ENIs.
That's fantastic thank you, this helps a lot.
Update - Managed node groups now use a server-side version of this formula to automatically set the right max pods value, as long as you have upgraded to VPC CNI version 1.9. This helps for both prefix assignment use cases and CNI custom networking, where you previously needed to manually set a lower max pods value. The max pods value will be set on any newly created managed node groups, or node groups updated to a newer AMI version.
That's awesome, thank you! 👍
Additional question, do we still need to enable this feature in the CNI add-on environment variables, like this page says: https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
Yes, you still need to enable it. We plan to enable it by default in a future release. Additionally, we've begun work on #1333 so you can enable it directly through the EKS add-ons API in the future.
@mikestef9, is it possible to optionally take sig-scalability's defined thresholds into account and limit the max pods per node on a managed nodegroup to min(110, 10*#cores).
Reference: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
What problem are you trying to solve by having us change to that formula? You think 110 is too high for an instance type like m5.large? This feature is targeted at such users of m5.large where the previous limit of 29 was too low.
The max pods formula for MNG is now:
<=30 vCPUs: min(110, max IPs based on CNI settings)
>30 vCPUs: min(250, max IPs based on CNI settings)
This is based on internal testing done by our scalability team. However, it's impossible to simulate all possible combinations of real-world workloads. As a best practice, you should be setting resource requests/limits on your pods. The point is that IP addresses are no longer the limiting factor for pods per node when using prefix assignment.
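The formula above can be sketched as a small helper. This is a hypothetical illustration of the quoted rule, not the actual MNG implementation; the vCPU counts and IP limits passed in are example figures, not authoritative values for any instance type.

```shell
# Hypothetical helper illustrating the MNG formula quoted above:
#   <= 30 vCPUs -> min(110, max IPs based on CNI settings)
#   >  30 vCPUs -> min(250, max IPs based on CNI settings)
recommended_max_pods() {
  local vcpus=$1 max_ips=$2 cap
  if [ "$vcpus" -le 30 ]; then cap=110; else cap=250; fi
  # The lower of the cap and the IP-based limit wins.
  if [ "$max_ips" -lt "$cap" ]; then echo "$max_ips"; else echo "$cap"; fi
}
recommended_max_pods 2 17    # small instance, prefix delegation off -> 17
recommended_max_pods 2 256   # small instance, prefix delegation on  -> 110
recommended_max_pods 48 737  # large instance, prefix delegation on  -> 250
```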
I understand that this feature solves the issue with too few pods being allowed on a node that can potentially handle more. Depending on the type of workloads, the opposite may also be needed i.e. setting max pods on the node to less than the IP/ENI limit would impose. Setting maximums like the 110 and 250 is a good start, but it would be much better if it was a nodegroup setting that one can use to self-restrict nodes to a lower number.
We do set requests/limits per pod, but running at high pod densities leaves few resources to be shared by burstable workloads. For example, some Java apps need the extra resources buffer to scale up as opposed to out. When there's too many of these on a single node, memory pressure causes pods to get evicted from the node. While this is a normal behavior, the startup time of such pods is not the best so we'd rather prevent such occurrences as much as possible.
Understood, that makes sense. Today, you can override the max pods setting when using managed node groups, but it requires extra effort. You need to use a launch template, specify the EKS AMI ID as the "custom" image ID in the LT, then manually add the bootstrap script in user data, like:
#!/bin/bash
set -ex
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args "--max-pods=25"
I think it's a valid feature request to expose max pods directly through the MNG API, can you open a separate containers roadmap issue with that request?
Side note - this will be much easier with native Bottlerocket support in managed node groups #950, which is coming soon. You'll simply need to add the following in the launch template user data (no need to set the image ID in LT)
[settings.kubernetes]
max-pods = 25
Request submitted: https://github.com/aws/containers-roadmap/issues/1492
Thanks!
@sstoyanovucsd terraform-aws-modules/terraform-aws-eks has a working pattern that doesn't require a custom AMI and terraform-aws-modules/terraform-aws-eks#1433 shows how to optimise this as well as set other bootstrap.sh options.
Blog is out that dives into this feature in more detail
https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/
Thanks @mikestef9! Quick question: how do I troubleshoot the managed node group not updating the max pods per node configuration? I have the 1.9.0 CNI plugin (through the add-on), I added the ENABLE_PREFIX_DELEGATION and WARM_PREFIX_TARGET values to the aws-node DaemonSet, and I deleted and recreated the MNG, but my max pods per node is still 17 (on t3.medium instances).
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.0-eksbuild.1
amazon-k8s-cni:v1.9.0-eksbuild.1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep ENABLE_PREFIX_DELEGATION
ENABLE_PREFIX_DELEGATION: true
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep WARM_PREFIX_TARGET
WARM_PREFIX_TARGET: 1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe node |grep pods
pods: 17
pods: 17
Normal NodeAllocatableEnforced 24m kubelet Updated Node Allocatable limit across pods
pods: 17
pods: 17
Normal NodeAllocatableEnforced 24m kubelet Updated Node Allocatable limit across pods
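For reference, 17 is exactly what the classic (non-prefix) secondary-IP formula gives for a t3.medium. A minimal sketch, with the t3.medium ENI limits (3 ENIs, 6 IPv4 addresses per ENI) hard-coded as assumptions:

```shell
# Classic (non-prefix) VPC CNI limit: ENIs * (IPv4 addresses per ENI - 1) + 2.
# One IP per ENI is reserved for the ENI's primary address; the +2 accounts
# for pods that use host networking and so don't consume a secondary IP.
classic_max_pods() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}
classic_max_pods 3 6   # t3.medium (3 ENIs, 6 IPs each) -> 17
```

Seeing 17 therefore suggests the node is still using the secondary-IP limit rather than the prefix-delegation one.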
@gpothier have you updated the kubelet args to override the defaults?
@stevehipwell No I haven't, but according to the blog post @mikestef9 linked, the MNG should take care of that:
As part of this launch, we’ve updated EKS managed node groups to automatically calculate and set the recommended max pod value based on instance type and VPC CNI configuration values, as long as you are using at least VPC CNI version 1.9
Or did I misunderstand something?
@gpothier sorry I hadn't read the blog post, I'll leave this one to @mikestef9.
@mikestef9 what happens when we're using custom networking and ENI prefixes with the official AMI? We manually set USE_MAX_PODS=false in the env and add --max-pods to KUBELET_EXTRA_ARGS, both to be picked up by bootstrap.sh.
@mikestef9 could you also confirm that the other EKS ecosystem components work correctly with ENABLE_PREFIX_DELEGATION set? I'm specifically thinking of the aws-load-balancer-controller, but it'd be good to know that NTH and the CSI drivers have all been tested and work correctly.
@stevehipwell I tested AWS Load Balancer Controller v2.2 on an ENABLE_PREFIX_DELEGATION-enabled cluster and haven't seen any problems yet.
Support for prefix delegation was in v2.2.2 of LB controller
https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.2.2
@gpothier are you specifying an image id in a launch template used with the managed node group?
@stevehipwell all of the VPC CNI settings that may affect max pods are taken into account, including custom networking
@mikestef9 I didn't create the launch template explicitly, so I didn't specify an image id myself, but the launch template does exist and its image id is ami-0bb07d9c8d6ca41e8. The cluster and node group were created by terraform, using the terraform-aws-eks module.
I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.
Thanks @mikestef9 this is actually the answer I needed to my above question.
Thanks a lot @mikestef9. As far as I can tell, the launch templates were created by the MNG, not by terraform. The node_groups submodule of the terraform-aws-eks module has the create_launch_template option set to false by default (and I do not override it). And I checked that there is no mention of the node groups' launch templates in the terraform state (the ones that appear here are used by the NAT gateways):
gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ terraform-1.0.3 state list |grep launch_template
aws_launch_template.nat_gateway_template[0]
aws_launch_template.nat_gateway_template[1]
gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $
Also, in the AWS console, the node groups' launch templates appear to have been created by the MNG: the Created by field says "arn:aws:sts::015328124252:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS".
Hi @mikestef9, do you think you could give me a pointer on how to troubleshoot the Managed Node Group not updating the max pods per node configuration? Given that as far as I can tell I meet all the requirements, in particular the launch template is the one created by the MNG so I don't have control over it, I am a bit at a loss.
Do you have multiple instance types specified in the managed node group? If so, MNG uses the minimum value calculated across all instance types. So if you have a non-Nitro instance like m4.2xlarge, for example, the node group will use 58 as the max pods value.
Thanks @mikestef9 that was it! Although all the existing instances were indeed Nitro (t3.medium), the allowed instances included non-nitro ones. I recreated the MNG allowing only t3.medium and t3.small instances and the pod limit is now 110.
This raises a question though: shouldn't the max pods per node property be set independently for each node, according to the node's capacity?
Glad to hear it. Managed node groups must specify the max pods value as part of the launch template that we create behind the scenes for each node group. That launch template is associated with an autoscaling group that we also create. The autoscaling group gets assigned the list of desired instance types, but there is no way to know ahead of time which instance type the ASG will spin up. So to be safe, we pick the lowest value of all instance types in the list.
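The "lowest value across the list" behavior described above can be sketched as follows. The 110 and 58 figures come from the earlier comments (t3.medium with prefix delegation, and non-Nitro m4.2xlarge); this is an illustration, not the actual MNG code.

```shell
# Sketch: MNG assigns one max-pods value per node group, so with multiple
# instance types it must take the minimum of the per-type recommendations,
# since it can't know ahead of time which type the ASG will launch.
min_max_pods() {
  local min=$1; shift
  for v in "$@"; do
    if [ "$v" -lt "$min" ]; then min=$v; fi
  done
  echo "$min"
}
min_max_pods 110 58   # t3.medium (110) mixed with m4.2xlarge (58) -> 58
```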
Wouldn't it be much better to determine the max pod value during bootstrapping of the node (i.e. in the bootstrap.sh)?
In this case it would work with different node types, because each node type could get the appropriate max pod value.
The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version/settings because calls from there will not be authenticated until the aws-auth config map is updated first.
I see. Thank you for the explanation.
@mikestef9 it looks like the AMI bootstrap hasn't been updated to work correctly with this change and if used on a small instance could cause resource issues for kubelet.
awslabs/amazon-eks-ami#782
@mikestef9 related to my comment above, how come EKS has decided to go over the K8s large clusters guide recommendation of a maximum 110 pods per node?
Tell us about your request All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods per node limits.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Today, the max number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI for use by pods. VPC CNI should support at least the Kubernetes recommended pods per node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.
Are you currently working around this issue? Using larger instance types, or adding more nodes to a cluster that aren't fully utilized.
Additional context Take the m5.2xlarge for example, which has 8 vCPUs. Based on Kubernetes recommended limits of pods per node of min(110, 10*#cores), this instance type should support 80 pods. However when using custom networking today, it only supports 44 pods.
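For comparison, the upstream sig-scalability recommendation referenced above can be sketched as a one-liner (a hedged illustration; core counts are examples):

```shell
# Kubernetes large-cluster guidance: min(110, 10 * cores) pods per node.
k8s_recommended_pods() {
  local cores=$1 by_cores=$(( 10 * $1 ))
  if [ "$by_cores" -lt 110 ]; then echo "$by_cores"; else echo 110; fi
}
k8s_recommended_pods 8    # m5.2xlarge (8 vCPUs) -> 80
k8s_recommended_pods 16   # 16 vCPUs -> 110 (capped)
```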
Edit Feature is released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/