aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] Increased pod density on smaller instance types #138

Closed: tabern closed this issue 3 years ago

tabern commented 5 years ago


Tell us about your request
All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods-per-node limits.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Today, the max number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI from use by pods. The VPC CNI should support at least the Kubernetes recommended pods-per-node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.

Are you currently working around this issue?
Using larger instance types, or adding more nodes to a cluster that aren't fully utilized.

Additional context
Take the m5.2xlarge for example, which has 8 vCPUs. Based on the Kubernetes recommended limit of min(110, 10 * #cores) pods per node, this instance type should support 80 pods. However, when using custom networking today, it only supports 44 pods.
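As a rough sketch of where the 44 and 80 figures come from, assuming the standard EC2 limits for m5.2xlarge (4 ENIs with 15 IPv4 addresses each) and the max-pods formula from the EKS docs:

```bash
#!/bin/bash
# Assumed EC2 limits for m5.2xlarge: 4 ENIs, 15 IPv4 addresses per ENI.
ENIS=4
IPS_PER_ENI=15

# Default VPC CNI (secondary IPs): each ENI loses one IP to its primary address, +2 for host-network pods.
echo $(( ENIS * (IPS_PER_ENI - 1) + 2 ))         # 58 pods

# CNI custom networking: the primary ENI is no longer used for pods.
echo $(( (ENIS - 1) * (IPS_PER_ENI - 1) + 2 ))   # 44 pods

# Kubernetes guidance of min(110, 10 * cores) with 8 vCPUs would instead allow:
echo $(( 110 < 10 * 8 ? 110 : 10 * 8 ))          # 80 pods
```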

Edit: Feature is released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

ghost commented 5 years ago

@tabern could you please elaborate a bit on what this feature brings?

Right now the number of pods on a single node is limited by the --max-pods flag in kubelet, which for EKS is calculated based on the max number of IP addresses the instance can have. This comes from the AWS CNI driver's logic of providing an IP address per pod from the VPC subnet. So for an r4.16xl it is 737 pods.

max-rocket-internet commented 5 years ago

which for EKS is calculated based on the max number of IP addresses the instance can have

That's exactly the problem. What if we want to run 30 very small pods on a t.small?

ghost commented 5 years ago

@max-rocket-internet gotcha. Does it mean instances will get more IPs/ENIs, or are changes coming to the CNI?

max-rocket-internet commented 5 years ago

It means we need to run a different CNI that is not limited by the number of IPs. Currently it's more or less a DIY endeavour, but it would be great to have a supported CNI from AWS for this use case 🙂

laverya commented 5 years ago

Yeah, running weave-net (and overriding the pods-per-node limitations) isn't much of an additional maintenance burden but it would have been nice to have that available by default.

lgg42 commented 5 years ago

Any idea how exactly you are going to proceed with this one? Seems very similar to #71.

tabern commented 5 years ago

Sorry it's been a while without a lot of information. We're committed to enabling this feature and will be wrapping it into the next-generation VPC CNI plugin.

Please let us know what you think on https://github.com/aws/containers-roadmap/issues/398

gitnik commented 3 years ago

The comment by @mikestef9 on #398 refers to this issue for updates regarding the specific issue of pod-density. Since there has been no update on this issue in over a year, could someone from the EKS team give us an update?

mikestef9 commented 3 years ago

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.

bambooiris commented 3 years ago

@mikestef9 hi! Will pod density be increased for bigger instance types as well? This is very important because we are thinking of switching to a different CNI plugin, but if you increase the IP address count any time soon, we will stay with the AWS CNI :)

mikestef9 commented 3 years ago

It will be a 1500% increase in IP addresses on every instance type. However, I don't feel that matters on larger instance types. For example, a c5.4xl today supports 234 IP addresses for pods. Which particular instance type are you using?

bambooiris commented 3 years ago

We are using m5.xlarge and still have enough resources to schedule additional pods, but we are out of free IPs.

mikestef9 commented 3 years ago

Got it. I consider "smaller" to mean any instance type 2xl and below. In this case, m5.xlarge will go from supporting 56 IPs to 896, which will be more than enough for pods to consume all instance resources.

billinghamj commented 3 years ago

Pods can be very very small 😉 But nevertheless, this is a great step

billinghamj commented 3 years ago

Just to get clarity: this is 16x the IPs while still using IPv4? Whereas longer term, for huge numbers of IPs etc., it's expected that EKS will shift to IPv6 instead?

mikestef9 commented 3 years ago

Exactly. The same upcoming EC2/VPC feature that will allow us to increase IPv4s per instance, will also allow us to allocate a /80 IPv6 address block per instance. That's what we will leverage for IPv6 support, which is a top priority for us in 2021.

davidroth commented 3 years ago

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.

@mikestef9 Sounds awesome. I'm currently evaluating EKS and the current pod limitation is a blocker for our workload. Could you please share an approximate release date? Thanks.

Z3R6 commented 3 years ago

Exactly. The same upcoming EC2/VPC feature that will allow us to increase IPv4s per instance, will also allow us to allocate a /80 IPv6 address block per instance. That's what we will leverage for IPv6 support, which is a top priority for us in 2021.

I'm currently evaluating EKS and the current pod limitation is a blocker for our workload. Could you please share an approximate release date? Thanks.

forsberg commented 3 years ago

Unfortunately, there are few things in life more certain than the fact that AWS will never ever ever ever ever ever share an approximate release date for a future feature. I'm pretty sure pigs will fly and the Universe will cease to exist before we see it happen.

billinghamj commented 3 years ago

It would be very helpful to at least get an idea of magnitude - are we talking weeks/months/quarters/years?

jwenz723 commented 3 years ago

@mikestef9 Can you please clarify for us if this feature will require using IPv6 in order to get the 1500% increase in IPs per node? I'm really hoping that I can continue to use IPv4 and get this pod density increase.

mikestef9 commented 3 years ago

The 1500% increase is for IPv4. With IPv6, you'll get a /80 per node, so pod density is no longer an issue for any use case.

dearsaturn commented 3 years ago

I come to this page every few months as a religious procedure

rohith-mr-rao commented 3 years ago

Issue summary: EKS worker nodes have a limit on the number of pods they can run based on the instance type. In our case we use the c5.2xlarge instance type, which allows 58 pods per instance. If more pods than that are scheduled on the same instance, those pods get stuck in ContainerCreating status. Describing the pods gives us the following error: " Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8be3026cc6bb2b311570f118dfef8ab93ae491e6d6fc20e12a46e9b814cff716" network for pod "pod-name": networkPlugin cni failed to set up pod "pod-name" network: add cmd: failed to assign an IP address to container "

As part of a solution we tried adding --use-max-pods false --kubelet-extra-args '--max-pods=110' to our EKS worker node bootstrap script. The issue remains the same. Please guide us to a solution, if there is one, to remove the pod restriction based on instance type and run the maximum number of pods on a worker node.

Note: AWS premium support suggested we consider larger instance types to achieve our use case (which is to run the maximum number of pods on a worker node). This suggestion is problematic in many ways, as listed below:

Z3R6 commented 3 years ago

F

stevehipwell commented 3 years ago

@rohith-mr-rao with the aws-vpc-cni, each pod needs an IP address on a node ENI. If you're NOT using a secondary subnet for your pods, the number of pods per instance type can be found in the EKS AMI repo and is automatically set for kubelet. If you ARE using a secondary subnet for your pods, you will need to calculate the limit yourself (the formula maxPods = (number of interfaces - 1) * (max IPv4 addresses per interface - 1) + 2 is in the EKS docs) and pass it to kubelet via --max-pods, and you should set the --use-max-pods false bootstrap argument so the default value isn't persisted in the config (although in practice this has no impact).
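For illustration, a minimal sketch of that calculation for an m5.large (assuming its EC2 limits of 3 ENIs with 10 IPv4 addresses each) and of how the result might be fed to the bootstrap script; treat the figures as assumptions to verify for your own instance type:

```bash
#!/bin/bash
# Assumed EC2 limits for m5.large: 3 ENIs, 10 IPv4 addresses per ENI.
ENIS=3
IPS_PER_ENI=10

# Formula from the EKS docs for CNI custom networking (primary ENI not used for pods).
MAX_PODS=$(( (ENIS - 1) * (IPS_PER_ENI - 1) + 2 ))
echo "max pods: ${MAX_PODS}"   # 20 for m5.large

# Then pass it to the EKS bootstrap script, e.g.:
# /etc/eks/bootstrap.sh my-cluster \
#   --use-max-pods false \
#   --kubelet-extra-args "--max-pods=${MAX_PODS}"
```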

thanhma commented 3 years ago

The current workaround is to use a third-party CNI plugin. I used Calico and can easily change max pods per node to whatever I want and utilize the host efficiently, but keep an eye on the resources reserved for kubelet and the node itself.

Very happy to see the status changed to Coming Soon after 2 years, and I hope we can soon try the new native CNI to test performance in large clusters.

tsndqst commented 3 years ago

Will this change increase the density when ENABLE_POD_ENI is true or only when it's false? Currently there are very few pod ENIs on each host.

davidroth commented 3 years ago

@mikestef9 If only we could get an estimated release date. It's been on "Coming Soon" for 2 months, but no idea if this means we need to wait another 2 months, 6 months, or a year?

Vlaaaaaaad commented 3 years ago

@davidroth the definition of "Coming Soon" is in the FAQ on this repository:

Q: What do the roadmap categories mean?

Just shipped - obvious, right?
Coming soon - coming up. Think a couple of months out, give or take.
We're working on it - in progress, but further out. We might still be working through the implementation details, or scoping stuff out.
Researching - We're thinking about it. This might mean we're still designing, or thinking through how this might work. This is a great phase to send how you want to see something implemented! We'd love to see your usecase or design ideas here.

Based on that, the release could be in 2 weeks, or it could be in 4 months. Software development is weird like that. AWS specifically never shares release dates; you never know if you'll discover a massive issue 2 days before release. "You can now run a lot more pods, but we drop 50% of network packets" is not something anybody wants.

There's also KubeCon+CloudNativeCon in October (and re:Invent in December, but that feels like a stretch).

adelwin commented 3 years ago

I'd like to say that some use cases are not necessarily about "increasing" pod density. My use case is actually to "reduce" pod density. We are running in an environment where the VPC (and therefore the IPs) is shared from a centralized network account, and it is already sliced up front across the accounts using the VPC. This led to an unfortunate situation where we actually "ran out of IPs".

So my use case is actually to reduce pod density to about 30-40 pods per node.

Solving this issue by setting max pods in an obscure file inside the AMI seems borderline "hard-coding". I've managed to solve the issue temporarily by using launch templates, as follows.

data "template_file" "launch_template_userdata" {
  template = file("${path.module}/templates/userdata.sh.tpl")

  vars = {
    max_pods  = 30
  }
}

resource "aws_launch_template" "default" {
  name_prefix             = "lt-${local.eks_name}-"
  update_default_version  = true

  user_data = base64encode(
    data.template_file.launch_template_userdata.rendered,
  )
...
}
# userdata.sh.tpl
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -e
sed -i -E "s/^USE_MAX_PODS=\"\\$\{USE_MAX_PODS:-true}\"/USE_MAX_PODS=false/" /etc/eks/bootstrap.sh
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".maxPods=$max_pods" $KUBELET_CONFIG)" > $KUBELET_CONFIG
--//--

Now where does that snippet come from? From eksctl code under nodebootstrap/managed_al2.go, adapted to Terraform. So it essentially patches the kubelet config during bootstrap.

It'd be nice to have either the Terraform provider or the AWS modules handle this simply, perhaps under the node_groups variables?

stevehipwell commented 3 years ago

Is the move to Coming Soon related to IP prefixes?

stevehipwell commented 3 years ago

It looks like it is via a new ENABLE_PREFIX_DELEGATION option in the aws-vpc-cni. Based on the docs it looks like this should work with custom networking too.

mikestef9 commented 3 years ago

Hey all,

You can run significantly more pods on AWS Nitro based instance types by upgrading to VPC CNI v1.9 and leveraging the integration with the recently released EC2 prefix assignment feature. An additional benefit of this feature is improved pod launch times, especially in clusters with high pod churn or lots of nodes, as fewer EC2 API calls to allocate network interfaces are required to support IP addresses for pods.

We expect to enable prefix assignment by default in a future release of the VPC CNI plugin. For now, you will need to set the variable ENABLE_PREFIX_DELEGATION to true and set a value for WARM_PREFIX_TARGET. More details can be found in the EKS documentation and VPC CNI GitHub repo.
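As a rough sketch of enabling this on an existing cluster, assuming the CNI runs as the standard aws-node DaemonSet in kube-system:

```bash
# Turn on prefix assignment on the VPC CNI (requires v1.9+ of the plugin).
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# Keep one spare /28 prefix attached per ENI so pod launches don't wait on EC2 API calls.
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1
```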

For backwards compatibility reasons, the default max-pods value in the file built into the EKS AMI is not changing. For users with self-managed nodes, we have included a helper script in the EKS AMI repo to help calculate the right max pods value based on instance type, CNI version, and CNI settings. Managed node groups now use a server-side version of this formula to automatically set the right max pods value, as long as you have upgraded to VPC CNI version 1.9. This helps both for prefix assignment use cases and for CNI custom networking, where you previously needed to manually set a lower max pods value.
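For self-managed nodes, fetching and running the helper script looks roughly like this (the raw GitHub path is an assumption; the flags are the ones used further down in this thread):

```bash
# Download the max pods helper script from the EKS AMI repo (path assumed).
curl -sO https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh

# Max pods for an m5.large with prefix delegation enabled on VPC CNI v1.9.0.
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0 --cni-prefix-delegation-enabled
```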

tsndqst commented 3 years ago

@mikestef9 Can you comment on my earlier question? Based on the description and CNI code I'm guessing the number of IP addresses when using POD ENI will not increase with this change.

Will this change increase the density when ENABLE_POD_ENI is true or only when it's false? Currently there are very few pod ENIs on each host.

mikestef9 commented 3 years ago

ENABLE_POD_ENI is for a totally separate networking mode not related to this launch, i.e. each pod gets its own (branch) network interface, versus prefix assignment, where each pod gets an IP from a prefix assigned to an interface. The number of branch network interfaces available per instance is advertised as a Kubernetes extended resource, and does not change as part of this release. There are essentially 3 distinct networking modes now with the VPC CNI plugin (in order of when they were launched):

  1. Pod gets an ENI secondary IP address
  2. Pod gets a dedicated branch network interface
  3. Pod gets an ENI prefix IP address

We are working on a containers blog post to dive into these networking modes in more detail, and use cases where each can be used.

stevehipwell commented 3 years ago

@mikestef9 could you confirm that ENABLE_PREFIX_DELEGATION will work correctly with custom networking?

zswanson commented 3 years ago

Am I reading this correctly that the new prefix feature does NOT allow use of security group on pods?

z0rc commented 3 years ago

@mikestef9 is there any way to manage the vpc-cni configuration aside from editing env vars via kubectl? That approach isn't declarative, it isn't possible to set this var at cluster creation, and it can be accidentally overridden on an addon upgrade.

stevehipwell commented 3 years ago

@z0rc I'd suggest using the official Helm chart.

z0rc commented 3 years ago

@stevehipwell AFAIK using the Helm chart introduces its own challenge, where I need to delete or properly annotate the existing deployment prior to chart installation.

stevehipwell commented 3 years ago

@z0rc that's what we do and again it's my recommended solution, but there is also the originalMatchLabels value to enable Helm to adopt the existing resources.
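Roughly, adoption with the official chart might look like this (the eks-charts repo URL and chart name are assumptions, and the existing aws-node resources may still need the Helm ownership annotations mentioned above):

```bash
# Add the AWS EKS charts repo and take over management of the VPC CNI with Helm.
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# originalMatchLabels keeps the chart's selector compatible with the pre-installed
# aws-node DaemonSet so Helm can adopt it rather than recreate it.
helm upgrade --install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system \
  --set originalMatchLabels=true
```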

mikestef9 commented 3 years ago

Yes, prefix delegation works with CNI custom networking enabled.

jwenz723 commented 3 years ago

@mikestef9 does enabling CNI custom networking reduce the number of IPs available per instance when using prefix delegation?

I believe 110 IPs are available to an m5.large instance when using prefix delegation without CNI custom networking. How many IPs would be available if CNI custom networking was enabled?

chandrakanthkannam commented 3 years ago

@mikestef9 this might be a basic question, but let's say I have an m5.large worker node deployed in a subnet whose CIDR range is x.x.x.x/26; then irrespective of the instance type (even for a larger instance), I will only be able to attach 4 (/28) prefixes, correct?

To make the most of this implementation, it's better to have larger CIDR ranges for the subnets where worker nodes are deployed. Am I understanding this correctly?

mikestef9 commented 3 years ago

@jwenz723 take a look at the helper script we recently added to the EKS AMI repo. Previously with custom networking, an m5.large node could only have max 20 pods.

./max-pods-calculator.sh --instance-type m5.large --cni-version '1.9.0' --cni-custom-networking-enabled
20

With prefix delegation, you still get one fewer ENI for pods, however, pod density is much higher because you can attach a /28 prefix to each slot on the ENI, instead of a /32 individual IP.

./max-pods-calculator.sh --instance-type m5.large --cni-version '1.9.0' --cni-custom-networking-enabled --cni-prefix-delegation-enabled
110

In this case, you can actually attach 288 IPs for use by pods to the instance (2 ENIs * 9 prefix slots per ENI * 16 IPs per /28 prefix), so you could override max pods up to 288 if you wanted to. However, the script is following best practices and capping this number at 110 for a smaller instance type.

@chandrakanthkannam correct, prefix assignment is not a solution for VPC private IPv4 space exhaustion, as the prefixes are still pulled from your VPC subnets. By default, that means the worker node subnet. If using CNI custom networking, you can specify a separate subnet from the worker node.
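A quick sketch of that subnet arithmetic, purely illustrative:

```bash
#!/bin/bash
# A /26 subnet holds 2^(32-26) = 64 addresses; each assigned prefix is an aligned /28 (16 addresses),
# so at most 64 / 16 = 4 prefixes can come out of that subnet, regardless of instance type
# (in practice possibly fewer once reserved and already-assigned addresses fragment the subnet).
SUBNET_BITS=26
PREFIX_BITS=28
echo $(( (1 << (32 - SUBNET_BITS)) / (1 << (32 - PREFIX_BITS)) ))   # 4
```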

stevehipwell commented 3 years ago

@jwenz723 depending on your daemonset usage you might be able to add a couple of IPs back from the host network.

zswanson commented 3 years ago

@mikestef9 can you clarify whether the new vpc prefix solution is compatible with Pod Security Groups?

mikestef9 commented 3 years ago

Depends what you mean by compatible. As I explained above, pod security groups leverages a separate networking mode, where each pod gets its own dedicated branch network interface.

Can pods getting branch interfaces/security groups co-exist with pods getting ENI prefix IP addresses in the same cluster and even on the same node? Yes

Is the prefix assignment launch related to the pod security groups feature at all, and does prefix assignment help increase pod density if you are only running pods with branch network interfaces/dedicated security groups? No

Again, we are working on a blog to dive into this in more detail. You need to choose your networking mode/strategy based on your use case and requirements. If pod launch time and pod density on smaller instance types is important to you, and you can work with node level security groups, then use prefix assignment.

If you have security requirements where pods need a specific set of security groups, then use a SecurityGroupPolicy and pods will be allocated dedicated branch network interfaces with those security groups applied. The max pods calculator script is not relevant for pod security groups, because that per-node number is instead limited via Kubernetes extended resources: the number of branch network interfaces is advertised as an extended resource, and a webhook injects a resource request for a branch network interface into any pod that requires one.
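For illustration only, a minimal SecurityGroupPolicy might look like the following (the selector labels and security group ID are placeholders; verify the CRD fields against the EKS documentation):

```bash
# Hypothetical example: attach a dedicated security group to pods labeled role=payments.
cat <<'EOF' | kubectl apply -f -
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-pods
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: payments
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0   # placeholder security group ID
EOF
```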

At the moment, there is no option to get the best of both worlds, ie pod level security groups with very high density and fast pod launch time. One potential idea is described in #1342, which we are researching, however, that becomes quite a tricky scheduling problem.

zswanson commented 3 years ago

Can pods getting branch interfaces/security groups co-exist with pods getting ENI prefix IP addresses in the same cluster and even on the same node? Yes

@mikestef9 This is what I meant, yes; hopefully your upcoming blog post will address this hybrid use case too. There's workload on the clusters that doesn't need a pod security group and branch ENI (i.e. Prometheus, Argo, etc.), and packing more of those in per node would be nice.

Would there be any concern about a pod needing a branch ENI for security groups getting scheduled on a node that is full on ENIs but has room for pods due to the prefixes? Or can the scheduler account for that?