aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Multi AZ Subnet to be selected only if there is available IPs #2921

Open zakariais opened 1 year ago

zakariais commented 1 year ago

Version

Karpenter Version: v0.18.1
Kubernetes Version: v1.23

Expected Behavior

Karpenter should select a subnet with available IPs from all subnets available to EKS. We are facing an issue where the subnet in one AZ is running out of IPs; we have one subnet for each AZ. I understand the subnet is chosen randomly, which is ok, but a given subnet might run out of IPs, and it would be better if Karpenter selected the subnet with (the most?) available IPs from all subnets available in the VPC.

Actual Behavior

If multiple subnets from different AZs are available, Karpenter chooses one randomly, without considering whether the subnet has available IPs. Unfortunately, we have a situation where we are running out of IPs in a specific availability zone, and Karpenter keeps creating more instances in that zone without even considering other subnets in other AZs.

Steps to Reproduce the Problem

  1. Have multiple subnets that match the subnetSelector from your provisioner, one per AZ. One of those subnets must have no free IPs.
  2. Scale up a deployment so that Karpenter needs to create an instance (I use the inflate deployment from the Getting Started tutorial; a sketch is shown below). With a few tries, you will most probably end up with the provisioner selecting that exhausted subnet.
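
For reference, the inflate deployment from the Getting Started guide is roughly the following (a sketch; the exact image tag and resource requests may differ between doc versions). Scaling its replicas up forces Karpenter to provision new capacity.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0            # scale this up to trigger provisioning
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1     # each replica requests a full CPU so nodes fill up quickly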

Resource Specs and Logs

Provisioner spec:

spec:
  kubeletConfiguration: {}
  labels:
    group: default
  limits: {}
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    instanceProfile: <our_instance_profile_name>
    kind: AWS
    launchTemplate: <our_launch_template_name>
    securityGroupSelector:
      karpenter.sh/cluster/clusterName: "owned"
    subnetSelector:
      karpenter.sh/cluster/clusterName: "owned"
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.4xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400


FernandoMiguel commented 1 year ago

Part of your problem comes from using Spot, which will choose the cheapest AZ. AZ prices can be dramatically different between them.

the other part of the problem comes from https://github.com/aws/karpenter/issues/2572

You could consider using CGNAT space to extend your IP range. Using 100.64.x.x/19 subnets per AZ would increase your IP pool to the point where you wouldn't run into this issue anytime soon.

But yes, Karpenter could do a better job of avoiding AZs without free IPs. In the meantime you could use different provisioners, each with different AZs in them, and weight the provisioners to avoid the more heavily used AZs (see the sketch below).
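
As a sketch of that weighted-provisioner workaround (assuming the karpenter.sh/v1alpha5 Provisioner API, which supports spec.weight; the names, zones, and providerRef here are illustrative), you could define one provisioner per group of AZs and give the IP-rich zones a higher weight:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: preferred-zones            # zones with plenty of free IPs
spec:
  weight: 50                       # higher weight is preferred first
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-east-1b", "us-east-1c"]
  providerRef:
    name: default                  # hypothetical AWSNodeTemplate; an inline provider block works too
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: constrained-zone           # the IP-starved zone, used only as a fallback
spec:
  weight: 10
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-east-1a"]
  providerRef:
    name: default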

liorfranko commented 1 year ago

We wouldn't want to override Karpenter choosing the cheapest AZ. We think that Karpenter should choose the cheapest AZ, but if the subnet is full, fall back to the next AZ.

tjhiggins commented 1 year ago

I was just about to create a similar issue. Agreed with @liorfranko on picking the cheapest AZ but falling back to the next available one. We are also using Spot and running out of IPs. I have 3 subnets, each in its own AZ, and it mostly allocates IPs in the first AZ only.

FernandoMiguel commented 1 year ago

Class C CIDRs are too small for EKS given each pod eats one IP. Take a look https://aws.amazon.com/blogs/containers/addressing-ipv4-address-exhaustion-in-amazon-eks-clusters-using-private-nat-gateways/

tjhiggins commented 1 year ago

Class C CIDRs are too small for EKS given each pod eats one IP. Take a look https://aws.amazon.com/blogs/containers/addressing-ipv4-address-exhaustion-in-amazon-eks-clusters-using-private-nat-gateways/

Can you still expose services publicly if you use a private nat gateway? We use traefik behind an nlb on eks.

The Karpenter terraform install tutorial should probably be updated to use larger subnets then or use a private nat gateway.

FernandoMiguel commented 1 year ago

@tjhiggins "carrier grade NAT" is an industry name. It has nothing to do with AWS VPC Nat Gateways. Nothing changes in that part of the infrastructure.

Karpenter is agnostic. It's up to practitioners to architect their infrastructure. AWS Best Practices do recommend clients to use extended subnets so they don't run out of IPs

FernandoMiguel commented 1 year ago

Given that Traefik is just another pod running in the cluster, and the ALB/NLB will be talking to those pods' IPs, everything is perfectly reachable. Nothing changes other than the addressing: instead of everything having a private subnet IP in the 10/16 CIDR range, your pods will be in 100.64/19 while your EC2 ENIs keep their 10.x/16 IPs.

tjhiggins commented 1 year ago

@tjhiggins "carrier grade NAT" is an industry name. It has nothing to do with AWS VPC Nat Gateways. Nothing changes in that part of the infrastructure.

Karpenter is agnostic. It's up to practitioners to architect their infrastructure. AWS Best Practices do recommend clients to use extended subnets so they don't run out of IPs

I understand that Karpenter eventually wants to be agnostic, but it currently only supports AWS and has documentation for creating your VPC, which could be updated to use a private NAT: https://karpenter.sh/v0.18.1/getting-started/getting-started-with-terraform/#create-a-cluster

Thank you for the suggestion on a private NAT, I will give that a go.

bwagner5 commented 1 year ago

Karpenter does sort subnets by available IPs within the same AZ. But when a request is made for capacity that doesn't constrain the AZ, we defer to EC2 Fleet to make that decision. Since Fleet is unaware of pods requiring IP addresses, it could make a decision where there are enough IPs for the node but not for its pods. We could assume that pods will need IP addresses and exclude subnets that don't have enough IPs for max-pods + 1 (probably through instance type offerings). This is probably safe in most cases, although it doesn't make sense for cases where you are using an overlay network CNI.

If you are able to add more subnets in the same AZ, Karpenter should select the one with the most IPs available.

James-Quigley commented 1 year ago

You can utilize Pod Topology Spread Constraints to help evenly distribute your workloads across AZs (a minimal example is shown below).
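
A minimal sketch of such a constraint in a pod template (the label and skew values are illustrative):

topologySpreadConstraints:
  - maxSkew: 1                              # allow at most one pod of imbalance between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule        # use ScheduleAnyway for a soft preference
    labelSelector:
      matchLabels:
        app: my-app                         # must match the labels on the pods being spread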

igoratencompass commented 1 year ago

We have experienced the same issue with uneven distribution of nodes across AZs. You can see here Karpenter launching 10 nodes in the same 10.171.236.x subnet in the same AZ:

$ kc get nodes -l karpenter.sh/provisioner-name=default -o wide
NAME                                           STATUS   ROLES    AGE     VERSION                INTERNAL-IP      EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-171-228-10.eu-west-1.compute.internal    Ready    <none>   7d3h    v1.22.15-eks-fb459a0   10.171.228.10    <none>        Amazon Linux 2   5.4.219-126.411.amzn2.x86_64   containerd://1.6.6
ip-10-171-230-123.eu-west-1.compute.internal   Ready    <none>   7h22m   v1.22.15-eks-fb459a0   10.171.230.123   <none>        Amazon Linux 2   5.4.219-126.411.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-121.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.121   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-203.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.203   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-205.eu-west-1.compute.internal   Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.205   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-30.eu-west-1.compute.internal    Ready    <none>   31m     v1.22.15-eks-fb459a0   10.171.233.30    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-233-61.eu-west-1.compute.internal    Ready    <none>   4h12m   v1.22.15-eks-fb459a0   10.171.233.61    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-109.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.109   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-12.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.12    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-126.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.126   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-157.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.157   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-191.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.191   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-200.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.200   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-243.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.243   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-245.eu-west-1.compute.internal   Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.245   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-64.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.64    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-236-67.eu-west-1.compute.internal    Ready    <none>   30m     v1.22.15-eks-fb459a0   10.171.236.67    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-179.eu-west-1.compute.internal   Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.179   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-196.eu-west-1.compute.internal   Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.196   <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-30.eu-west-1.compute.internal    Ready    <none>   33m     v1.22.15-eks-fb459a0   10.171.239.30    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6
ip-10-171-239-65.eu-west-1.compute.internal    Ready    <none>   32m     v1.22.15-eks-fb459a0   10.171.239.65    <none>        Amazon Linux 2   5.4.226-129.415.amzn2.x86_64   containerd://1.6.6

It seems like Karpenter decided 10 nodes were needed to allocate the pending pods, chose an AZ, and (I assume) randomly chose a subnet from that AZ, then launched all 10 of them in that one subnet. While I agree a topology spread constraint can help here, I would still expect Karpenter to implement better randomization logic when choosing the subnet, i.e. select 5 subnets and launch 2 nodes in each, for example.

ellistarn commented 1 year ago

Right now, the algorithm optimizes for cost.

We've heard this feedback a fair bit. One option is to inject an implicit zonal topology rule into each pod as part of scheduling, unless of course the user has defined a different topology rule. This will yield default spread behavior across workloads, resulting in rough capacity balance.
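
For illustration only, such an implicit rule could behave roughly as if every pod carried a soft zonal spread like the following (hypothetical; this is not an existing Karpenter setting):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway       # soft, so an explicit user-defined rule still takes precedence
    labelSelector: {}                       # conceptually: the pod's own workload labels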

stevehipwell commented 1 year ago

Wouldn't custom networking (secondary subnets) solve this?

CorianderCake commented 1 year ago

Faced the same issue. A workaround I've implemented was to add a stage in my CD pipeline that queries the private subnets, retrieves the one with the fewest available IPs, and injects it into a "NOT IN" affinity stanza in my Helm chart (a sketch is shown below).
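
A sketch of what that injected stanza can look like, assuming one subnet per AZ as in this issue, so excluding the subnet amounts to excluding its zone (the zone value is a placeholder filled in by the CD stage):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: NotIn
              values:
                - us-east-1a       # zone of the subnet with the fewest free IPs, injected by the pipeline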

dixneuf19 commented 1 year ago

Hi, we had the same issue in our EKS cluster, with a subnet that had no more IP addresses left and pods taking hours to start because the CNI was waiting for an available IP. We found some workloads not using spread constraints, so fixing this might help. However, I agree that having such a feature would be a nice improvement for Karpenter reliability!

jonathan-innis commented 1 year ago

The main issue that we've run into here is that we have no idea exactly how many pods will schedule to a given node, so when an AZ is close to exhausting all of its available IPs, it's unclear whether we should exclude this AZ or allow it to pass through.

In the other case, where we just try to prioritize the AZs that have more available IPs, we run into issues with cost optimization because the AZ with more IPs may actually be more expensive.

In a way, this issue is very similar to #1292 where there is an ask to create an implicit provisioner-wide topologySpread on AZs.

dougbyrne commented 1 year ago

Wouldn't custom networking (secondary subnets) solve this?

It can help, but you can still run out of IPs. A price or availability difference between AZs can result in all nodes being launched in one AZ.

stevehipwell commented 1 year ago

Wouldn't custom networking (secondary subnets) solve this?

It can help, but you can still run out of IPs. A price or availability difference between AZs can result in all nodes being launched in one AZ.

@dougbyrne the custom networking guidance is to split a /16 CIDR block over your zones, so I'm not sure how you would run out of IPs without also failing a topology spread constraint, or without the imbalance affecting instance pricing enough to push the balance back towards equilibrium?

dougbyrne commented 1 year ago

I might be thinking of a different feature. I've added additional subnets, but each subnet is still associated with a specific zone. The example given in the AWS docs does the same: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html#custom-networking-configure-vpc

If I'm missing something I'd love to know because what you're describing is what I want.

stevehipwell commented 1 year ago

@dougbyrne you'd be creating a subnet per zone, but based on the recommended /16 CIDR from the CG-NAT space that's over 21k pods per AZ. I'd suggest looking at the EKS best practice guides and configuring both custom networking and IP prefix mode together; IMHO this should be the default configuration.

https://aws.github.io/aws-eks-best-practices/networking/custom-networking/
https://aws.github.io/aws-eks-best-practices/networking/prefix-mode/index_linux/
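
As a rough sketch of the pieces involved (the subnet and security group IDs are placeholders; see the linked guides for the authoritative steps), custom networking uses one ENIConfig per AZ pointing at the 100.64.x.x secondary subnet, combined with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true and ENABLE_PREFIX_DELEGATION=true on the aws-node daemonset:

apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: eu-west-1a                      # named after the AZ when ENI_CONFIG_LABEL_DEF is topology.kubernetes.io/zone
spec:
  subnet: subnet-0123456789abcdef0      # secondary (100.64.x.x) subnet in this AZ
  securityGroups:
    - sg-0123456789abcdef0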

cdenneen commented 4 months ago

The current issue is: I have, let's say, 2-3 subnets in each AZ (/24 each) and Spot says az-1 has the cheapest price. Karpenter sees all the subnets for all AZs (including the multiple subnets for az-1), but it ends up exhausting one subnet completely. It is not distributing the nodes across all the subnets in that particular AZ.

stevehipwell commented 3 months ago

@cdenneen it sounds like you need to configure topology spread constraints if you have more specific requirements than just the cheapest compute.

cdenneen commented 3 months ago

@stevehipwell there is supposedly a backend issue the Karpenter team knows about: when Spot is used, Karpenter doesn't take IP exhaustion in the subnet into account, so when 1f, for example, is deemed cheapest for Spot but there are 2-3 subnets for 1f, it exhausts one of them and not the others. The issue is that the first 1f subnet is used for the node and its pods by default, so multiple nodes end up in 1f (subnet 1) instead of being split across the multiple 1f subnets. Support is working with the Karpenter team on this.

stevehipwell commented 3 months ago

@cdenneen having Karpenter understand IP exhaustion seems like a sensible idea. But I'd expect it to be largely unnecessary for clusters with topology constraints configured as per recommendations. Unless maybe the AZs have widely different IP sizes.

I guess my point is are you down to your last couple of IPs on all AZs where Karpenter knowing the limits might help, or is your cluster heavily biased towards the AZ with the cheapest instances? Also is there a reason you can't use secondary networking?

Even if Karpenter were to understand IP limits, wouldn't the whole system break down once you had provisioned nodes with more availability for pods than there were free IPs? Karpenter doesn't control scheduling for pods onto existing nodes so this would be a K8s scheduler responsibility.

dougbyrne commented 3 months ago

@stevehipwell balanced usage of AZs does not ensure that the multiple subnets within a single AZ are used in a balanced way. Enhanced subnet discovery in the VPC ENI might help here.

stevehipwell commented 3 months ago

@dougbyrne I agree it could be useful. My point was that secondary networking would make it a non-issue. If secondary networking wasn't possible for some reason (I'm not sure I can think of one), then topology spread would likely take you as far as you can get within the operational parameters of the K8s scheduler, even if Karpenter were aware.

jessebye commented 2 months ago

In our case, we hit this problem because our teams create dozens of namespaces with single-replica pods for development. Topology spread doesn't work because A) the pods are distributed across many namespaces, and spread is only within a namespace, and B) each of the pods is a single replica and spread is really geared toward multiple replica pods. I guess we could try spread based on a shared label across all our service pods and see if that helps spread them across zones, but feels like it doesn't fit the intended use case for spread and might not work.

While we are trying to switch to use of a secondary CIDR and larger subnets, we are still concerned about the scenario of a zone failing and taking out the majority of our pods at once. Ideally, Karpenter could have a knob to force zone spread even if the spot pricing is better in one zone to avoid this scenario.

stevehipwell commented 2 months ago

@jessebye for your scenario it sounds like you could use a soft pod anti-affinity backed by a label.

If each of your singletons is given the singleton: true label, then the following would spread them out fairly evenly across your zones.

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:    # soft rule: prefer not to co-locate singletons in one zone
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: singleton                          # matches any pod carrying the singleton label
              operator: Exists
        topologyKey: topology.kubernetes.io/zone