aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Support multi node - same AZ scheduling of pods #3109

Closed (noyoshi closed 1 year ago)

noyoshi commented 1 year ago

Tell us about your request

I would like to be able to schedule a group of pods into the same AZ without having to specify the exact AZ through the topology node selector.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I am scheduling a heterogeneous cluster of pods, where each pod is on its own dedicated node, and each node can be a different AWS node type (some CPU, some GPU node types). Ideally, I would be able to tell Karpenter to schedule all nodes with a label X into the same AZ, and let Karpenter determine which AZ to place the nodes in.

I tried using the podAffinity rules, but we can encounter a race condition where a CPU node in my group will get scheduled in subnet A, while other nodes in my group are not available in that subnet. If the first pod is placed in an invalid subnet for other pods in the group, it causes the nodes that cannot go into subnet A to never come up, or eventually come up but in a different subnet.

Are you currently working around this issue?

I am querying AWS for the available AZs for each node type used in my group of pods, and then finding the intersection of the AZs for all the nodes in my group, as well as the set of AZs Karpenter can place nodes into.
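A minimal sketch of the intersection step described above, assuming the per-type AZ lists have already been fetched (for example through the EC2 DescribeInstanceTypeOfferings API). The function name `common_zones`, the instance types, and the zone data are all hypothetical and only illustrate the set arithmetic; they are not real availability data.

```python
from typing import Dict, Iterable, Set


def common_zones(offerings: Dict[str, Set[str]],
                 karpenter_zones: Iterable[str]) -> Set[str]:
    """Return the zones where every instance type is offered
    and where Karpenter is allowed to launch nodes."""
    # Start from the zones Karpenter may use, then narrow per instance type.
    zones = set(karpenter_zones)
    for type_zones in offerings.values():
        zones &= set(type_zones)
    return zones


# Hypothetical inputs: instance type -> zones that offer it.
offerings = {
    "m5.large": {"us-west-2a", "us-west-2b", "us-west-2c"},
    "p3.2xlarge": {"us-west-2b", "us-west-2c"},
    "g4dn.xlarge": {"us-west-2b"},
}
# Hypothetical zones covered by the subnets Karpenter can use.
karpenter_zones = ["us-west-2a", "us-west-2b"]

print(sorted(common_zones(offerings, karpenter_zones)))
```

Any zone in the result could then be pinned with a zone node selector on every pod in the group. An empty result means no single AZ can host the whole group.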

The other workaround would be to just use the pod affinity rule, and get around the race condition by only having Karpenter use subnets that can schedule all the node types I want to support. This is not great because as new AZs come online, I would not be able to keep the system updated in real time.

Additional Context

No response

Attachments

No response

jonathan-innis commented 1 year ago

> while other nodes in my group are not available in that subnet

What do you mean by this? Can you provide more details on the way that you are constraining your provisioner such that we can launch one node in one AZ but we would not be able to launch subsequent nodes in the same AZ?

jonathan-innis commented 1 year ago

Hey @noyoshi, I was able to get this working by using both podAffinity and podAntiAffinity for my deployment, like so:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: default
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: topology.kubernetes.io/zone
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["inflate"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["inflate"]

$ kubectl get nodes -l=karpenter.sh/provisioner-name -o=custom-columns=NAME:.metadata.name,ZONE:".metadata.labels.topology\.kubernetes\.io/zone"
NAME                                            ZONE
ip-192-168-100-221.us-west-2.compute.internal   us-west-2b
ip-192-168-101-23.us-west-2.compute.internal    us-west-2b
ip-192-168-107-24.us-west-2.compute.internal    us-west-2b
ip-192-168-108-248.us-west-2.compute.internal   us-west-2b
ip-192-168-111-161.us-west-2.compute.internal   us-west-2b
ip-192-168-112-41.us-west-2.compute.internal    us-west-2b
ip-192-168-115-76.us-west-2.compute.internal    us-west-2b
ip-192-168-120-206.us-west-2.compute.internal   us-west-2b
ip-192-168-96-70.us-west-2.compute.internal     us-west-2b
ip-192-168-99-230.us-west-2.compute.internal    us-west-2b
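
To check a listing like the one above programmatically, a small script can parse the two-column output and confirm that only one zone appears. This is a sketch only: `zones_from_kubectl` is a hypothetical helper, and the sample text is a shortened copy of the output above, not a live query.

```python
from typing import Set


def zones_from_kubectl(output: str) -> Set[str]:
    """Collect the distinct ZONE values from `kubectl get nodes`
    output that uses NAME and ZONE custom columns."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    # The first non-empty line is the header; the zone is the last column.
    return {line.split()[-1] for line in lines[1:]}


# Shortened sample of the listing above.
sample = """\
NAME                                            ZONE
ip-192-168-100-221.us-west-2.compute.internal   us-west-2b
ip-192-168-101-23.us-west-2.compute.internal    us-west-2b
ip-192-168-99-230.us-west-2.compute.internal    us-west-2b
"""

zones = zones_from_kubectl(sample)
print("single zone" if len(zones) == 1 else f"multiple zones: {sorted(zones)}")
```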