Closed: noyoshi closed this issue 1 year ago
while other nodes in my group are not available in that subnet
What do you mean by this? Can you provide more details on how you are constraining your provisioner such that Karpenter can launch one node in one AZ but cannot launch subsequent nodes in the same AZ?
Hey @noyoshi, I was able to get this working by using both podAffinity and podAntiAffinity for my deployment, like so:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: default
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: topology.kubernetes.io/zone
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["inflate"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["inflate"]
➜ karpenter git:(main) kubectl get nodes -l=karpenter.sh/provisioner-name -o=custom-columns=NAME:.metadata.name,ZONE:".metadata.labels.topology\.kubernetes\.io/zone"
NAME ZONE
ip-192-168-100-221.us-west-2.compute.internal us-west-2b
ip-192-168-101-23.us-west-2.compute.internal us-west-2b
ip-192-168-107-24.us-west-2.compute.internal us-west-2b
ip-192-168-108-248.us-west-2.compute.internal us-west-2b
ip-192-168-111-161.us-west-2.compute.internal us-west-2b
ip-192-168-112-41.us-west-2.compute.internal us-west-2b
ip-192-168-115-76.us-west-2.compute.internal us-west-2b
ip-192-168-120-206.us-west-2.compute.internal us-west-2b
ip-192-168-96-70.us-west-2.compute.internal us-west-2b
ip-192-168-99-230.us-west-2.compute.internal us-west-2b
Tell us about your request
I would like to be able to schedule a group of pods onto the same AZ, without having to specify the exact AZ through the topology node selector.
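For context, a minimal sketch of the explicit approach this request is trying to avoid, assuming a pod that pins itself to one hard-coded zone with a node selector (the names and the zone value below are only illustrative, not from the issue):

apiVersion: v1
kind: Pod
metadata:
  name: group-member            # illustrative name
  labels:
    app: my-group               # illustrative group label
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-west-2a   # the exact AZ has to be chosen up front
  containers:
    - name: worker
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.2

The feature request is to express "same zone as the rest of the group" without hard-coding us-west-2a anywhere.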
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am scheduling a heterogeneous cluster of pods, where each pod is on its own dedicated node, and each node can be a different AWS node type (some CPU, some GPU node types). Ideally, I would be able to tell Karpenter to schedule all nodes with a label X onto the same AZ, and let Karpenter determine which AZ to place the nodes in.
I tried using the podAffinity rules, but we can encounter a race condition where a CPU node in my group will get scheduled in subnet A, while other nodes in my group are not available in that subnet. If the first pod is placed in a subnet that is invalid for the other pods in the group, the nodes that cannot go into subnet A either never come up, or eventually come up in a different subnet.
Are you currently working around this issue?
I am querying AWS for the available AZs for each node type used in my group of pods, then taking the intersection of those AZs across all the nodes in my group, together with the set of AZs Karpenter can place nodes into.
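A hedged sketch of that lookup using the AWS CLI (the instance types below are placeholders): describe-instance-type-offerings with --location-type availability-zone lists the AZs that actually offer a given instance type, and intersecting the results across all instance types in the group gives the candidate zones.

# AZs offering the GPU instance type (placeholder type)
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=p3.2xlarge \
  --query 'InstanceTypeOfferings[].Location' --output text

# AZs offering the CPU instance type (placeholder type)
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=c5.xlarge \
  --query 'InstanceTypeOfferings[].Location' --output text

# The usable zones are the intersection of these lists and the zones the provisioner is allowed to use.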
The other workaround would be to just use the pod affinity rule and avoid the race condition by only letting Karpenter use subnets that can schedule all of the node types I want to support. This is not great because, as new AZs come online, I would not be able to keep the system updated in real time.
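For completeness, a minimal sketch of that second workaround, assuming the v1alpha5 Provisioner API: restrict the provisioner to the zones where every instance type in the group is offered. The zone values are illustrative and would have to be maintained by hand, which is the drawback described above.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Only zones in which every instance type used by the group is available;
    # illustrative values that must be updated manually as offerings change.
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-west-2a", "us-west-2b"]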
Additional Context
No response
Attachments
No response