tallaxes opened this issue 8 months ago
Hi @tallaxes, I'm trying the workaround to spread pods across different zones, but Karpenter still fails to create the NodeClaim. I didn't add the zone requirement to the NodePool.
I'm trying to schedule this deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: inflate
      containers:
        - name: inflate
          image: mcr.microsoft.com/oss/kubernetes/pause:3.6
          resources:
            requests:
              cpu: 1
```
but I receive this error in the Karpenter logs
```
{"level":"ERROR","time":"2024-06-05T16:28:58.144Z","logger":"controller.provisioner","message":"creating node claim, NodeClaim.karpenter.sh \"general-purpose-kfwpp\" is invalid: spec.requirements[2].key: Invalid value: \"string\": label domain \"kubernetes.io\" is restricted; creating node claim, NodeClaim.karpenter.sh \"general-purpose-h8j2d\" is invalid: spec.requirements[5].key: Invalid value: \"string\": label domain \"kubernetes.io\" is restricted","commit":"bbaa9b7"}
```
Using, for example, topologyKey: "kubernetes.io/hostname", Karpenter does schedule 2 new nodes with one pod on each, but both nodes end up in the same zone.
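For reference, this is the only part of the deployment above that changed in that test (the rest of the spec stayed the same):

```yaml
# Same topologySpreadConstraints block as in the deployment above,
# with only the topologyKey changed to spread across hosts
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: inflate
```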
Can you give me more info about your workaround? Thank you
Hello @tallaxes @Bryce-Soghigian
I've been thoroughly testing Karpenter as I plan to leverage it for my workloads, and I am also having problems with the zone configuration in the NodePool. I have a NodePool configured with the following requirement:
```yaml
...
- key: karpenter.azure.com/zone
  operator: In
  values: ["eastus-1", "eastus-2", "eastus-3"]
```
Whenever I apply it, I get the following message:
```
The NodePool "krptpool" is invalid: spec.template.spec.requirements[6].key: Invalid value: "string": label domain "karpenter.azure.com" is restricted
```
Thus, the workaround mentioned doesn't seem to work (doesn't even apply here, unless I am doing something really silly).
Can you guys please take a look at it?
Also, I would like to point out that the documentation here https://learn.microsoft.com/en-us/azure/aks/node-autoprovision?tabs=azure-cli#sku-selectors-with-well-known-labels lists another key, topology.kubernetes.io/zone, which is also reserved and doesn't work, so it would be a good idea to update it, IMO.
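For reference, this is the shape of a requirement using the key from those docs, which, as described above, also doesn't work (zone values are placeholders):

```yaml
# Requirement fragment using the documented topology.kubernetes.io/zone key;
# per the report above, this is also treated as a restricted label domain
- key: topology.kubernetes.io/zone
  operator: In
  values: ["eastus-1", "eastus-2", "eastus-3"]
```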
Version
Karpenter Version: https://github.com/Azure/karpenter-provider-azure/commit/99d1bb0730b3462e40267ec28024368c90801b26 (current main)
Kubernetes Version: v1.27.9
Expected Behavior
One should be able to specify requirements with a karpenter.azure.com/zone constraint, for example to only provision nodes in a specific zone, without adverse effects.
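For example, a NodePool requirement along these lines (an illustrative fragment; the zone value is a placeholder) should be accepted and honored without side effects:

```yaml
# Illustrative fragment of spec.template.spec.requirements in a NodePool;
# the zone value is a placeholder for a real availability zone
- key: karpenter.azure.com/zone
  operator: In
  values: ["eastus-1"]
```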
Actual Behavior
Specifying any kind of karpenter.azure.com/zone constraint in a NodePool currently triggers continuous drift.

Here is what I think is going on. Right now, we cannot (and do not) record this requirement/constraint as a label on the NodeClaim. This is because Karpenter will try applying all of these as labels to the Node object, and topology.kubernetes.io/zone is a protected label in AKS. (It will be applied to a new Node correctly, but by a different component.) So for now, as a workaround, we use an alternative label, karpenter.azure.com/zone. I suspect that it is this discrepancy that causes Karpenter to detect Requirements drift: based on the NodePool, the NodeClaim is expected to have the zone label, and it does not => out of spec, to be replaced. I also suspect that, while we do have E2E tests in this area, they likely only test that the node gets provisioned, and don't notice the subsequent drift.
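A rough sketch of the suspected mismatch (hypothetical objects; names and zone values are placeholders):

```yaml
# Hypothetical NodePool fragment: the zone requirement drift detection compares against
# spec.template.spec.requirements:
#   - key: karpenter.azure.com/zone
#     operator: In
#     values: ["eastus-1"]
#
# Hypothetical NodeClaim created for it: the zone requirement is not recorded
# as a label, so the expected zone label appears to be missing
metadata:
  name: general-purpose-xxxxx              # placeholder NodeClaim name
  labels:
    karpenter.sh/nodepool: general-purpose # placeholder NodePool name
    # no zone label here, even though the NodePool requires one;
    # topology.kubernetes.io/zone is applied to the Node later, by an AKS
    # component outside Karpenter
```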
Steps to Reproduce the Problem
Use a NodePool with any kind of karpenter.azure.com/zone requirement.

Resource Specs and Logs
Continuous drift observed.
Workaround
Specify zone-based constraints (including topologySpreadConstraint, if needed) via workload, rather than NodePool.
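A minimal sketch of one way this can look on the workload side, assuming the zone constraint is expressed through the alternative karpenter.azure.com/zone label described above (the zone value is a placeholder):

```yaml
# Illustrative only: zone pinned at the workload level via node affinity on
# the alternative karpenter.azure.com/zone label; the zone value is a placeholder
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: karpenter.azure.com/zone
                    operator: In
                    values: ["eastus-1"]
      containers:
        - name: inflate
          image: mcr.microsoft.com/oss/kubernetes/pause:3.6
          resources:
            requests:
              cpu: 1
```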
Community Note