Closed. markdebruijne closed this issue 1 year ago.
Hi markdebruijne, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Forgot to mention: I tested it on other deployments / workloads as well. In all attempts, pods on Windows-backed node pools are not spread equally.
As we've faced an outage due to a crash of the node hosting all the containers, I have also raised an Azure support ticket: 2203230050001873
Microsoft has been able to reproduce the issue and acknowledged the need for a fix.
should it be "app: win-world" instead of "name: win-world" in your topologySpreadConstraints.labelSelector?
labelSelector:
matchLabels:
name: win-world
Yes, you're right. I forgot to update the code snippet in the comment afterwards; done now.
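For reference, a minimal sketch of the point being discussed, assuming the pod template carries the `app: win-world` label (the exact manifest from the original post is not repeated here):

```yaml
# Sketch only: the labelSelector of the constraint has to match the labels
# set on the pod template, otherwise the constraint selects no pods.
template:
  metadata:
    labels:
      app: win-world        # pod template label (assumed)
  spec:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: win-world  # must match the pod template label above
```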
Also with the corrected manifest, it still seems that the topologySpreadConstraints is not being picked up.
I cannot repro this with your sample. I have a cluster with 2 Windows nodes, and each node has been assigned 3 pods. It's the same behavior as on the Linux nodes.
When I drain 1 node, I get a scheduling error due to the topology spread constraints.
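For context, the drain step in a test like this would look roughly as follows (the node name is just an example); with `whenUnsatisfiable: DoNotSchedule`, evicted pods that would violate the constraint stay Pending:

```bash
# Cordon and evict everything from one of the Windows nodes (example node name).
kubectl drain akswin000001 --ignore-daemonsets --delete-emptydir-data

# Scheduling errors caused by the topology spread constraint show up as events:
kubectl get events --field-selector reason=FailedScheduling
```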
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
> I cannot repro this with your sample. I have a cluster with 2 Windows nodes, and each node has been assigned 3 pods. It's the same behavior as on the Linux nodes.

Exactly the behavior (yours) I'm expecting, @robbiezhang. The support ticket (mentioned above) is still open. After some revisions to the snippet, and other attempts, spread scheduling is still not working, nor can the issue be reproduced.
@markdebruijne, we cannot repro it internally. Wondering whether you can check the node labels in your repro environment.
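For anyone following along, one way to dump those labels for comparison (the node name is taken from the output below):

```bash
# List the Windows nodes together with all their labels:
kubectl get nodes -l kubernetes.io/os=windows --show-labels

# Or inspect a single node's labels as structured output:
kubectl get node aksscw1000001 -o jsonpath='{.metadata.labels}'
```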
Current labels on the Windows node(s) @robbiezhang
agentpool: scw1
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: Standard_D8as_v4
beta.kubernetes.io/os: windows
failure-domain.beta.kubernetes.io/region: westeurope
failure-domain.beta.kubernetes.io/zone: westeurope-2
kubernetes.azure.com/agentpool: scw1
kubernetes.azure.com/cluster: MC_rg-dig[**masked**]
kubernetes.azure.com/mode: user
kubernetes.azure.com/node-image-version: AKSWindows-2019-17763.2686.220309
kubernetes.azure.com/role: agent
kubernetes.io/arch: amd64
kubernetes.io/hostname: aksscw1000001
kubernetes.io/os: windows
kubernetes.io/role: agent
node-role.kubernetes.io/agent:
node.kubernetes.io/instance-type: Standard_D8as_v4
node.kubernetes.io/windows-build: 10.0.17763
topology.disk.csi.azure.com/zone: westeurope-2
topology.kubernetes.io/region: westeurope
topology.kubernetes.io/zone: westeurope-2
And on the second node, the labels that differ:
failure-domain.beta.kubernetes.io/region: westeurope
failure-domain.beta.kubernetes.io/zone: westeurope-1
kubernetes.io/hostname: aksscw1000002
topology.disk.csi.azure.com/zone: westeurope-1
topology.kubernetes.io/region: westeurope
topology.kubernetes.io/zone: westeurope-1
Please note: the Azure support ticket is also still pending, see https://github.com/Azure/AKS/issues/2862#issuecomment-1076385452
Action required from @Azure/aks-pm
We also see very similar behaviour to what @markdebruijne described. Hybrid environment: with Linux nodes it works consistently, with Windows nodes it does not. Often we don't know whether a correct spread is just a coincidence. In addition, a new node pool is not respected, i.e. not included in the topology spread at all.
Issue needing attention of @Azure/aks-leads
@immuzz, @justindavies would you be able to assist?
| Field | Value |
|---|---|
| Author | markdebruijne |
| Assignees | - |
| Labels | `bug`, `nodepools`, `windows`, `action-required`, `Needs Attention :wave:` |
| Milestone | - |
I use Kubernetes 1.22.6 (the oldest version supported now) and also cannot reproduce the problem. I'm not sure whether the problem has been fixed by the upgrade of the Kubernetes version. @markdebruijne, can you still reproduce the problem now? If so, we may need more details of your operations and the corresponding parameters. @chefcook, we also wish you to provide more details about how to reproduce your problem. Thank you all! cc @AbelHu
Details of my test:
az group create --name $myResourceGroup --location westeurope
az aks create -g $myResourceGroup -n $myAKSCluster --generate-ssh-keys --windows-admin-username azureuser --windows-admin-password $myPassword --kubernetes-version 1.22.6 --network-plugin azure --vm-set-type VirtualMachineScaleSets --node-vm-size "Standard_D2as_v4" --node-count 2 --zones 1 2
az aks get-credentials -g $myResourceGroup -n $myAKSCluster --overwrite-existing
az aks nodepool add --resource-group $myResourceGroup --cluster-name $myAKSCluster --name win19 --os-type Windows --node-count 2 --node-vm-size Standard_D8as_v4 --zones 1 2
kubectl apply -f test-linux.yaml
kubectl apply -f test-windows.yaml
`test-windows.yaml` is the same as the example above, and `test-linux.yaml` replaces all the windows parts with linux and uses nginx as the image. Pods spread equally both on the Linux nodes and the Windows nodes. When I drain 1 node, things also go well: pods are all moved to the other node.
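The exact manifests are not reproduced in this thread; a minimal sketch of what a `test-windows.yaml` along these lines could look like (replica count, image and labels are assumptions) is:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-world
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-world
  template:
    metadata:
      labels:
        app: win-world
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: win-world
      containers:
        - name: win-world
          image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
```

The `test-linux.yaml` variant would then swap the nodeSelector to `kubernetes.io/os: linux` and the image to `nginx`, as described above.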
@junjiezhang1997 I don't have the ability (time) to attempt to reproduce it myself. But in general we've upgraded multiple AKS clusters to v1.23.8 and still see unbalanced spread on Windows nodes, also without explicit (thus default) topologySpreadConstraints configuration, while using multiple nodes that are spread across availability zones. For example, on Standard_D8as_v4 (8 vCPU, 32 GB) nodes, all 16 workloads (one with 2 replicas, the others single pods) are running on the same node, leaving the other node doing nothing. So it seems broader than "replica spread"; more like a scheduling issue in general.
Thinking about specific characteristics: we do use Kustomize to deploy workloads, in which various deployments are applied as part of that single deployment action. But with the single win-world deployment from this thread we also had the issue, of course.
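To illustrate the setup described above, a hypothetical `kustomization.yaml` along these lines (file names are made up) would apply several deployments in one action:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - win-world-deployment.yaml   # the example from this thread
  - workload-a-deployment.yaml  # hypothetical additional workloads
  - workload-b-deployment.yaml
```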
P.S. At regular intervals we have Windows nodes with pods that remain in a Terminating state, in case that would be related.
@markdebruijne Thanks for your response. I have several questions that need your further clarification:

1. As you mentioned "without explicit (thus default) topologySpreadConstraints configuration", does it mean the configuration shown in the win-world example?
2. Since you at regular intervals have Windows nodes with pods that remain in a Terminating state, we wonder:
   - Are the Windows pods in the Terminating state when the spread issue happens (such as for win-world)?
   - Will this issue happen when you apply win-world in a new cluster?

cc @AbelHu
> @markdebruijne Thanks for your response. I have several questions that need your further clarification:
> As you mentioned "without explicit (thus default) topologySpreadConstraints configuration", does it mean the configuration shown in the win-world example?

I mean a configuration without topologySpreadConstraints specified at all. In that case it will revert back to the Kubernetes default (I would assume), and that seems to be something like maxSkew: 5 based on availability zone.

> Are the Windows pods in the Terminating state when the spread issue happens (such as for win-world)?

Not known at this stage.

> Will this issue happen when you apply win-world in a new cluster?

Not sure about this. We think the chance increases when we deploy multiple K8s deployments at once.
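For reference, the "Kubernetes default" mentioned above refers to the kube-scheduler's built-in cluster-level spread constraints, which (unless overridden in the scheduler configuration) are documented as roughly the following; note that `whenUnsatisfiable: ScheduleAnyway` makes them a soft preference only:

```yaml
# Built-in defaults of the PodTopologySpread scheduler plugin (per the
# Kubernetes docs); they only apply when a pod defines no constraints itself.
defaultConstraints:
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
defaultingType: System
```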
To summarize: I still cannot repro the issue when I deploy multiple (three in my test) K8s deployments at once. @markdebruijne could you help provide a script with a high chance of reproducing this issue? cc @AbelHu
Closed due to no response for a long time.
What happened:
- AKS cluster with a Linux (AKSUbuntu-1804gen2containerd-2022.03.02) and a Windows (AKSWindows-2019-17763.2686.220309) node pool. Node pools are configured with all three availability zones usable in the west-europe region. Kubernetes version 1.21.9.
- With topologySpreadConstraints, Linux pods are spread equally across nodes (and thus availability zones).
- Windows pods are not spread equally, even with topologySpreadConstraints configured.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:
- Kubernetes version (use kubectl version): upgraded from v1.20.9 to v1.21.9, but in the previous version also the same issue.

Details about the node that remains empty are attached. When I drain node A, pods are being moved to node B.
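For completeness, a quick sketch of how the placement can be checked per node and zone (the label and names follow the win-world example discussed above):

```bash
# Which node did each replica land on?
kubectl get pods -l app=win-world -o wide

# Windows nodes with their availability zone, to compare against the pod placement:
kubectl get nodes -l kubernetes.io/os=windows -L topology.kubernetes.io/zone
```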