Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Pod Topology Spread Constraints not working for AKS Windows node pool #2862

Closed. markdebruijne closed this issue 1 year ago.

markdebruijne commented 2 years ago

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-world
  labels:
    app: win-world
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-world
  template:
    metadata:
      labels:
        app: win-world
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: win-world
      containers:
      - name: win-world
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        ports:
        - containerPort: 80
      nodeSelector:
        kubernetes.io/os: windows
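Not part of the original report, but a quick way to check the resulting skew. The pipeline below runs on sample `kubectl get pods -o wide` output (pod names, IPs, and node names are illustrative) so the counting step can be shown offline; against a live cluster you would pipe the real `kubectl` output instead.

```shell
# Sample output of: kubectl get pods -l app=win-world -o wide --no-headers
# (NODE is the 7th column; names and IPs below are made up for illustration)
sample='win-world-a 1/1 Running 0 5m 10.240.0.4 aksscw1000001 <none> <none>
win-world-b 1/1 Running 0 5m 10.240.0.5 aksscw1000001 <none> <none>
win-world-c 1/1 Running 0 5m 10.240.0.6 aksscw1000002 <none> <none>'

# Count pods per node; with maxSkew: 1 the per-node counts should differ by at most 1.
printf '%s\n' "$sample" | awk '{print $7}' | sort | uniq -c
```

Against the real cluster the equivalent would be `kubectl get pods -l app=win-world -o wide --no-headers | awk '{print $7}' | sort | uniq -c`.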

Anything else we need to know?:

Environment:

Details about the node that remains empty are attached. When I drain node A, the pods are moved to node B. (screenshot)

ghost commented 2 years ago

Hi markdebruijne, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

markdebruijne commented 2 years ago

Forgot to mention: I tested this on other deployments and workloads as well. In all attempts, pods on Windows-backed node pools are not spread equally.

As we've faced an outage due to a crash of a node hosting all the containers, I have also raised an Azure support ticket: 2203230050001873.

markdebruijne commented 2 years ago

Microsoft has been able to reproduce the issue and acknowledged the need for a fix.

robbiezhang commented 2 years ago

Should it be "app: win-world" instead of "name: win-world" in your topologySpreadConstraints.labelSelector?

          labelSelector:
            matchLabels:
              name: win-world

markdebruijne commented 2 years ago

Should it be "app: win-world" instead of "name: win-world" in your topologySpreadConstraints.labelSelector?

          labelSelector:
            matchLabels:
              name: win-world

Yes, you're right. I forgot to update the code snippet in the comment afterwards; done now. However, even with the correct manifest, it still seems that the topologySpreadConstraints are not being picked up.

robbiezhang commented 2 years ago

I cannot repro this with your sample. I have a cluster with 2 Windows nodes, and each node has been assigned 3 pods. It's the same behavior as the Linux nodes. (screenshot)

When I drain 1 node, I get the scheduling error due to the topology spread constraints. (screenshots)

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads


markdebruijne commented 2 years ago

I cannot repo this with your sample. I have a cluster with 2 windows nodes, and each node has be assigned with 3 pods. It's the same behavior as the linux nodes.

Exactly the behavior you show is what I'm expecting, @robbiezhang. The support ticket (mentioned above) is still open. After some revisions to the snippet, and other attempts, spread scheduling is still not working, nor can the issue be reproduced.

robbiezhang commented 2 years ago

@markdebruijne , we cannot repro it internally. Wondering whether you can check the node labels in your repro environment.

markdebruijne commented 2 years ago

Current labels on the Windows node(s), @robbiezhang:

agentpool: scw1
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: Standard_D8as_v4
beta.kubernetes.io/os: windows
failure-domain.beta.kubernetes.io/region: westeurope
failure-domain.beta.kubernetes.io/zone: westeurope-2
kubernetes.azure.com/agentpool: scw1
kubernetes.azure.com/cluster: MC_rg-dig[**masked**]
kubernetes.azure.com/mode: user
kubernetes.azure.com/node-image-version: AKSWindows-2019-17763.2686.220309
kubernetes.azure.com/role: agent
kubernetes.io/arch: amd64
kubernetes.io/hostname: aksscw1000001
kubernetes.io/os: windows
kubernetes.io/role: agent
node-role.kubernetes.io/agent:
node.kubernetes.io/instance-type: Standard_D8as_v4
node.kubernetes.io/windows-build: 10.0.17763
topology.disk.csi.azure.com/zone: westeurope-2
topology.kubernetes.io/region: westeurope
topology.kubernetes.io/zone: westeurope-2

And on the second node, the labels that differ:

failure-domain.beta.kubernetes.io/region: westeurope
failure-domain.beta.kubernetes.io/zone: westeurope-1
kubernetes.io/hostname: aksscw1000002
topology.disk.csi.azure.com/zone: westeurope-1
topology.kubernetes.io/region: westeurope
topology.kubernetes.io/zone: westeurope-1
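Not from the thread itself, but a sketch of how such a label diff can be produced. It uses the two differing values above as sample data; in a live cluster the label lists would come from `kubectl get node <name> --show-labels`.

```shell
# Space-separated label samples for the two Windows nodes (values taken from
# this thread; in a live cluster, fetch them with kubectl instead).
node_a='kubernetes.io/hostname=aksscw1000001 topology.kubernetes.io/zone=westeurope-2'
node_b='kubernetes.io/hostname=aksscw1000002 topology.kubernetes.io/zone=westeurope-1'

# One label per line, sorted, so comm(1) can compare the two sets.
# ($node_a is deliberately unquoted so the shell splits it on spaces.)
printf '%s\n' $node_a | sort > /tmp/node_a.labels
printf '%s\n' $node_b | sort > /tmp/node_b.labels

# -3 suppresses labels common to both nodes, leaving only the differences.
comm -3 /tmp/node_a.labels /tmp/node_b.labels
```

Any label that shows up here and is used as a `topologyKey` (hostname, zone) is a domain boundary the scheduler will spread across.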

markdebruijne commented 2 years ago

Please note: Azure support ticket also pending https://github.com/Azure/AKS/issues/2862#issuecomment-1076385452


chefcook commented 2 years ago

We also see very similar behaviour to what @markdebruijne described. Hybrid environment: with Linux nodes spreading works consistently, with Windows nodes it does not. Often we don't know whether a correct spread is just a coincidence. In addition, a new node pool is not respected or not included in the topology spread at all.


ghost commented 2 years ago

@immuzz, @justindavies would you be able to assist?

Issue Details
**What happened**:

- AKS cluster with both a Linux `AKSUbuntu-1804gen2containerd-2022.03.02` and a Windows `AKSWindows-2019-17763.2686.220309` node pool. Node pools configured with all three availability zones usable in the `west-europe` region.
- AKS cluster level and node pools all running Kubernetes `1.21.9`.
- Pods (within a replica set with 2+ replicas) running on the Linux-backed node pool **are spread equally** across nodes (and thus availability zones), although no explicit `topologySpreadConstraints` are configured.
- Pods (within a replica set with 2+ replicas) running on the Windows-backed node pool **are NOT spread equally**, also not with explicit `topologySpreadConstraints` configured.

**What you expected to happen**:

- AKS to spread logically across availability zones and/or hostnames as per the documented default settings: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#internal-default-constraints
- AKS to respect the spread instruction explicitly configured at the deployment level (Windows pod).

**How to reproduce it (as minimally and precisely as possible)**:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-world
  labels:
    app: win-world
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-world
  template:
    metadata:
      labels:
        app: win-world
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: win-world
      containers:
      - name: win-world
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        ports:
        - containerPort: 80
      nodeSelector:
        kubernetes.io/os: windows
```

**Anything else we need to know?**:

**Environment**:

- Kubernetes version (use `kubectl version`):

```
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"37f338aa38e0427e127162afe462e2f4150f0ba3", GitTreeState:"clean", BuildDate:"2022-02-07T20:49:26Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
```

- Size of cluster (how many worker nodes are in the cluster?): 1x Linux node pool with 2 nodes, 1x Windows node pool with 2 nodes

```
kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2022.03.02
kubernetes.azure.com/os-sku=Ubuntu
kubernetes.azure.com/role=agent
kubernetes.azure.com/storageprofile=managed
kubernetes.azure.com/storagetier=Premium_LRS
kubernetes.io/arch=amd64
kubernetes.io/hostname=aks-webl1-29467775-vmss000002
kubernetes.io/os=linux
kubernetes.io/role=agent
node-role.kubernetes.io/agent=
node.kubernetes.io/instance-type=Standard_D2as_v4
storageprofile=managed
storagetier=Premium_LRS
topology.disk.csi.azure.com/zone=westeurope-2 (or -0, or -1)
topology.kubernetes.io/region=westeurope
topology.kubernetes.io/zone=westeurope-2 (or -0, or -1)
```

```
kubernetes.azure.com/node-image-version=AKSWindows-2019-17763.2686.220309
kubernetes.azure.com/role=agent
kubernetes.io/arch=amd64
kubernetes.io/hostname=aksscw1000001
kubernetes.io/os=windows
kubernetes.io/role=agent
node-role.kubernetes.io/agent=
node.kubernetes.io/instance-type=Standard_D8as_v4
node.kubernetes.io/windows-build=10.0.17763
topology.disk.csi.azure.com/zone=westeurope-2 (or -0, or -1)
topology.kubernetes.io/region=westeurope
topology.kubernetes.io/zone=westeurope-2 (or -0, or -1)
```

- General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): ASP.NET + .NET Core, HTTP
- Others: Recently upgraded from `v1.20.9` to `v1.21.9`, but the same issue occurred in the previous version as well.

Details about the node that remains empty are attached. When I drain node A, the pods are moved to node B. (screenshot)
Author: markdebruijne
Assignees: -
Labels: `bug`, `nodepools`, `windows`, `action-required`, `Needs Attention :wave:`
Milestone: -

junjiezhang1997 commented 2 years ago

I use Kubernetes 1.22.6 (the oldest version supported now) and also cannot reproduce the problem. I'm not sure whether the problem has been fixed by the Kubernetes version upgrade. @markdebruijne, could you reproduce the problem now? If so, we may need more details of your operations and the corresponding parameters. @chefcook, we would also appreciate more details about how to reproduce your problem. Thank you all! cc @AbelHu

Details of my test:

```
az group create --name $myResourceGroup --location westeurope
az aks create -g $myResourceGroup -n $myAKSCluster --generate-ssh-keys --windows-admin-username azureuser --windows-admin-password $myPassword --kubernetes-version 1.22.6 --network-plugin azure --vm-set-type VirtualMachineScaleSets --node-vm-size "Standard_D2as_v4" --node-count 2 --zones 1 2
az aks get-credentials -g $myResourceGroup -n $myAKSCluster --overwrite-existing
az aks nodepool add --resource-group $myResourceGroup --cluster-name $myAKSCluster --name win19 --os-type Windows --node-count 2 --node-vm-size Standard_D8as_v4 --zones 1 2
kubectl apply -f test-linux.yaml
kubectl apply -f test-windows.yaml
```

test-windows.yaml is the same as the example above; test-linux.yaml replaces all occurrences of windows with linux and uses nginx as the image. Pods spread equally both on the Linux nodes and the Windows nodes. (screenshot) When I drain 1 node, things also go well: pods are all moved to the other node. (screenshots)

markdebruijne commented 2 years ago

@junjiezhang1997 I don't have the ability (time) to attempt to reproduce it myself. But in general, we've upgraded multiple AKS clusters to v1.23.8 and still see unbalanced spread on Windows nodes, also without explicit (thus default) topologySpreadConstraints configuration, while using multiple nodes that are spread across availability zones.

So it seems broader than "replica spread"; more like a scheduling issue in general.

Thinking about specific characteristics: we do use Kustomize to deploy workloads, in which various deployments are deployed as part of a single deployment action. But we also had the issue with the single win-world deployment from this thread, of course.

P.S. At regular intervals we have Windows nodes with pods that remain in a Terminating state, in case that is related.

junjiezhang1997 commented 2 years ago

@markdebruijne Thanks for your response. I have several questions that need your further clarification:

  1. As you mentioned "without explicit (thus default) topologySpreadConstraints configuration", does that mean the configuration shown in the win-world example?
  2. Since you regularly have Windows nodes with pods that remain in a Terminating state, we wonder:
    1. Are the Windows pods in the Terminating state when the spread issue happens (such as for win-world)?
    2. Will this issue happen when you apply win-world in a new cluster?

cc @AbelHu

markdebruijne commented 2 years ago

@markdebruijne Thanks for your response. I have several questions that need your further clarification:

  1. As you mentioned "without explicit (thus default) topologySpreadConstraints configuration", does it mean the configuration shown in the win-world? I mean a configuration without topologySpreadConstraints specified at all. In that case it will fall back to the Kubernetes default (I would assume), and that seems to be something like maxSkew: 5 based on availability zone.

  2. Since you regularly have Windows nodes with pods that remain in a Terminating state, we wonder:

    1. Are the Windows pods in the Terminating state when the spread issue happens (such as for win-world)? Not known at this stage.
    2. Will this issue happen when you apply win-world in a new cluster? Not sure about this. We think the chance increases when we deploy multiple K8s deployments at once.
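For reference on the default mentioned in answer 1: when a pod specifies no constraints at all, the Kubernetes documentation (linked earlier in this thread) describes the scheduler's internal defaults as equivalent to the fragment below. Note that the documented `maxSkew: 5` applies to the hostname key, while the zone key uses `maxSkew: 3`, and both are soft (`ScheduleAnyway`), so some imbalance is always permitted.

```yaml
# Internal default spread constraints per the upstream Kubernetes docs;
# applied only when the pod defines no topologySpreadConstraints of its own.
topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```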

To summarize

junjiezhang1997 commented 2 years ago

I still cannot repro the issue when I deploy multiple (three in my test) K8s deployments at once. @markdebruijne, could you help provide a script with a high chance of reproducing this issue? cc @AbelHu

junjiezhang1997 commented 1 year ago

Closed due to no response for a long time.