Azure / AKS

Azure Kubernetes Service

[BUG] Scaling down with Pod Topology Spread Constraints #4201

Open GerardLarwa opened 1 month ago

GerardLarwa commented 1 month ago

Describe the bug
AKS Cluster Autoscaler (CA) scale-down does not work properly with Pod Topology Spread Constraints enabled when all available nodes are already in use by a given deployment.

When a node is eligible for scale-down (due to low allocated resources), it should be excluded from the Pod Topology Spread Constraints calculations while the CA (simulating kube-scheduler) checks whether its pods can be rescheduled on other nodes.

Currently, the CA does not scale down and logs the following message: 'nodeA is not suitable for removal: can reschedule only 0 out of 1 pods'

To Reproduce
Steps to reproduce the behavior:

  1. Let's assume there are 3 nodes.
  2. Let's assume there is a deployment with 3 pods scheduled as below:

     | Pod   | Node  |
     |-------|-------|
     | xyz-1 | nodeA |
     | xyz-2 | nodeB |
     | xyz-3 | nodeC |

  3. The above deployment has Pod Topology Spread Constraints enabled:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 2
    labelSelector:
      matchExpressions:
        - key: app.kubernetes.io/instance
          operator: In
          values:
            - xyz
    matchLabelKeys:
      - pod-template-hash
    nodeAffinityPolicy: Honor
    nodeTaintsPolicy: Honor
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
```
  4. nodeC is a candidate for scale-down because its allocated resources are below the CA threshold. However, the scale-down never happens.
If I'm not mistaken, the autoscaler checks whether the pods can be rescheduled on other nodes before a scale-down event. This means pod xyz-3 would need to be schedulable on nodeA or nodeB. However, this seems to violate the Pod Topology Spread Constraints maxSkew, because nodeC is not skipped in the calculation. The distribution of pods after scale-down could look like this:

| Node  | Number of pods |
|-------|----------------|
| nodeA | 1              |
| nodeB | 2              |
| nodeC | 0              |

The skew (=2) for the above spread is greater than maxSkew (=1), so the node cannot be scaled down.
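To make the arithmetic concrete, here is a minimal sketch of the skew check under the `kubernetes.io/hostname` topology key (the `skew` helper is a hypothetical illustration, not CA or kube-scheduler source code): skew is the maximum pod count across domains minus the minimum, and must not exceed maxSkew.

```python
def skew(pods_per_node):
    """Skew across topology domains = max pod count - min pod count.

    With topologyKey kubernetes.io/hostname, each node is one domain.
    """
    counts = pods_per_node.values()
    return max(counts) - min(counts)

MAX_SKEW = 1

# nodeC still counted as an (empty) domain after xyz-3 is moved to nodeB:
with_nodeC = {"nodeA": 1, "nodeB": 2, "nodeC": 0}
print(skew(with_nodeC) <= MAX_SKEW)   # False: skew=2 blocks the scale-down

# If the node being removed were excluded from the domain set:
without_nodeC = {"nodeA": 1, "nodeB": 2}
print(skew(without_nodeC) <= MAX_SKEW)  # True: skew=1 satisfies the constraint
```

This is exactly the asymmetry the report describes: the same pod placement passes or fails the maxSkew check depending on whether the drained node is still counted as a domain.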

Autoscaler logs: `node nodeC is not suitable for removal: can reschedule only 0 out of 1 pods`

Autoscaling (in this particular case) works fine with Pod Topology Spread Constraints disabled.

Expected behavior
nodeC is skipped in the CA's kube-scheduler-simulated calculations, allowing the scale-down. In this case, the distribution of pods after the scaling operation could look like below; the skew is then equal to 1.

| Node    | Number of pods |
|---------|----------------|
| nodeA   | 1              |
| nodeB   | 2              |
| ~nodeC~ | ~0~            |

Environment:

kevinkrp93 commented 17 hours ago

@GerardLarwa Ack. Do you have an existing support ticket for this?

GerardLarwa commented 6 hours ago

> @GerardLarwa Ack. Do you have an existing support ticket for this?

No, I don't.