Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Random autoscale error #4365

Open barkep opened 2 months ago

barkep commented 2 months ago

Describe the bug: The cluster autoscaler stops working at random times and goes into the Initializing state. I fix it by setting the scale method to manual and then back to autoscale.
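For anyone who wants to script this workaround, here is a minimal sketch. It assumes the manual/autoscale toggle corresponds to disabling and then re-enabling the cluster autoscaler on the node pool with 'az aks nodepool update'; the issue does not say how the toggle was performed. The function name, resource group, cluster, and pool names are hypothetical placeholders. The function only builds the command lines and does not run them.

```python
from typing import List


def autoscaler_toggle_commands(
    resource_group: str,
    cluster: str,
    pool: str,
    min_count: int,
    max_count: int,
) -> List[List[str]]:
    """Build the two az CLI invocations for the manual -> autoscale toggle.

    Assumption: switching the scale method to manual and back is equivalent
    to disabling and re-enabling the cluster autoscaler on the node pool.
    """
    base = [
        "az", "aks", "nodepool", "update",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool,
    ]
    # Step 1: switch the pool to manual scaling.
    disable = base + ["--disable-cluster-autoscaler"]
    # Step 2: switch back to autoscale with the original bounds.
    enable = base + [
        "--enable-cluster-autoscaler",
        "--min-count", str(min_count),
        "--max-count", str(max_count),
    ]
    return [disable, enable]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order. Keeping the commands as data makes it easy to print and review them before anything touches the cluster.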

To Reproduce: The error is random; I can't reproduce it on demand.

Output of 'kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml' when the error occurs:

  apiVersion: v1
  data:
    status: |-
      Cluster-autoscaler status at 2024-06-26 05:17:03.13712722 +0000 UTC:
      Initializing
  kind: ConfigMap
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/last-updated: 2024-06-26 05:17:03.13712722 +0000
        UTC
    creationTimestamp: "2024-06-26T05:17:03Z"
    name: cluster-autoscaler-status
    namespace: kube-system
    resourceVersion: "****"
    uid: ****
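The stuck state above can be detected from the status text alone. The following is a rough sketch, assuming the 'status' field always has the layout shown in this issue (a timestamp header line followed either by 'Initializing' or by a 'Cluster-wide:' section with a 'Health:' line). The function name is made up for illustration.

```python
import re


def autoscaler_state(status_text: str) -> str:
    """Classify the text of the cluster-autoscaler-status 'status' field.

    Returns "Initializing" when the autoscaler has not produced a full
    report, the cluster-wide health word (e.g. "Healthy") when it has,
    and "Unknown" when neither pattern is found.
    """
    lines = [ln.strip() for ln in status_text.splitlines() if ln.strip()]
    # The first non-empty line is the "Cluster-autoscaler status at ..." header.
    if len(lines) >= 2 and lines[1] == "Initializing":
        return "Initializing"
    # The first "Health:" line in a full report is the cluster-wide one.
    match = re.search(r"^\s*Health:\s+(\w+)", status_text, re.MULTILINE)
    return match.group(1) if match else "Unknown"
```

A periodic check could feed it the output of 'kubectl get configmap -n kube-system cluster-autoscaler-status -o jsonpath={.data.status}' and raise an alert if the result stays "Initializing" for longer than expected.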

After I change the scale method to manual and back to autoscale, it works fine:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2024-06-26 05:47:27.226243208 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=2 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp:     NoActivity (ready=2 registered=2)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown:   CandidatesPresent (candidates=1)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:42:35.843690766 +0000 UTC m=+312.090831134

    NodeGroups:
      Name:        aks-poold4sv3-****-vmss
      Health:      Healthy (ready=1 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=3))
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187

      Name:        aks-poold8lsv5-****-vmss
      Health:      Healthy (ready=1 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=10))
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:37:34.758946819 +0000 UTC m=+11.006087187
      ScaleDown:   CandidatesPresent (candidates=1)
                   LastProbeTime:      2024-06-26 05:47:26.512785502 +0000 UTC m=+602.759926070
                   LastTransitionTime: 2024-06-26 05:42:35.843690766 +0000 UTC m=+312.090831134

kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2024-06-26 05:47:27.226243208 +0000
      UTC
  creationTimestamp: "2024-06-26T05:37:22Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "***"
  uid: ****
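To compare node groups across reports like the one above, the per-pool lines can be pulled out with a regular expression. This is only a sketch, assuming the 'Name:' and 'Health:' lines keep the format shown here; the function and field names are illustrative and not part of any autoscaler API.

```python
import re
from typing import Dict, List, Union

# Matches a "Name:" line immediately followed by its "Health:" line.
NODE_GROUP_RE = re.compile(
    r"Name:\s+(?P<name>\S+)\s+"
    r"Health:\s+(?P<health>\w+)\s+\(.*?cloudProviderTarget=(?P<target>\d+)\s+"
    r"\(minSize=(?P<min>\d+), maxSize=(?P<max>\d+)\)\)"
)


def parse_node_groups(status_text: str) -> List[Dict[str, Union[str, int]]]:
    """Extract name, health, target size, and size bounds for each node group."""
    return [
        {
            "name": m["name"],
            "health": m["health"],
            "target": int(m["target"]),
            "min": int(m["min"]),
            "max": int(m["max"]),
        }
        for m in NODE_GROUP_RE.finditer(status_text)
    ]
```

An empty result on a report that should contain node groups would be consistent with the Initializing state shown earlier, where no NodeGroups section is present.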

Environment (please complete the following information):

microsoft-github-policy-service[bot] commented 1 month ago

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

kevinkrp93 commented 1 month ago

Can you please file a support ticket the next time this happens and update this issue?