Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Cluster-Autoscaler scales randomly #936

Closed eaimounir closed 5 years ago

eaimounir commented 5 years ago

What happened: One of our clusters in production started scaling up and down more or less randomly, creating and deleting containers to the point where the platform was unusable. The only action taken by our DevOps team was to reduce a deployment's replica count from 4 to 2. I tried to set the number of nodes manually through portal.azure.com, but all of my changes were overridden by the autoscaler. At some point the autoscaler pod itself could not start for lack of resources, since the cluster had scaled down to 2 or 3 nodes when a few minutes earlier we had 8. I deleted the autoscaler deployment and all of its accounts in the cluster, yet the autoscaler keeps scaling. I tried to set the number of nodes manually, but it keeps changing them, scaling down slowly until it eventually reaches 3 (the default and minimum number of nodes). I can see in the activity logs that the service principal used to create the autoscaler is triggering the scaling.
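For reference, a quick sketch of how to confirm whether any autoscaler resources survived the deletion (the `kube-system` namespace and the "autoscaler" naming follow the upstream example manifests and may differ in a given install):

```sh
# Look for any cluster-autoscaler remnants in kube-system
# (resource names/labels follow the upstream manifests and may differ)
kubectl -n kube-system get deployments,replicasets,pods,serviceaccounts | grep -i autoscaler

# Double-check nothing autoscaler-related is running in another namespace
kubectl get pods --all-namespaces | grep -i autoscaler
```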

What you expected to happen: I expected the cluster to scale down by 1 or 2 nodes, since the application was only using about half the resources of a node.

How to reproduce it (as minimally and precisely as possible): Even though the autoscaler is no longer installed in the cluster, the cluster keeps being resized. Every time I set the node count, it is replaced by whatever the autoscaler decides.
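A minimal sketch of setting the node count from the CLI instead of the portal and then checking whether it gets reverted (resource group and cluster names are placeholders):

```sh
# Set the node count explicitly (placeholder names)
az aks scale --resource-group my-rg --name my-aks-cluster --node-count 6

# Check the current node count a few minutes later to see whether it was overridden
az aks show --resource-group my-rg --name my-aks-cluster \
  --query "agentPoolProfiles[].count" -o tsv
kubectl get nodes
```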

Anything else we need to know?: I upgraded the Kubernetes version. The cluster autoscaler was installed following the ACS instructions at https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md. I'm wondering whether the ACS-based cluster-autoscaler is a good fit for the newer Kubernetes versions. I did not enable the preview features described at https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler#disable-the-cluster-autoscaler, and I haven't needed them so far.
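One way to rule out the AKS-managed autoscaler preview acting on the cluster (as opposed to the self-hosted deployment) is to inspect the agent pool profile; a sketch, assuming a CLI/API version that exposes the `enableAutoScaling` field and using placeholder names:

```sh
# Shows whether autoscaling is enabled on the AKS side for each agent pool
# (field availability depends on the CLI/API version)
az aks show --resource-group my-rg --name my-aks-cluster \
  --query "agentPoolProfiles[].{pool:name, enableAutoScaling:enableAutoScaling, count:count}" -o table
```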

Environment:

feiskyer commented 5 years ago

@eaimounir Do you still have the CA logs from when the issue was happening? And did you use the CA version that matches your cluster version?
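For anyone hitting something similar, capturing the CA logs and its image tag before deleting anything looks roughly like this (a sketch, assuming the upstream deployment name `cluster-autoscaler` in `kube-system`):

```sh
# Record which image/version the autoscaler is actually running
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Capture its logs for the period when the issue happened
kubectl -n kube-system logs deployment/cluster-autoscaler --timestamps > autoscaler.log
```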

eaimounir commented 5 years ago

@feiskyer, I didn't update the version of the cluster-autoscaler when upgrading the cluster; I didn't know that I had to! Should I install it again to match version 1.13.5? I would like to emphasize that I deleted the deployment, yet the autoscaling still acts against any manual change to the cluster. Even without a cluster-autoscaler pod, when I set the node count to, for example, 6, it comes back to 3! Please find the cluster-autoscaler's logs for the last 10 days attached: autoscaler_logs.xlsx

feiskyer commented 5 years ago

According to the logs, CA 1.2.2 is being used. For Kubernetes v1.13.5, please use its matching version: image gcr.io/google-containers/cluster-autoscaler:v1.13.4.

@eaimounir Could you upgrade the CA version?
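A minimal sketch of bumping only the image, assuming the deployment and container are both named `cluster-autoscaler` as in the upstream example manifests:

```sh
# Point the existing deployment at the version that matches Kubernetes 1.13.x
kubectl -n kube-system set image deployment/cluster-autoscaler \
  cluster-autoscaler=gcr.io/google-containers/cluster-autoscaler:v1.13.4

# Watch the rollout and confirm the new pod comes up
kubectl -n kube-system rollout status deployment/cluster-autoscaler
```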

eaimounir commented 5 years ago

Using the image gcr.io/google-containers/cluster-autoscaler:v1.13.4 to match Kubernetes v1.13.5 didn't help! The cluster remains at 3 nodes and several pods are Pending, waiting for more CPU. I can see attempts from the autoscaler, but it is not working as it should. I've attached the logs from last night: activity_log_May_1.xlsx. I created a new cluster with a matching version and it is working there. I'm going to delete the old cluster, so I won't be able to provide more information on the problem. As far as I'm concerned, we can close the issue unless you want to spend more time investigating the root cause. @feiskyer Thanks for the help
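When pods sit in Pending like this, the scheduler events usually say whether it is an "Insufficient cpu" problem and whether the autoscaler reacted; a sketch of how to check (pod and namespace names are placeholders):

```sh
# List pods stuck in Pending across the cluster
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Inspect one of them; the Events section typically shows "Insufficient cpu"
# and, with a working autoscaler, a "TriggeredScaleUp" event
kubectl describe pod <pending-pod-name> -n <namespace>
```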

feiskyer commented 5 years ago

From the logs, updating the node count for AKS has failed, but the error message doesn't make clear why. If the old AKS cluster is still there, could you check its activity logs for the update error message?
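Pulling the failed operations out of the activity log with the CLI looks roughly like this (placeholder resource group; the exact filter flags available depend on the CLI version):

```sh
# List recent failed operations against the cluster's resource group
az monitor activity-log list --resource-group my-rg --status Failed --offset 3d -o table
```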

Since the new cluster is running OK now, I think it's fine to close the issue.