atlassian / escalator

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes
Apache License 2.0

Scaling up stops working when max_nodes is reached #107

Closed: leogdiniz closed this issue 6 years ago

leogdiniz commented 6 years ago

If the cluster already has the maximum number of nodes and some of them are tainted, nothing happens when the scale-up threshold is reached: no node is untainted and, of course, no node is added through the cloud provider.

This bug also causes a second problem. If you have tainted nodes and need to scale up, the number of nodes untainted per run is limited to the difference between max_nodes and the current number of nodes. For example:

Nodes = 9
Max number = 10
Tainted = 5

If 3 more nodes are needed, only 1 node is untainted each time scale-up runs, instead of 3 at once.

Another example from a log:

time="2018-06-13T18:23:36Z" level=debug msg="**********[AUTOSCALER MAIN LOOP]**********"
time="2018-06-13T18:23:37Z" level=debug msg="**********[START NODEGROUP default]**********"
time="2018-06-13T18:23:37Z" level=info msg="pods total: 48" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="nodes remaining total: 18" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="cordoned nodes remaining total: 0" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="nodes remaining untainted: 12" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="nodes remaining tainted: 6" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="cpu: 110.43333333333334, memory: 82.31950436831336" nodegroup=default
time="2018-06-13T18:23:37Z" level=debug msg="Unlocking scale lock"
time="2018-06-13T18:23:37Z" level=debug msg="Delta: 5" nodegroup=default
time="2018-06-13T18:23:37Z" level=info msg="increasing nodes exceeds maximum (18). Clamping add amount to (0)"
time="2018-06-13T18:23:37Z" level=warning msg="Scale up delta is less than or equal to 0 after clamping: 0"
time="2018-06-13T18:23:37Z" level=debug msg="DeltaScaled: 0" nodegroup=default
time="2018-06-13T18:23:37Z" level=debug msg="Scaling took a total of 110.042288ms"