atlassian / escalator

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes
Apache License 2.0

escalator couldn't scale up with pending pods when daemonset pods request a significant proportion of node resources #154

Closed patrickshan closed 5 years ago

patrickshan commented 5 years ago

Because of the way escalator calculates its resource usage percentage, it can't scale up a node group when daemonset pods on the nodes request a significant proportion of node resources. The node group here has only one node, node A, which has the following allocatable resources:

Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         1
 ephemeral-storage:           934046556662
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      3062412Ki
 pods:                        60

Pods scheduled on node A:

  Namespace                  Name                    CPU Requests  CPU Limits    Memory Requests  Memory Limits  AGE
  ---------                  ----                    ------------  ----------    ---------------  -------------  ---
  default                    test-f57fd98bf-5xlg4    200m (20%)    0 (0%)        100Mi (3%)       0 (0%)         19m
  default                    test-f57fd98bf-bxnfg    200m (20%)    0 (0%)        100Mi (3%)       0 (0%)         57m
  kube-system                canal-node-mtxg6        250m (25%)    2500m (250%)  0 (0%)           0 (0%)         45m
  kube-system                kube-proxy-nvgr5        200m (20%)    200m (20%)    256Mi (8%)       256Mi (8%)     45m

There are 3 replicas in the test deployment; the first two have been scheduled on node A while the last one is stuck in the Pending state:

$ kubectl get pods -n default test-f57fd98bf-kp2mk
NAME                           READY   STATUS    RESTARTS   AGE
test-f57fd98bf-kp2mk           0/1     Pending   0          37m

And escalator doesn't trigger any scale-up in this case:

DEBU[1563] **********[AUTOSCALER MAIN LOOP]**********
DEBU[1564] **********[START NODEGROUP default]**********
INFO[1564] pods total: 3                                 nodegroup=default
INFO[1564] nodes remaining total: 1                      nodegroup=default
INFO[1564] cordoned nodes remaining total: 0             nodegroup=default
INFO[1564] nodes remaining untainted: 1                  nodegroup=default
INFO[1564] nodes remaining tainted: 0                    nodegroup=default
INFO[1564] Minimum Node: 1                               nodegroup=default
INFO[1564] Maximum Node: 6                               nodegroup=default
INFO[1564] cpu: 60, memory: 10.03130865474665            nodegroup=default
DEBU[1564] Delta: 0                                      nodegroup=default
INFO[1564] No need to scale                              nodegroup=default
INFO[1564] Reaper: There were 0 empty nodes deleted this round  nodegroup=default
DEBU[1564] DeltaScaled: 0                                nodegroup=default
DEBU[1564] Scaling took a total of 711.08356ms
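
For reference, those usage figures can be reproduced from the requests above:

cpu:    3 x 200m  = 600m requested    / 1000m allocatable     = 60%
memory: 3 x 100Mi = 307200Ki requested / 3062412Ki allocatable ≈ 10.03%

Both stay below the scale-up threshold, so Delta remains 0 even though node A cannot fit the pending pod's 200m request (850m is already requested on it).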
patrickshan commented 5 years ago

This is caused by the usage percentage calculation algorithm. Rather than counting the daemonset and static pod requests and removing their share from the node capacity, it currently divides the total pod requests excluding daemonsets/static pods by the total node allocatable resources. This makes the calculated usage percentage smaller than the real usage percentage.
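
As a rough illustration, here is a sketch using the CPU numbers from this issue (illustrative only, not escalator's actual code) of the gap between the reported percentage and the usage of the capacity actually left for regular pods:

package main

import "fmt"

func main() {
    // Numbers taken from the node and pod listings above.
    podRequests := 3 * 200.0        // CPU requested by the test pods, in millicores (2 running + 1 pending)
    daemonRequests := 250.0 + 200.0 // CPU requested by the canal and kube-proxy daemonset pods, in millicores
    allocatable := 1000.0           // node allocatable CPU, in millicores

    // What escalator reports today: daemonset/static pod requests are left out
    // of the numerator, but their share is not removed from the denominator.
    reported := podRequests / allocatable * 100
    fmt.Printf("reported usage: %.0f%%\n", reported) // 60%

    // Usage relative to the capacity actually left for regular pods.
    effective := podRequests / (allocatable - daemonRequests) * 100
    fmt.Printf("usage of remaining capacity: %.0f%%\n", effective) // ~109%
}

Because the daemonset share of the node is never subtracted, the reported percentage understates how full the node group really is.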

One way to solve this issue without changing escalator is to tune your node group config parameters, especially scale_up_threshold_percent, taint_upper_capacity_threshold_percent and taint_lower_capacity_threshold_percent.
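
With the numbers from this issue, for example, the daemonset pods request 450m of the node's 1000m allocatable CPU, so only about 550m is really available to regular pods; since escalator still divides by the full 1000m, a node group that is effectively full reports only around 55% usage. Setting scale_up_threshold_percent at or below roughly 55 (and adjusting the taint thresholds to match) would make escalator scale up before that remaining capacity runs out. The right values depend on your node size and daemonset requests, so treat these figures as an illustration rather than a recommendation.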

patrickshan commented 5 years ago

Updated the documentation in this PR: https://github.com/atlassian/escalator/pull/156 . Closing this issue for now as it can be addressed by tuning the escalator config.