gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Reduce load on kube-apiserver / ETCD on pod eviction for terminating machines #703

Open MartinWeindel opened 2 years ago

MartinWeindel commented 2 years ago

How to categorize this issue?

/area robustness /kind bug /priority 3

What happened:

While rolling the nodes of a large cluster with about 100 nodes, max surge was limited to 10%. Termination of a node waited for the workload to finish, so the rolling took more than 12 hours. During this time there were about 10 machines in state Terminating. The machine-controller-manager tried to evict pods on these nodes at high frequency and produced a lot of load on the kube-apiservers and the ETCD.

[Image: ETCD client egress traffic during the rolling update]

The image shows the client egress traffic of the ETCD. Between 15:20 and 15:30 we scaled down the machine-controller-manager, and the traffic dropped immediately.

In the logs of the machine-controller-manager we found many throttling messages. Here I show only the eviction calls for a single pod (name redacted):

I0412 15:29:36.845890       1 request.go:591] Throttling request took 542.507996ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:29:48.512932       1 request.go:591] Throttling request took 290.637231ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:29:59.562728       1 request.go:591] Throttling request took 338.986248ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:09.012570       1 request.go:591] Throttling request took 419.676173ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:19.270585       1 request.go:591] Throttling request took 355.943286ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:40.369073       1 request.go:591] Throttling request took 251.448724ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:51.019605       1 request.go:591] Throttling request took 329.512125ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:00.169268       1 request.go:591] Throttling request took 596.340432ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:05.519530       1 request.go:591] Throttling request took 181.186876ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:14.868701       1 request.go:591] Throttling request took 545.271603ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:23.119681       1 request.go:591] Throttling request took 388.854376ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:31.669300       1 request.go:591] Throttling request took 246.905561ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:40.369208       1 request.go:591] Throttling request took 374.552037ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:44.791522       1 request.go:591] Throttling request took 93.770177ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:59.091661       1 request.go:591] Throttling request took 495.377239ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction

The eviction request is repeated every 5 to 10 seconds.

What you expected to happen:

The machine controller manager should reduce the frequency of pod eviction calls if the termination of a machine takes a long time.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Internal reference: see live issue #1570

Environment:

/cc @kon-angelo

rishabh-11 commented 2 years ago

One approach is to introduce backoff in the rate of pod eviction requests, based on the amount of throttling for the request. The backoff can be capped so that we do not exceed one minute between two consecutive requests for a pod. This way the size of the cluster is not a factor we need to consider, which makes sense because the size of the cluster and the number of its pods are not necessarily related. Currently we lack data that could help us find the best value for the maximum interval between two requests, so capping it at one minute is a reasonable start.
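
For illustration, a minimal sketch of such a capped per-pod backoff. The names (evictionBackoff, evictionBaseInterval) and the 5s/1min constants are hypothetical and not taken from the MCM code base; this only shows the doubling-with-cap idea described above.

// Illustrative only: a per-pod backoff tracker that doubles the wait between
// eviction attempts and caps it at one minute.
package drain

import (
	"sync"
	"time"
)

const (
	evictionBaseInterval = 5 * time.Second // current retry interval (assumption)
	evictionMaxInterval  = 1 * time.Minute // proposed upper bound
)

type evictionBackoff struct {
	mu   sync.Mutex
	next map[string]time.Duration // pod key -> wait before the next attempt
}

func newEvictionBackoff() *evictionBackoff {
	return &evictionBackoff{next: map[string]time.Duration{}}
}

// Wait returns how long to wait before the next eviction attempt for podKey
// and doubles the stored interval, never exceeding evictionMaxInterval.
func (b *evictionBackoff) Wait(podKey string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	d, ok := b.next[podKey]
	if !ok {
		d = evictionBaseInterval
	}
	next := d * 2
	if next > evictionMaxInterval {
		next = evictionMaxInterval
	}
	b.next[podKey] = next
	return d
}

// Reset clears the backoff once the pod has been evicted successfully.
func (b *evictionBackoff) Reset(podKey string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.next, podKey)
}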

himanshu-kun commented 1 year ago

After discussion, we found out that the live issue was occurring because the PDBs for the pods on the draining node were misconfigured. Currently, in the misconfigured case, we do not call attemptEvict again but return an error that propagates up to drainNode(), and we do a ShortRetry. This leads to drainNode() being called very frequently and thus to high load. Normally we keep calling attemptEvict until drainTimeout is reached for a few pods on the node and do not retry that often, but this was a corner case.

We would also want to deal with the case of many PDBs matching a single pod, as that is also a kind of misconfiguration.
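
For illustration, a sketch of how that misconfiguration could be detected up front: the eviction API rejects requests for pods covered by more than one PDB, so a drain routine could check for this before attempting an eviction. The helper name matchingPDBs is hypothetical; the MCM code may structure this differently.

// Illustrative only: detect the "more than one PDB matches a pod" case.
package drain

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// matchingPDBs returns the PDBs in the pod's namespace whose selector matches
// the pod's labels; more than one match means eviction requests will be rejected.
func matchingPDBs(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) ([]string, error) {
	pdbs, err := client.PolicyV1().PodDisruptionBudgets(pod.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var matched []string
	for _, pdb := range pdbs.Items {
		sel, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			continue // skip unparsable selectors
		}
		if sel.Matches(labels.Set(pod.Labels)) {
			matched = append(matched, pdb.Name)
		}
	}
	if len(matched) > 1 {
		return matched, fmt.Errorf("pod %s/%s is covered by %d PDBs: %v", pod.Namespace, pod.Name, len(matched), matched)
	}
	return matched, nil
}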

Proposed solution:

rishabh-11 commented 6 months ago

Grooming Notes:-

We need to add a metric showing the number of drain failures for a machine and the reason code. This will help us collect data about the reasons for failure in draining a machine. (Not a prereq for the solution)

  1. As seen in live issue #4557, the pod list call for eviction https://github.com/gardener/machine-controller-manager/blob/61ceccd495056e8fc885207610f5f4b6f7652277/pkg/util/provider/drain/drain.go#L317 is repeated every 5 seconds in case of PDB issues.
  2. We have two approaches: a) remove the enqueuing of the machine in reconcileClusterMachineKey in case of errors and use queue.AddRateLimited(key) to enqueue the machine on failures; the rate-limiter settings can also be adjusted for backoff. b) Introduce exponential backoff in the drain retry.
  3. Approach 2-a can affect the machine creation timeout, so we have two options: a) create a custom error (containing the underlying error and a field to distinguish whether to retry with exponential backoff or not); b) use a delayed queue that respects the timeouts we have in MCM as well as rate limiting.
  4. We can do 3-a and requeue the machine with AddAfter and a period of the drain timeout in case of an error while draining the node. We will adjust the rate limiter to start with the lower limit of ShortRetry (5 sec) and increase the backoff exponent. A rough sketch follows below.

Final Decision:- We go with option 4.
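
For illustration, a rough sketch of how option 4 could be wired with the client-go workqueue: an exponential per-item rate limiter starting at the ShortRetry lower bound, AddRateLimited for ordinary failures, and AddAfter with the drain timeout for drain failures. The constants and the errDrainFailed marker are assumptions for this sketch, not the actual MCM types.

// Illustrative only: rate-limited requeueing in the style of option 4.
package controller

import (
	"errors"
	"time"

	"k8s.io/client-go/util/workqueue"
)

const (
	shortRetry   = 5 * time.Second // lower bound for the rate limiter
	maxBackoff   = 5 * time.Minute // cap for the exponential backoff (illustrative)
	drainTimeout = 2 * time.Hour   // example drain timeout (illustrative)
)

// errDrainFailed marks errors that should be retried only after the drain timeout.
var errDrainFailed = errors.New("drain failed")

func newMachineQueue() workqueue.RateLimitingInterface {
	// Exponential per-item backoff: 5s, 10s, 20s, ... capped at maxBackoff.
	rl := workqueue.NewItemExponentialFailureRateLimiter(shortRetry, maxBackoff)
	return workqueue.NewNamedRateLimitingQueue(rl, "machine")
}

// requeue decides how to put the machine key back on the queue after an error.
func requeue(q workqueue.RateLimitingInterface, key string, err error) {
	switch {
	case err == nil:
		q.Forget(key) // reset the per-item backoff on success
	case errors.Is(err, errDrainFailed):
		// Option 4: wait out the drain timeout instead of hammering drainNode().
		q.AddAfter(key, drainTimeout)
	default:
		q.AddRateLimited(key) // exponential backoff for everything else
	}
}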

istvanballok commented 6 months ago

the pod list call for eviction is done after every 5 sec in case of PDB issues.

Could it help to start watching pods instead of listing pods repeatedly? Could server side filtering (label or field selectors) be applied in addition?

Repeated list requests are less efficient than a long-running watch request combined with server-side filtering; they can cause significantly more network traffic, especially on the network link between the API server and ETCD.
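
For illustration, a sketch of this suggestion with client-go: a shared informer restricted to the pods on the node being drained via a server-side field selector, so the filtering happens in the API server rather than in repeated client-side list calls. This is an assumption-laden sketch, not an MCM change.

// Illustrative only: watch pods on one node instead of listing them repeatedly.
package drain

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodePods starts a shared informer that only receives pods scheduled on
// nodeName; the field selector is evaluated on the API server side.
func watchNodePods(client kubernetes.Interface, nodeName string, stopCh <-chan struct{}) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Minute, // resync period
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				fmt.Printf("pod %s/%s left node %s\n", pod.Namespace, pod.Name, nodeName)
			}
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return podInformer
}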