MartinWeindel opened this issue 2 years ago
One approach is to introduce backoff in the rate of pod eviction requests based on the amount of throttling for the request. The backoff can be capped so that we never wait more than 1 minute between two consecutive eviction requests for a pod. This way the size of the cluster is not a factor, which makes sense because the size of the cluster and the number of pods in it are not necessarily related.
Currently we lack data that could help us find the best value for the maximum interval between two requests, so keeping it at 1 minute is reasonable for now.
After discussion, we found that the live issue occurred because the PDBs for the pods on the draining node were misconfigured. Currently, in the misconfigured case, we don't `attemptEvict` again; instead we return an error that propagates up to `drainNode()`, and we do a `ShortRetry`. This leads to `drainNode()` being called very frequently and hence to high load. Generally we end up doing `attemptEvict` until `drainTimeout` is reached for a few pods on the node, and we don't retry that often, but this was a corner case.
We would also want to handle the case of many PDBs for a single pod, as that is also a kind of misconfiguration.
Proposed solution: for a misconfigured PDB and for many PDBs for a single pod, we should return with a medium retry (i.e. 3 min) instead of the `podEvictionRetryInterval` (20 sec) we use right now.
Grooming Notes:-
We need to add a metric showing the number of drain failures for a machine and the reason code. This will help us collect data about the reasons for failure in draining a machine. (Not a prereq for the solution)
a.) Use rate limiting in `reconcileClusterMachineKey` in case of errors and use `queue.AddRateLimited(key)` for enqueuing the machine in case of failures. The settings of the rate limiter can also be adjusted for the backoff.
b.) Introduce exponential backoff in drain for the retry: use `AddAfter` with a time period of up to the drain timeout in case of an error in draining the node. We will adjust the rate limiter to start with the lower limit of `ShortRetry` (5 sec). We will also increase the exponent for the backoff.
Final Decision:- We go with option 4.
The pod list call for eviction is done every 5 seconds in case of PDB issues.
Could it help to start watching pods instead of listing pods repeatedly? Could server side filtering (label or field selectors) be applied in addition?
Repeated list requests are less efficient (they might cause significantly more network traffic, especially on the network link between the API server and ETCD) than a long-running watch request combined with server-side filtering.
How to categorize this issue?
/area robustness /kind bug /priority 3
What happened:
On rolling nodes of a large cluster with about 100 nodes, the max surge was limited to 10%. The termination of a node was waiting for workload to end, so the rolling took more than 12 hours. This means that during this time there were about 10 machines in state `Terminating`. The machine controller manager tried to evict pods on these nodes with high frequency and produced a lot of load on the kube-apiservers and the ETCD.
In this image you see the client out traffic of the ETCD. Between 15:20 and 15:30 we scaled down the machine-controller-manager and the traffic was reduced immediately.
In the logs of the machine-controller-manager we found lots of throttling messages. Here I show only the eviction call for a single pod (name redacted):
The eviction request is repeated every 5 to 10 seconds.
What you expected to happen:
The machine controller manager should reduce the frequency of pod eviction calls if the termination of a machine takes a long time.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Internal reference: see live issue #1570
Environment:
- `kubectl version`:

/cc @kon-angelo