What happened:
Currently the retry deadline of 1min is not being respected, due to which the CA's MCM implementation never gives up and keeps trying to scale the machineDeployment as requested by the CA's core logic.
This leads to the CA never removing the `ToBeDeletedTaint` from the nodes, so they are considered upcoming nodes due to an upstream bug.
As a result, the pods stay in `Pending` state.
What you expected to happen:
The CA's MCM implementation should respect `maxRetryTimeout`.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
This occurred because of a regression introduced in #160, where `retryDeadline` is updated on every machineDeployment update failure, leading to an effectively infinite deadline. Find the code here. Also, since the machineDeployment is never re-fetched, its update always fails on the apiserver.
Currently #160 has been patched and released up to rel-v1.21, so these patch branches need to be updated as well.
The following should be part of the PR that fixes this issue:
- Remove the `ToBeDeleted` taint from all nodes: PR from upstream (https://github.com/kubernetes/autoscaler/pull/5200)
- Respect `maxRetryTimeout` and always re-fetch the machineDeployment

Environment: