What happened:
Currently the retry deadline of 1min is not being respected, due to which the CA's MCM implementation never gives up and keeps trying to scale the machineDeployment as requested by the CA's core logic.
This leads to the CA never removing the `ToBeDeletedTaint` from the nodes, so they are considered upcoming nodes due to an upstream bug.
As a result, the pods stay in `Pending` state.
What you expected to happen:
The CA's MCM implementation should respect `maxRetryTimeout`.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
This occurred because of a regression introduced in #160, where `retryDeadline` is updated on every machineDeployment update failure, leading to an effectively infinite deadline. Find the code here. Also, since the machineDeployment is never re-fetched, its update always fails on the apiserver.
Currently #160 has been patched and released up to rel-v1.21, so these patch branches need to be updated as well.
The following should be part of the PR that fixes this issue:
- Remove the `ToBeDeleted` taint from all nodes: PR from upstream (https://github.com/kubernetes/autoscaler/pull/5200)
- Respect `maxRetryTimeout` and always re-fetch the machineDeployment

Environment: