gardener / autoscaler

Customised fork of cluster-autoscaler to support machine-controller-manager
Apache License 2.0

[Regression] MaxRetryTimeout should be respected while scaling machineDeployment #213

Closed himanshu-kun closed 1 year ago

himanshu-kun commented 1 year ago

What happened: Currently the retry deadline of 1 min is not respected, so the CA's mcm implementation never gives up and keeps trying to scale the machineDeployment as requested by the CA's core logic. As a result, CA never removes the ToBeDeletedTaint from the nodes, and they are considered upcoming nodes due to an upstream bug.

This means the pods stay in the Pending state.

What you expected to happen: The CA mcm implementation should respect MaxRetryTimeout.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

This occurred because of a regression introduced in #160, where retryDeadline is updated on every machineDeployment update failure, leading to an infinite deadline. Find the code here. Also, since the machineDeployment is never re-fetched, its update always fails on the apiserver.
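To illustrate the two points above, here is a minimal sketch of a retry loop that computes its deadline once and re-fetches the object before every update attempt. It is not the actual autoscaler or machine-controller-manager code: `deployment`, `client`, `fakeClient`, `scaleWithDeadline`, and `demo` are hypothetical names made up for this example, and the in-memory fake only imitates an apiserver rejecting updates made with a stale resource version.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// deployment is a hypothetical stand-in for a machineDeployment object.
type deployment struct {
	ResourceVersion int
	Replicas        int
}

// client is a hypothetical interface, not the real MCM client.
type client interface {
	Get(ctx context.Context) (deployment, error)
	Update(ctx context.Context, d deployment) error
}

var errConflict = errors.New("conflict: stale resourceVersion")

// scaleWithDeadline retries until the update succeeds or maxRetryTimeout elapses.
func scaleWithDeadline(ctx context.Context, c client, replicas int, maxRetryTimeout, interval time.Duration) error {
	// Compute the deadline ONCE, outside the loop. Recomputing it after
	// each failure is what would turn it into an infinite deadline.
	deadline := time.Now().Add(maxRetryTimeout)
	for {
		// Re-fetch on every attempt so the update carries a fresh version.
		d, err := c.Get(ctx)
		if err == nil {
			d.Replicas = replicas
			if err = c.Update(ctx, d); err == nil {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up after %s: %w", maxRetryTimeout, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// fakeClient is an in-memory fake that rejects the first `failures` updates.
type fakeClient struct {
	stored   deployment
	failures int // updates still to be rejected
	updates  int // total Update calls seen
}

func (f *fakeClient) Get(ctx context.Context) (deployment, error) { return f.stored, nil }

func (f *fakeClient) Update(ctx context.Context, d deployment) error {
	f.updates++
	if f.failures > 0 {
		f.failures--
		f.stored.ResourceVersion++ // simulate a concurrent writer
		return errConflict
	}
	if d.ResourceVersion != f.stored.ResourceVersion {
		return errConflict
	}
	d.ResourceVersion++
	f.stored = d
	return nil
}

// demo runs the loop against the fake and reports how many updates were tried.
func demo(failures, timeoutMillis int) (int, error) {
	f := &fakeClient{failures: failures}
	err := scaleWithDeadline(context.Background(), f, 3,
		time.Duration(timeoutMillis)*time.Millisecond, time.Millisecond)
	return f.updates, err
}

func main() {
	fmt.Println(demo(2, 1000))     // transient conflicts, then success
	fmt.Println(demo(1000000, 20)) // persistent failure, bounded by the deadline
}
```

In this sketch, a caller that never re-fetched would keep sending the same stale `ResourceVersion` and be rejected every time, and a caller that reset `deadline` inside the loop would never reach the give-up branch.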

Currently, #160 has been patched and released on the release branches as far as rel-v1.21, so these patch branches need to be updated as well.

The following should be part of the PR that fixes this issue:

Environment:

himanshu-kun commented 1 year ago

/assign @rishabh-11 @himanshu-kun

himanshu-kun commented 1 year ago

/priority critical