gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Retry for Machine reconciliation happening quicker than cache update leading to `the object has been modified` errors #767

Closed himanshu-kun closed 11 months ago

himanshu-kun commented 1 year ago

How to categorize this issue?

/area robustness /kind bug /priority 2

What happened: We have seen cases where the update of a machine object fails with `the object has been modified; please apply your changes to the latest version and try again` errors. Example:

I0113 11:04:02.491801       1 machine.go:509] Machine labels/annotations UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"

I0113 11:04:02.790670       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
I0113 11:04:02.822071       1 machine.go:537] Machine/status UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2" during creation
I0113 11:04:03.120387       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
I0113 11:04:25.815790       1 machine_util.go:628] Conditions of Machine "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" with providerID "azure:///northeurope/shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" and backing node "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" are changing

This can cause our ShortRetry or MediumRetry to kick in for the machine object, so the next reconcile may happen only after minutes, if not seconds. (Here it took around 20 seconds before the machine conditions started updating.) As a result, machine conditions may not update quickly, or the machine object may not reach Running quickly.

This quick push to the queue happens because we currently also enqueue machine objects on status updates. In small clusters this causes problems like the one described above, but in big clusters it is helpful: with many machines in the queue, a machine object's turn could come quite late, so a quick push to the queue helps reduce that wait.

What you expected to happen: The next machine reconcile is not delayed because of `the object has been modified` errors.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

himanshu-kun commented 1 year ago

One solution is to treat `the object has been modified` as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then the cache would have been updated as well. cc @rishabh-11

himanshu-kun commented 1 year ago

Here is a PR that uses controller-runtime to ignore status-change events when the status is semantically equal: https://github.com/apache/camel-k/pull/3285

Could be worth looking into when working on https://github.com/gardener/machine-controller-manager/issues/724
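
A rough sketch of what such an event filter could look like, not the code from the linked PR. `Machine`, `MachineStatus`, and `shouldEnqueueOnUpdate` are simplified, hypothetical stand-ins, and `reflect.DeepEqual` is used only to stay within the standard library. Real code would more likely use `equality.Semantic.DeepEqual` from `k8s.io/apimachinery/pkg/api/equality`.

```go
package main

import (
	"fmt"
	"reflect"
)

// MachineStatus and Machine are simplified, hypothetical stand-ins
// for the real API types.
type MachineStatus struct {
	Phase      string
	Conditions []string
}

type Machine struct {
	Name            string
	ResourceVersion string
	Labels          map[string]string
	Status          MachineStatus
}

// shouldEnqueueOnUpdate reports whether an update event carries a real change.
// ResourceVersion changes on every write, so it is cleared before comparing.
// The arguments are passed by value, so the callers' objects are not modified.
// Note: reflect.DeepEqual treats a nil slice and an empty slice as different.
func shouldEnqueueOnUpdate(oldObj, newObj Machine) bool {
	oldObj.ResourceVersion = ""
	newObj.ResourceVersion = ""
	return !reflect.DeepEqual(oldObj, newObj)
}

func main() {
	old := Machine{Name: "machine-1", ResourceVersion: "1", Status: MachineStatus{Phase: "Pending"}}

	same := old
	same.ResourceVersion = "2" // a write that left spec and status untouched

	changed := same
	changed.Status.Phase = "Running"

	fmt.Println(shouldEnqueueOnUpdate(old, same))
	fmt.Println(shouldEnqueueOnUpdate(old, changed))
}
```

A filter like this drops the no-op updates while still enqueueing genuine status transitions, which matters for the large-cluster case mentioned in the issue description.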

unmarshall commented 1 year ago

An alternative could be to use SSA (server-side apply). Also check reconstructive-controllers.

himanshu-kun commented 1 year ago

Google Groups discussion on this kind of issue: https://groups.google.com/g/kubebuilder/c/tULj-TRM9ts?pli=1

himanshu-kun commented 1 year ago

Solution decided post grooming

We saw that we face this problem primarily because of a stale cache. The earlier proposal was to let the cache sync by retrying the machine object after around 2 to 5 seconds:

> One solution is to treat `the object has been modified` as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then the cache would have been updated as well.

But then we decided to use the WaitForCacheSync function instead. Since the problem is currently seen only in the machine controller, we'll deal with it there by adding WaitForCacheSync right at the beginning of the reconcileClusterMachine func.