Closed by himanshu-kun 11 months ago.
A solution is to treat the `the object has been modified` error as a special case and re-push the object into the queue after around 2 to 5 seconds whenever it is seen. In that time, the cache would be updated as well.
cc @rishabh-11
A camel-k PR that ignores status-change events when the status is semantically equal, using controller-runtime: https://github.com/apache/camel-k/pull/3285
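The idea of dropping status events that carry nothing new can be sketched in plain Go. `Machine`, `MachineStatus`, `Condition` and `shouldEnqueueOnUpdate` are simplified, hypothetical types and names, not MCM's API. A real informer `UpdateFunc` would more likely compare with `equality.Semantic.DeepEqual` from `k8s.io/apimachinery/pkg/api/equality`, and the sketch assumes `Generation` changes only on spec edits.

```go
package main

import "reflect"

// Condition, MachineStatus and Machine are hypothetical, simplified stand-ins
// for the MCM Machine type; only the fields this sketch needs are modelled.
type Condition struct {
	Type   string
	Status string
	Reason string
}

type MachineStatus struct {
	Phase      string
	Conditions []Condition
}

type Machine struct {
	Name       string
	Generation int64
	Status     MachineStatus
}

// shouldEnqueueOnUpdate decides whether an informer update event should push
// the machine onto the work queue. It drops events where the generation is
// unchanged and the status is deeply equal, i.e. events with nothing new.
func shouldEnqueueOnUpdate(oldM, newM *Machine) bool {
	if oldM.Generation != newM.Generation {
		return true // assumed to mean the spec changed
	}
	return !reflect.DeepEqual(oldM.Status, newM.Status)
}
```

Because it only drops updates whose status is unchanged, a filter like this keeps the quick re-push on real status changes, which the issue notes is useful in big clusters.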
Could be worth looking into when working on https://github.com/gardener/machine-controller-manager/issues/724
An alternative could be to use SSA (server-side apply). Also check reconstructive-controllers.
Google Groups discussion on this kind of issue: https://groups.google.com/g/kubebuilder/c/tULj-TRM9ts?pli=1
We saw that we face this problem primarily because of a stale cache. Earlier, the proposal was to let the cache sync by retrying the machine object after around 2 to 5 seconds.
But then we decided to use the `WaitForCacheSync` function. Since the problem is currently seen only in the machine controller, we'll deal with it there by calling `WaitForCacheSync` right at the beginning of the `reconcileClusterMachine` func.
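A sketch of such a guard at the top of the reconcile function, in plain Go. `waitForCacheSync` imitates the shape of client-go's `cache.WaitForCacheSync(stopCh, cacheSyncs...)`, and `reconcileClusterMachine` here is a hypothetical, stripped-down stand-in that only takes a stop channel, a synced func, and a machine name; the real MCM function has a different signature.

```go
package main

import (
	"fmt"
	"time"
)

// InformerSynced has the same shape as client-go's cache.InformerSynced.
type InformerSynced func() bool

// waitForCacheSync polls the given synced funcs every interval until all of
// them report true (returns true) or stopCh is closed (returns false).
func waitForCacheSync(stopCh <-chan struct{}, interval time.Duration, synced ...InformerSynced) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		allSynced := true
		for _, s := range synced {
			if !s() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

// reconcileClusterMachine is a hypothetical, stripped-down stand-in: it waits
// for the machine cache before doing any work on the named machine.
func reconcileClusterMachine(stopCh <-chan struct{}, machineSynced InformerSynced, name string) error {
	if !waitForCacheSync(stopCh, 10*time.Millisecond, machineSynced) {
		return fmt.Errorf("machine cache not synced, skipping reconcile of %q", name)
	}
	// The actual reconcile logic for the machine would follow here.
	return nil
}
```

One caveat: in client-go, an informer's `HasSynced` reports that its initial list has been delivered, so this check by itself does not guarantee the cache already reflects the controller's own most recent update.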
How to categorize this issue?
/area robustness
/kind bug
/priority 2
What happened: We have seen cases where the update of the machine object fails with
`the object has been modified; please apply your changes to the latest version and try again`
errors (Example).

This can cause our `ShortRetry` or `MediumRetry` to kick in for the machine object, so the next reconcile may be delayed by seconds or even minutes (in the example, machine conditions only started updating after around 20 seconds). As a result, machine conditions may not update quickly, or the machine object may take a while to reach `Running`.

The quick push into the queue happens because we currently also enqueue machine objects on status updates. In small clusters this causes the problems described above, but in big clusters it helps: with many machines in the queue, a machine object's turn could come quite late, so a quick push to the queue reduces that wait.

What you expected to happen: The next machine reconcile is not delayed because of `object has been modified` errors.

How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`):