GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars 266 forks source link

Concurrent reconcile calls #427

Open shashken opened 3 years ago

shashken commented 3 years ago

In the setup of the Reconciler We use multiple 'For' and 'Owns' from the controller-runtime package.

I think this is causing a behavior (not sure if strange or by design) when reconcile is called for every change/event in those components.

I think this is causing the reconcile to run for every change, and every time reconcile is called its returning a Requeue result that makes the reconcile function run again, this causes the reconcile function to have an increasing number of concurrent runs (every time its triggered by a change it will keep running due to the requeue result).

I saw this when running the operator with a job for some time when I see a lot of duplicates in the log lines for the job. The impact I saw (which led me to investigate this) was to the savepoint triggers, Now it makes sense for me why I saw multiple SPs triggered simultaneously (And in my PR i actually mitigated the symptom and not the root cause).

I would really appreciate some feedback here as my go background is not as strong as the other languages and especially the usage of the https://github.com/kubernetes-sigs/controller-runtime package. @functicons @elanv this might help with your current effort. #420

shashken commented 3 years ago

bump :)

shashken commented 3 years ago

I'd love a second opinion on this. @functicons @elanv

elanv commented 3 years ago

The reconciliation loop pattern is the core of Kubernetes system. It is because the operator should watch every change events of its own resources to make actual state to desired state described in spec. Therefore I think the operator's behaviors like now seems natural. However, there seems to be room for more optimization of the event handling routine as you said, so I think it would be better if it can be improved.

shashken commented 3 years ago

Thanks for the response @elanv .

Now every time a resource changes and the reconcile finishes, it automatically requeues another reconciliation. causing the parallel amount of reconciles to always increase. I think its healthy to trigger reconcile every time a watched resource changes, but I don't think those reconciles be requeued Also maybe we should limit the amount of concurrent reconciles running in a job context? maybe features like RateLimiter or MaxOfRateLimiter can help? (not sure if I understood them perfectly)

elanv commented 3 years ago

Even so, the reconciliation routines for one object will not run in parallel. However, it seems worth to consider about allowing rate limit configuration via operator execution options.