eclipse-symphony / symphony

Symphony project
MIT License
37 stars 29 forks source link

Revisit parallelism in solution reconciler #248

Open msftcoderdjw opened 6 months ago

msftcoderdjw commented 6 months ago

https://github.com/eclipse-symphony/symphony/blob/17392d8c1b1decc110ca3cef28a8675e4dd45c9f/api/pkg/apis/v1alpha1/managers/solution/solution-manager.go#L220-L222

Now global lock is used in reconcile function.

Haishi2016 commented 6 months ago

Several thoughts: 1) we may want to consider using an Actor framework (such as Akka or Orleans) to allow individual instance/target to drive towards its own desired state in parallel; 2) for parallelism of job manager, we should use platform to scale it out - such as putting job vendor in a separate pod and scale the pod to multiple instances, which essentially implement the competing consumer pattern. 3) Also note that the solution manager is also used on agent side for reconciliation.

msftcoderdjw commented 6 months ago

Several thoughts: 1) we may want to consider using an Actor framework (such as Akka or Orleans) to allow individual instance/target to drive towards its own desired state in parallel; 2) for parallelism of job manager, we should use platform to scale it out - such as putting job vendor in a separate pod and scale the pod to multiple instances, which essentially implement the competing consumer pattern. 3) Also note that the solution manager is also used on agent side for reconciliation.

I agree. I think we can divide the fixes into several phases.

  1. Short term (single pod): we need to use fine-granularity lock (lock per deployment id) to replace the global lock in solution-manager. Then we can make sure multi deployments can be reconciled concurrently. This is also applicable for agent side in standalone deployment.
  2. Long term (multiple pods): consider evaluating actor framework, such as Akka or Orleans. (The alternative is to leverage distributed lock when scaling out the pod. I don't prefer to the distributed lock if the mature actor framework can avoid this.)
msftcoderdjw commented 4 months ago

Assign to @FireDefend for short term fix.