[X] I have searched the DSIPs and found no similar DSIP.
Motivation
Currently, DolphinScheduler uses the master slot to calculate the command slot, and uses the worker group mapping to select the worker that a task is dispatched to.
The problem is that this code in the master is difficult to maintain, and many bugs related to the node manager have come up. This PR aims to refactor the ServerNodeManager and split its code into separate components.
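For readers unfamiliar with the current behaviour, the two responsibilities mentioned above can be illustrated roughly as follows. This is a minimal sketch, not the actual ServerNodeManager code: the class `SlotSketch`, its method names, the "slot is the index in the sorted master list" rule, and the modulo-based worker pick are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; names and rules are illustrative, not taken from the code base.
final class SlotSketch {

    // Assumption: a master's slot is its index in the sorted list of live master addresses.
    static int masterSlot(List<String> liveMasters, String self) {
        List<String> sorted = new ArrayList<>(liveMasters);
        Collections.sort(sorted);
        return sorted.indexOf(self); // -1 if this master is not in the list
    }

    // Assumption: a command is handled by the master whose slot equals commandId mod masterCount.
    static boolean ownsCommand(long commandId, int masterCount, int slot) {
        return masterCount > 0 && slot >= 0
                && Math.floorMod(commandId, (long) masterCount) == slot;
    }

    // Look up the workers of a group and pick one; the modulo on the task id
    // stands in for whatever load-balancing policy is really used.
    static Optional<String> selectWorker(Map<String, List<String>> workerGroups,
                                         String group, long taskId) {
        List<String> workers = workerGroups.getOrDefault(group, List.of());
        if (workers.isEmpty()) {
            return Optional.empty();
        }
        int index = (int) Math.floorMod(taskId, (long) workers.size());
        return Optional.of(workers.get(index));
    }
}
```

Both calculations depend on an up-to-date view of the live masters and workers, which is why they end up tangled with node management code.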
Design Detail
The design is as follows:
ClusterManager: manages the metadata of all clusters, including the master clusters and the worker clusters.
MasterClusters: manages the metadata of the master clusters.
WorkerCluster: manages the metadata of the worker clusters, including the worker group mapping.
The key point is to separate the business code from the registry, so that the business code does not need to deal with the registry component.
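One possible shape for these components is sketched below. Only the class names ClusterManager, MasterClusters, and WorkerCluster come from this proposal; every method name (`onMasterAdded`, `getWorkers`, and so on) is hypothetical, and the sketch assumes that a separate registry adapter, not shown here, translates registry events into the `on...` calls.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

// Holds master metadata; in this sketch only a registry adapter calls the on... methods.
final class MasterClusters {
    private final Set<String> masters = new ConcurrentSkipListSet<>();

    void onMasterAdded(String address) { masters.add(address); }

    void onMasterRemoved(String address) { masters.remove(address); }

    // Immutable snapshot, sorted by address.
    List<String> getMasters() { return List.copyOf(masters); }
}

// Holds worker metadata, including the worker group mapping.
final class WorkerCluster {
    private final Map<String, Set<String>> workerGroups = new ConcurrentHashMap<>();

    void onWorkerAdded(String group, String address) {
        workerGroups.computeIfAbsent(group, g -> new ConcurrentSkipListSet<>()).add(address);
    }

    // A worker that goes offline is removed from every group it belonged to.
    void onWorkerRemoved(String address) {
        workerGroups.values().forEach(workers -> workers.remove(address));
    }

    // Immutable snapshot of a group's workers; empty if the group is unknown.
    List<String> getWorkers(String group) {
        return List.copyOf(workerGroups.getOrDefault(group, Set.of()));
    }
}

// Single entry point for business code; it never touches the registry.
final class ClusterManager {
    private final MasterClusters masterClusters = new MasterClusters();
    private final WorkerCluster workerCluster = new WorkerCluster();

    MasterClusters getMasterClusters() { return masterClusters; }

    WorkerCluster getWorkerCluster() { return workerCluster; }
}
```

With this split, business code such as the task dispatcher would read snapshots through ClusterManager, and swapping the registry implementation would only affect the adapter that feeds the `on...` methods.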
Compatibility, Deprecation, and Migration Plan
Compatibility
Test Plan
Tested by unit tests (UT) and end-to-end (E2E) tests.
Code of Conduct