The master and nodes for Gemini currently operate on a master-oriented model, where nodes individually make calls to the master to exchange information. Here is a non-exhaustive list of information that will be transferred:
- Ping/heartbeat
- Jobs to execute
- CPU/memory and other metrics
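To make the exchange concrete, here is a minimal in-memory sketch of one node-to-master call that carries a heartbeat and metrics and gets queued jobs back. Everything in it is an assumption for illustration, not Gemini's actual API: the `Master` class, `build_report`, the JSON field names, and the job shape are all hypothetical, and the HTTP transport is left out entirely.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical in-memory master: tracks node liveness and queued jobs."""
    last_seen: dict = field(default_factory=dict)     # node_id -> last heartbeat timestamp
    metrics: dict = field(default_factory=dict)       # node_id -> latest reported metrics
    pending_jobs: dict = field(default_factory=dict)  # node_id -> jobs waiting for that node

    def handle_report(self, raw: str) -> str:
        """Record one node report and reply with the jobs queued for that node."""
        report = json.loads(raw)
        node_id = report["node_id"]
        self.last_seen[node_id] = report["timestamp"]  # the report doubles as a heartbeat
        self.metrics[node_id] = report["metrics"]
        jobs = self.pending_jobs.pop(node_id, [])      # hand each job out only once
        return json.dumps({"jobs": jobs})


def build_report(node_id: str, cpu_percent: float, memory_percent: float) -> str:
    """Build the JSON body a node would send on each call to the master."""
    return json.dumps({
        "node_id": node_id,
        "timestamp": time.time(),
        "metrics": {"cpu_percent": cpu_percent, "memory_percent": memory_percent},
    })


# Usage: one job is queued for node-1, then node-1 checks in twice.
master = Master()
master.pending_jobs["node-1"] = [{"job_id": 1, "command": "echo hello"}]
first_reply = json.loads(master.handle_report(build_report("node-1", 12.5, 40.0)))
second_reply = json.loads(master.handle_report(build_report("node-1", 13.0, 41.0)))
```

In this sketch a single request covers all three items in the list, so the master never has to open a connection to a node.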
We may also take the other approach and choose a node-oriented model where a master initiates communication with nodes. Some points about this that I have in mind:
- Nodes will need to run a Flask server for the masters to hit. The server details would have to be exchanged on node startup. This also calls for additional firewall/routing rules.
- It is unclear how multiple masters would interact with each other and divide work.
- We will need to build retries into calls. For example, if a master calls a node when a job is submitted and that call fails, the master will have to schedule a retry.
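The retry point could look roughly like the sketch below. It is a simple blocking loop with exponential backoff rather than a scheduled retry, and the names `call_with_retries` and `notify_node`, as well as the attempt count and delays, are hypothetical choices for illustration. The failing call is a stand-in for a real master-to-node request.

```python
import time


def call_with_retries(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(); on a connection failure wait base_delay * 2**attempt, then retry.

    Re-raises the last error once max_attempts have been used up.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a master -> node call that fails twice, then succeeds.
attempts = {"count": 0}


def notify_node():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("node unreachable")
    return "accepted"


# Record the delays instead of sleeping so the example finishes instantly.
delays = []
result = call_with_retries(notify_node, sleep=delays.append)
```

A real master would probably not block on the loop and would instead put the failed call back on a queue, which is the extra scheduling work this point refers to.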
We can also choose a hybrid model, where some information is pushed to the master and some is pushed to the nodes.
/cc @ncatelli