OpenFabrics / sunfish_library_reference

The core Sunfish implementation
BSD 3-Clause "New" or "Revised" License

Sunfish: Agent Failover / Recovery #18

Open christian-pinto opened 4 months ago

christian-pinto commented 4 months ago

This issue tracks the design of the Sunfish Agent failover/recovery procedure.

Assumptions:

  1. The failing agent was already registered with Sunfish
  2. A failing agent will restart either with a backup of its state or from scratch. In both cases we assume the identifiers of the resources it manages are unique and do not change across agent restarts.

Discussion

There are a number of possible agent failures to be considered here:

  1. The Agent is registered with Sunfish, it received its UUID but it has not advertised any of the resources under its management.
  2. The Agent is registered with Sunfish, it received its UUID and it has already advertised some (or all) of the resources under its management.

For both cases we should consider the possibility that the agent is able to resume its state (i.e., recover its UUID and resource state) as well as the case where the agent restarts from scratch and the entire state is lost.

Case (1) is perhaps the easier one to manage. The Sunfish core's only record of the agent is the AggregationSource object, created and populated with the Agent UUID, with no resources listed in the ResourcesAccessed array. If the agent is able to recover its Sunfish UUID, it will restart the registration procedure (i.e., send an AggregationSourceDiscovered event) and include its previous UUID in the Context field of the event payload. In this case the Sunfish core will simply verify the existing AggregationSource and reply with an "ACK" (the entire AggregationSource object) to signify that the connection with the agent is re-established. Normal operations can then resume. If the agent restarts with no previous state, it will trigger the registration leaving the Context of the registration event empty. The Sunfish core will then treat this as a new registration, generate a new UUID, and so on. The pre-existing AggregationSource will be marked (somehow) as stale and removed by some (still undefined) garbage collection operation.
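To make the Case (1) flow concrete, here is a minimal sketch of how the core could branch on the Context field of the registration event. This is not the actual Sunfish implementation: the `Registry` class, its methods, and the simplified `AggregationSource` dataclass are hypothetical names used only for illustration. How the core identifies the stale record is still undefined, so `mark_stale` is shown as an explicit call.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class AggregationSource:
    # Simplified, hypothetical stand-in for the AggregationSource object.
    agent_uuid: str
    resources_accessed: list = field(default_factory=list)
    stale: bool = False


class Registry:
    """Hypothetical in-memory store of AggregationSource records."""

    def __init__(self):
        self.sources = {}

    def handle_registration(self, event):
        # `event` models an AggregationSourceDiscovered payload; only the
        # Context field matters for this sketch.
        previous_uuid = event.get("Context") or None
        known = self.sources.get(previous_uuid)
        if known is not None and not known.stale:
            # Agent recovered its UUID: reply with the existing object ("ACK").
            return known
        # Empty or unknown Context: treat it as a new registration.
        new_source = AggregationSource(agent_uuid=str(uuid.uuid4()))
        self.sources[new_source.agent_uuid] = new_source
        return new_source

    def mark_stale(self, agent_uuid):
        # Flag a record for a later (still undefined) garbage collection pass.
        self.sources[agent_uuid].stale = True
```

In this sketch an unknown UUID in Context falls back to a fresh registration; rejecting the event instead would be an equally valid choice.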

Case (2) is instead more complex, because one might want to reconcile the state of the resources already present in the Sunfish tree. As in the previous case, the agent can restart either with its previous state or from scratch. If it kept its state, the agent will communicate its pre-existing UUID in the Context field of the registration event. If not, a new UUID will be generated. Once the initial handshake is performed, the agent will automatically restart advertising its available resources.

If the UUID is recovered at registration, Sunfish will realize that each advertised resource is already in its tree and already marked as managed by that specific agent. Sunfish will access the resource from the Agent to make sure its version (state, configuration) is on par with the actual state on the agent side. This will be repeated for all resources advertised.

If the agent got a new UUID, it will start advertising resources with the new UUID as context, under the assumption that resource IDs do not change across agent restarts. Sunfish will find the existing resources in its tree marked with a different agent UUID. Sunfish could naively just accept that the UUID of the agent has changed and update all resources with the new UUID during discovery. However, how can we tell resources that come from an agent that failed and got a new UUID apart from resources of a totally new agent that, for whatever reason, have names conflicting with existing resources? It all probably boils down to: how do we guarantee that agents use unique names for their resources so that we have no conflicts between agents?
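The three situations Sunfish can run into when a resource is advertised in Case (2) can be sketched as follows. This is only an illustration under assumptions: the `tree` dictionary, the `Outcome` enum and the `on_resource_advertised` function are hypothetical and not part of the Sunfish code base. The conflicting-owner branch is deliberately left unresolved, since that is exactly the open question above.

```python
from enum import Enum


class Outcome(Enum):
    ADDED = "added"
    REFRESHED = "refreshed"
    CONFLICT = "conflict"


def on_resource_advertised(tree, resource_id, agent_uuid, agent_state):
    """Decide what to do with one advertised resource.

    `tree` is a hypothetical dict mapping a resource ID to a dict with
    "owner" (agent UUID) and "state" keys; `agent_state` is the state
    read back from the agent.
    """
    entry = tree.get(resource_id)
    if entry is None:
        # First time this resource is seen: add it to the tree.
        tree[resource_id] = {"owner": agent_uuid, "state": agent_state}
        return Outcome.ADDED
    if entry["owner"] == agent_uuid:
        # Same agent with a recovered UUID: align Sunfish's copy with the agent.
        entry["state"] = agent_state
        return Outcome.REFRESHED
    # Same resource ID but a different agent UUID: either a restarted agent
    # with a new UUID or an unrelated agent with a clashing name. The tree is
    # left untouched until the naming question is settled.
    return Outcome.CONFLICT
```

Returning a conflict instead of silently reassigning ownership keeps the naive "accept the new UUID" behaviour as an explicit policy decision rather than a default.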

christian-pinto commented 4 months ago

@cayton @mjaguil @mgazz I have started working on the agent failover bit; see above.