OpenFabrics / sunfish_library_reference

The core Sunfish implementation
BSD 3-Clause "New" or "Revised" License

Handle scenario where an Agent restarts and performs the registration multiple times. #13

Open mgazz opened 5 months ago

mgazz commented 5 months ago

Currently, if an Agent restarts multiple times:

Define a correct procedure for handling the internal state when an Agent registers multiple times.

rherrell commented 4 months ago

Don't have all the answers, but here are some helpful concepts from Gen-Z:
a) Hardware managers (FMs) that feed the Agent the inventory and status of the fabric components they manage apply some form of FM-created UUID that identifies the FM instance that created the inventory. The Agent includes a form of that UUID in a Fabric ID, which it supplies when it registers with Sunfish Core and sends the aggregated Redfish model for the resources it reports. This ties a Sunfish fabric model to a specific set of FM instances managing the associated hardware.
b) Each individual resource reported by the FM to the Agent has a unique UUID that is fixed to the hardware component.
c) An Agent that restarts can retrieve the current state of the FMs' resources. If the FMs have not reset, each will return the same FM instance UUID to the restarted Agent. When the Agent re-uses those same UUIDs to generate a Fabric ID, it should come up with the same Fabric ID as last time.
d) When the restarted Agent attempts to register the same Fabric ID and an inventory that contains (mostly) already present hardware UUIDs, this is a major tip-off to Sunfish Core that this 'new Agent' is managing an existing fabric and not a totally new one.
e) If it was the FMs that restarted and not the Agent, the FM instance UUID will change, and the Agent can detect the mismatch when the FM registers with the Agent. Since the Agent already has an inventory containing most or all of the same hardware UUIDs, the Agent can map the new FM UUID to the original FM UUID, which the Agent and Sunfish Core will continue to use as a handle for communications involving these resources.
f) When the FM restarts, it has to re-discover the fabric components in its management domain. It is pretty obvious whether the FM is crawling out a running fabric (components in 'managed' states) or a cold fabric (un-managed components). The FM can alert the Agent that it is reporting a 'cold start' fabric inventory.
g) The Agent and Sunfish Core need checks in their FM and Agent registration processes, respectively, to detect an upload of an existing fabric inventory and status, and to handle both scenarios unambiguously.
h) If the hardware of the fabric was not reset, the FM and the Agent shall transfer the current fabric inventory and state to Sunfish Core, preferably as a series of ResourceUpdated Events that eventually bring the resource models and inventory of all three entities (Sunfish Core, Agent, FM) into alignment with the hardware truth. Sunfish Core shall then issue the appropriate update Events to all correspondingly registered clients.
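To make points a), c), d) and e) concrete, here is a rough Python sketch of how a Fabric ID could be derived deterministically from the FM instance UUIDs, and how a registration handler might tell an Agent restart from an FM restart or a genuinely new fabric. None of this is existing Sunfish code: `FABRIC_NAMESPACE`, `fabric_id`, `classify_registration`, the result labels and the 50% overlap threshold are all assumptions made up for illustration.

```python
import uuid

# Hypothetical namespace UUID for Fabric IDs; any fixed value works,
# as long as every Agent instance uses the same one.
FABRIC_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def fabric_id(fm_instance_uuids):
    """Derive a Fabric ID that depends only on the set of FM instance UUIDs.

    Sorting makes the result independent of the order in which the FMs
    report, so a restarted Agent that sees the same FM instances
    regenerates the same Fabric ID (points a and c).
    """
    canonical = ",".join(sorted(str(uuid.UUID(str(u))) for u in fm_instance_uuids))
    return uuid.uuid5(FABRIC_NAMESPACE, canonical)


def classify_registration(known, incoming_id, hardware_uuids, overlap_threshold=0.5):
    """Classify a registration against already known fabrics.

    known: dict mapping a Fabric ID to the set of hardware UUIDs registered
    under it. Returns (kind, fabric_id_to_use).
    """
    incoming = set(hardware_uuids)
    if incoming_id in known:
        # Same Fabric ID as before: the Agent restarted (point d).
        return "agent_restart", incoming_id
    for existing_id, existing_hw in known.items():
        overlap = len(incoming & existing_hw) / len(incoming) if incoming else 0.0
        if overlap >= overlap_threshold:
            # New Fabric ID but mostly known hardware: the FM restarted,
            # so keep using the original ID as the handle (point e).
            return "fm_restart", existing_id
    return "new_fabric", incoming_id
```

The same check could sit in the Agent (for FM registrations) or in Sunfish Core (for Agent registrations), per point g). What to do with the result, e.g. replaying state as ResourceUpdated Events as in point h), is left out of the sketch.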