PolishookDavid / LAST_OCS

Code controlling the LAST project Observatory

more robust and automatic reconnection to orphan slaves #25

Closed EastEriq closed 4 days ago

EastEriq commented 1 month ago

considering all possible scenarios:

Check with #24. This new paradigm should augment or replace current unitCS methods like Unit.connectSlave and be integrated in Unit.connect and Unit.checkWholeUnit. It need not rely only on the core .spawn and .connect of SpawnedMatlab, as it does now.

EastEriq commented 1 week ago

SpawnedMatlab.terminate calls .kill, and also tries to kill by listeners too, but only if SpawnedMatlab.Messenger was created. OTOH, if the Messenger was not created, we don't have the port information to search for listeners.

So the logic for killing zombies involves first trying Unit.connectSlaves, which creates the slaves and sets their messengers, and then trying to either .kill or .terminate them. There seems to be room for doing better.
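
The ordering described above can be written down as a small decision sketch. This is not code from LAST_OCS: the struct array `slave` and its field `hasMessenger` are hypothetical stand-ins for "SpawnedMatlab.Messenger was created", and the step names are only labels for the methods mentioned in this thread.

```matlab
% Hypothetical stand-in state for four slaves (not the LAST_OCS classes).
% hasMessenger mirrors whether SpawnedMatlab.Messenger was created.
slave = struct('hasMessenger', {true, false, false, true});

plan = cell(1, numel(slave));
for i = 1:numel(slave)
    if slave(i).hasMessenger
        % Port information is available, so terminate can also
        % kill by listener.
        plan{i} = {'terminate'};
    else
        % No port information yet: connect first so that the
        % Messenger exists, then terminate.
        plan{i} = {'connect', 'terminate'};
    end
end

% Print the planned steps for each slave.
for i = 1:numel(slave)
    fprintf('slave %d: %s\n', i, strjoin(plan{i}, ' -> '));
end
```

The sketch only records which steps would be taken; it does not spawn or kill anything.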

EastEriq commented 5 days ago

Tested as effective when the master session is killed without terminating the slaves (the next Unit.connect detects the dangling slaves and reconnects to them). One glitch remains: the multipanel monitor sees the slaves as offline after the reconnection. I have to check what is happening to the MasterResponder in each slave.

EastEriq commented 4 days ago

Not exactly sure what is happening; I suspect a race on recreating Messenger objects. It seems that the MasterResponder on the slave side may be invalidated when calling connectSlave() on a slave that is already sane. This varies from time to time and between hosts, and perhaps has some relation to the master or slaves being silent or displaying in an xterm.

I.e.

>> Unit.Slave(1).terminate            
>> Unit.connectSlave(1)               
{14:03:29.832|obs.unitCS[01]} spawning slave 1
{14:03:56.183|obs.util.SpawnedMatlab[01_slave_1]} 01_slave_1 connected and initialized
>> Unit.Slave(1).Responder.areYouThere
ans =
  logical
   1
>> Unit.connectSlave(1)               
{14:04:20.745|obs.unitCS[01]} slave 1 already exists, will try to reconnect
{14:04:24.013|obs.util.SpawnedMatlab[01_slave_1]} 01_slave_1 connected and initialized
>> Unit.Slave(1).Responder.areYouThere
ans =
  logical
   1

This time it was OK, but other times it fails. If the MasterResponder does not answer, slave information is not shown in Multipanel. Usually the situation is saved by issuing an additional Unit.Slave(1:4).connect, but that can't be considered a solution, because it is unclear what the cause is and why the extra connect works.
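
To pin down when the responder stops answering, one option is to poll it for a bounded time after each connectSlave and record how long it takes to reply. The following is a minimal diagnostic sketch, not a fix for the suspected race. `probe` is a stand-in that starts answering after 0.2 s; with the real objects it could presumably be `@() Unit.Slave(1).Responder.areYouThere`, which is untested here. The timeout and interval values are arbitrary.

```matlab
% Stand-in for the responder check: answers true only after 0.2 s.
t0 = tic;
probe = @() toc(t0) > 0.2;

timeout  = 2;     % seconds to keep trying (arbitrary)
interval = 0.05;  % seconds between attempts (arbitrary)

attempts = 0;
alive = false;
tStart = tic;
while toc(tStart) < timeout
    attempts = attempts + 1;
    try
        alive = logical(probe());
    catch
        alive = false;   % treat an error from the probe as "no answer"
    end
    if alive
        break
    end
    pause(interval);
end

% Report whether the probe answered, after how many attempts and how long.
fprintf('alive=%d after %d attempt(s), %.2f s\n', alive, attempts, toc(tStart));
```

Logging the attempt count and elapsed time over repeated runs could show whether the failures correlate with host or display mode.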

EastEriq commented 4 days ago

Things seem to have improved with the last commits, which don't recreate the slave Units (and hence do not risk deleting the messenger objects their remote classes refer to, i.e. the MasterResponder). I think I can close.

[In fact, do we really need remote classes in the slave Units? I never fully populated them, and any non-interactive attempt to use them would get deep into the rabbit hole of the deadlock callback catastrophe, so it is better to stay away from the idea of building on them.]