[ ] I have added the tests to cover my changes.
[ ] I have updated the documentation, VERSION, and CHANGELOG accordingly.
This PR implements feature request #64.

While in `_poll_slurm`, we now do two things:

1. When `get_status` is called, any `conn.run()` commands are run through a new method, `run_remote_command_with_reconnect`.
2. A new method, `get_ssh_conn_status`, runs asynchronously alongside `get_status`. It polls the SSH connection with a test command and reconnects if the test command fails.

I have tested this manually on my laptop (2021 Mac M1 Pro, running macOS Ventura 13.3.1 [22E261]) by (i) running a trivial sample workflow that is dispatched to Perlmutter, (ii) turning off my Wi-Fi connection once execution is in `_poll_slurm`, and (iii) turning it back on again once the debug-level logger message:

appears. I can confirm that the SSH connection is re-established and the workflow runs to completion without error.
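
For reviewers who want the shape of the two mechanisms without reading the diff, here is a minimal sketch. It is not the code in this PR: `FakeConn`, `Session`, the function signatures, the retry count, and the polling interval are all assumptions made for illustration. Only the two method names come from the change itself.

```python
import asyncio


class FakeConn:
    """Hypothetical stand-in for an SSH connection; run() fails while not alive."""

    def __init__(self):
        self.alive = True

    async def run(self, cmd):
        if not self.alive:
            raise ConnectionError("connection lost")
        return f"ok: {cmd}"


class Session:
    """Hypothetical holder for the current connection; counts reconnects."""

    def __init__(self):
        self.conn = FakeConn()
        self.reconnects = 0

    async def reconnect(self):
        # A real executor would open a new SSH connection here.
        self.conn = FakeConn()
        self.reconnects += 1


async def run_remote_command_with_reconnect(session, cmd, retries=3):
    """Run cmd; on a connection error, reconnect and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return await session.conn.run(cmd)
        except OSError:  # ConnectionError is a subclass of OSError
            if attempt == retries:
                raise
            await session.reconnect()


async def get_ssh_conn_status(session, stop, interval=0.01):
    """Poll with a cheap test command until `stop` is set; reconnect on failure."""
    while not stop.is_set():
        try:
            await session.conn.run("echo ping")
        except OSError:
            await session.reconnect()
        await asyncio.sleep(interval)


async def demo():
    session = Session()
    stop = asyncio.Event()
    poller = asyncio.create_task(get_ssh_conn_status(session, stop))
    session.conn.alive = False   # simulate the Wi-Fi dropping
    await asyncio.sleep(0.05)    # give the poller time to notice and reconnect
    result = await run_remote_command_with_reconnect(session, "squeue")
    stop.set()
    await poller
    return result, session.reconnects
```

Calling `asyncio.run(demo())` exercises the same sequence as the manual test above: the connection drops, the poller restores it, and the next remote command succeeds.
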
Some things to note:

- This code only checks for connection drops in `_poll_slurm`, whereas the connection can drop and cause a hang at any call to `conn.run()`. To be thorough, we should use the new method `run_remote_command_with_reconnect` in place of every `conn.run()` call.
- Does this code also work with the plain SSH executor? Let's discuss.
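
On the first note, one possible way to cover every call site without editing each one is a thin proxy that exposes the same `run()` interface and retries internally. This is a sketch only: `ReconnectingConn`, `FlakyConn`, and the `connect` factory are hypothetical names, not part of this PR or of any library.

```python
import asyncio


class FlakyConn:
    """Hypothetical connection whose first `failures` calls to run() fail."""

    def __init__(self, failures):
        self.failures = failures

    async def run(self, cmd):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection lost")
        return f"ok: {cmd}"


class ReconnectingConn:
    """Proxy with the same run() interface, plus reconnect-and-retry."""

    def __init__(self, connect, retries=3):
        self._connect = connect    # async factory returning a fresh connection
        self._retries = retries
        self._conn = None

    async def run(self, cmd):
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = await self._connect()
                return await self._conn.run(cmd)
            except OSError:
                self._conn = None  # discard; the next attempt reconnects
                if attempt == self._retries:
                    raise


async def demo():
    # The first connection fails once; the replacement works.
    conns = iter([FlakyConn(1), FlakyConn(0)])

    async def connect():
        return next(conns)

    conn = ReconnectingConn(connect)
    return await conn.run("sbatch job.sh")
```

With a proxy like this, existing `conn.run()` calls would stay unchanged. The cost is that reconnects become invisible at the call site, so logging inside the proxy would matter.
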