AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0
27 stars 6 forks source link

Reconnect SSH after disconnect #71

Open jackbaker1001 opened 1 year ago

jackbaker1001 commented 1 year ago

This PR implements feature request #64 .

When in _poll_slurm, we now do two things:

  1. When get_status is called, any conn.run() commands are run through a new method run_remote_command_with_reconnect.

  2. A new method get_ssh_conn_status is run in async with get_status. The new method polls the ssh connection with a test command and reconnects on the test command's failure.

I have checked this rather manually on my laptop (2021 Mac M1 Pro, running MacOS Ventura 13.3.1 [22E261] ) by (i) running a trivial sample workflow which is disptached to Perlmutter and (ii) turning off my wifi connection once in poll slurm and (iii) Turning it back on again once the debug level logger message:

app_log.debug( f"SSH connection error {e}, reconnecting, attempt {self.ssh_connect_retries}...")

appears. I can confirm that the ssh connection becomes reestablished and the workflow runs to completion without error.

Some things to note are

  1. This code only checks for connection drops in _poll_slurm while connection can drop and cause hangs at any call to conn.run(). If we wanted to be super tight, we should use the new method run_remote_command_with_reconnect in place of any conn.run() command

  2. Does this code also check out with the plain ssh executor? Let's discuss.

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.