At the moment, whenever an SSH connection drops, the tester makes no attempt to reconnect, and just keeps hitting errors (usually EOF, I think). This means that a transient SSH drop eventually takes out the whole experiment.
The simplest solution would be for the SSH runner to check whether the connection is live when starting a session, and reconnect if not. A nicer medium-term solution would be a machine supervisor goroutine that locks out dead machines for a while and then retries them. This would also be a useful helper for when runners hit a large number of errors, and would tie into the proposed decoupling in #71 if the lockout causes other machines on the runner to fill in the lost space.
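
As a rough sketch of the simple option (check liveness when starting a session, redial if dead), something like the following might work. Everything named here is hypothetical: `Conn`, `Reconnector`, `NewReconnector` and `fakeConn` are stand-ins invented for illustration and are not the tester's actual types. How liveness is really checked on the SSH client is left as an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical stand-in for the runner's SSH client.
type Conn interface {
	// Alive reports whether the connection still works. In a real client
	// this might be a keepalive request (assumption, not the tester's API).
	Alive() bool
	Close() error
}

// Reconnector caches one connection and redials when it is found dead.
type Reconnector struct {
	mu   sync.Mutex
	dial func() (Conn, error)
	cur  Conn
}

// NewReconnector wraps a dial function; no connection is made until Get.
func NewReconnector(dial func() (Conn, error)) *Reconnector {
	return &Reconnector{dial: dial}
}

// Get returns a live connection, redialling first if the cached one is
// missing or dead. A session would be started on the returned connection.
func (r *Reconnector) Get() (Conn, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cur != nil && r.cur.Alive() {
		return r.cur, nil
	}
	if r.cur != nil {
		_ = r.cur.Close() // best effort; the connection is already dead
		r.cur = nil
	}
	c, err := r.dial()
	if err != nil {
		return nil, fmt.Errorf("reconnecting: %w", err)
	}
	r.cur = c
	return c, nil
}

// fakeConn simulates a connection that can be marked dead.
type fakeConn struct {
	id   int
	dead bool
}

func (f *fakeConn) Alive() bool  { return !f.dead }
func (f *fakeConn) Close() error { f.dead = true; return nil }

func main() {
	dials := 0
	r := NewReconnector(func() (Conn, error) {
		dials++
		return &fakeConn{id: dials}, nil
	})

	first, _ := r.Get()
	first.(*fakeConn).dead = true // simulate a transient SSH drop

	second, err := r.Get()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("dials:", dials, "reconnected:", second != first)
	// prints "dials: 2 reconnected: true"
}
```

This only covers drops that are noticed at session start; a connection that dies mid-session would still surface as an error, which is where the supervisor and lockout idea would come in.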