Having just had an experiment trashed by a transient error (no disk space!) triggering the current error handling routine (if N errors happen in a row, kill the tester), I'm proposing a change of error policy to something like this:
When an error happens, we re-attempt immediately.
If the re-attempt also fails, we wait 1 second before trying again.
After each further failure in a row, we double the wait (2 seconds, 4, and so on) until we reach some sort of upper limit, at which point we just keep waiting for that amount of time between attempts.
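To make the proposed schedule concrete, here is a minimal Python sketch. The names `backoff_delay` and `run_with_backoff`, the 60-second cap, and the idea that the failure count starts from zero on each call are all assumptions for illustration, not existing code in the tester.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(consecutive_failures: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after `consecutive_failures` errors in a row.

    1st failure -> 0 (retry immediately), 2nd -> base, 3rd -> 2 * base, ...
    never exceeding `cap` (the 60 s default is an assumed upper limit).
    """
    if consecutive_failures <= 1:
        return 0.0
    # Clamp the exponent so very long failure streaks cannot overflow a float.
    exponent = min(consecutive_failures - 2, 60)
    return min(cap, base * 2.0 ** exponent)


def run_with_backoff(task: Callable[[], T], sleep: Callable[[float], None] = time.sleep) -> T:
    """Keep re-attempting `task` until it succeeds, backing off between attempts."""
    failures = 0
    while True:
        try:
            return task()
        except Exception:
            failures += 1
            sleep(backoff_delay(failures))
```

Unlike the current "N errors in a row" rule, this never kills the tester: a long-lived fault such as a full disk just settles into one retry per `cap` seconds until someone fixes it.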
If we ever decouple machines and runners (per #71), or need to start/stop machines to perform updates (per #75), then the backoff could happen at a higher level, effectively acting as a timer for when the machine will next be considered for dispatch to a runner.
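A rough sketch of what that higher-level timer might look like, in Python. Everything here is hypothetical: the `Machine` class, its fields, `eligible_machines`, and the 60-second cap are invented for illustration and do not reflect whatever design #71 or #75 end up with.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    """Hypothetical machine record; the field names are assumptions."""
    name: str
    consecutive_failures: int = 0
    next_eligible_at: float = 0.0  # timestamp (seconds) when dispatch may resume

    def record_failure(self, now: float, base: float = 1.0, cap: float = 60.0) -> None:
        """Push back the next dispatch time using the doubling schedule."""
        self.consecutive_failures += 1
        if self.consecutive_failures == 1:
            delay = 0.0  # first error: eligible again immediately
        else:
            exponent = min(self.consecutive_failures - 2, 60)
            delay = min(cap, base * 2.0 ** exponent)
        self.next_eligible_at = now + delay

    def record_success(self) -> None:
        """A successful run clears the backoff (assumed behaviour)."""
        self.consecutive_failures = 0
        self.next_eligible_at = 0.0


def eligible_machines(machines: List[Machine], now: float) -> List[Machine]:
    """Machines whose backoff timer has expired and can be given to a runner."""
    return [m for m in machines if m.next_eligible_at <= now]
```

With this shape, no runner has to sleep while a machine is backing off; the dispatcher simply skips that machine until its timer expires.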