If Runner.submit throws an exception -- for instance because the batch system submission command has failed -- this can kill the whole scheduler (I observed this when using SlurmRunner, but it looks like the same thing could happen with other runners too).
The scheduler has some reasonable-looking failure-handling logic in _start_requests, but it only handles the case where an OSError is thrown, whereas a failure in the sbatch (or qsub, or whatever) command will normally lead to a CalledProcessError (a subclass of SubprocessError, which is not a subclass of OSError).
Thank you for spotting this. I had assumed that errors raised by subprocess calls were subclasses of OSError, but that turns out not to be the case. Broadening the exception types caught in _start_requests should fix this.
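For illustration, here is a minimal sketch of what the broadened handling might look like. The functions `submit` and `start_request` are hypothetical stand-ins for Runner.submit and the error handling in _start_requests, not the project's actual code, and it assumes the runner invokes the submission command with `subprocess.run(..., check=True)`.

```python
import subprocess
import sys


def submit(cmd):
    """Hypothetical stand-in for Runner.submit: run a submission command."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True, capture_output=True)


def start_request(cmd):
    """Hypothetical stand-in for the error handling in _start_requests."""
    try:
        submit(cmd)
    except (OSError, subprocess.SubprocessError) as exc:
        # OSError covers e.g. a missing executable; SubprocessError covers
        # CalledProcessError (non-zero exit) and TimeoutExpired.
        return f"submission failed: {type(exc).__name__}"
    return "submitted"


# A command that exits with status 1, standing in for a failing sbatch/qsub.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
# A command that does not exist, which raises an OSError subclass.
missing = ["no-such-submit-command-xyz"]

print(start_request(failing))
print(start_request(missing))
print(issubclass(subprocess.CalledProcessError, OSError))
```

Catching both OSError and subprocess.SubprocessError keeps the handler narrower than a bare `except Exception`, so genuine programming errors would still surface. Whether that is the right trade-off for the scheduler is a judgment call.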