bartongroup / slivka

http://bartongroup.github.io/slivka/
Apache License 2.0

Scheduler exits if a batch-queue submission fails #119

Closed dasmoth closed 2 years ago

dasmoth commented 2 years ago

If Runner.submit raises an exception, for instance because the batch system's submission command has failed, it can kill the whole scheduler. I observed this when using SlurmRunner, but it looks like the same thing could happen with other runners too.

The scheduler has some reasonable-looking failure-handling logic in _start_requests, but it only handles the case where an OSError is raised, whereas a failure in the sbatch (or qsub, or whatever) command will normally raise a CalledProcessError, a subclass of SubprocessError that does not derive from OSError.
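
To illustrate the mismatch, here is a minimal standalone sketch that does not use slivka code. The helper names failing_submit and classify_failure are hypothetical, and a Python child process exiting with status 1 stands in for a failing sbatch or qsub call, assuming the runner invokes the command with return-code checking enabled.

```python
import subprocess
import sys


def failing_submit():
    # Hypothetical stand-in for a failing sbatch/qsub call:
    # the child process exits with status 1, so check=True raises.
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        check=True,
    )


def classify_failure():
    # Report which handler actually catches the failure.
    try:
        failing_submit()
    except OSError:
        return "OSError"
    except subprocess.SubprocessError as exc:
        return type(exc).__name__
    return "no error"


if __name__ == "__main__":
    # CalledProcessError derives from SubprocessError, not from OSError.
    print(issubclass(subprocess.CalledProcessError, OSError))
    print(classify_failure())
```

A handler written as `except OSError` alone would let this exception propagate to the caller.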

warownia1 commented 2 years ago

Thank you for spotting this. I thought that errors raised by subprocesses were subclasses of OSError, but it turns out that's not the case. Broadening the exception types caught in _start_requests should fix this.
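
A rough sketch of what broadening the handler could look like follows. This is not the actual slivka implementation: start_requests, fake_submit and SUBMIT_ERRORS are hypothetical names, and the real _start_requests may track request state differently.

```python
import subprocess
import sys

# Assumption: both families of errors mean "this submission failed",
# not "the scheduler must stop".
SUBMIT_ERRORS = (OSError, subprocess.SubprocessError)


def start_requests(requests, submit):
    """Submit each request; collect failures instead of propagating them."""
    started, failed = [], []
    for request in requests:
        try:
            started.append((request, submit(request)))
        except SUBMIT_ERRORS as exc:
            # Record the failure and carry on with the next request.
            failed.append((request, exc))
    return started, failed


def fake_submit(request):
    # Hypothetical runner used only to exercise the loop above.
    if request == "bad":
        # Non-zero exit status raises CalledProcessError.
        subprocess.run(
            [sys.executable, "-c", "raise SystemExit(1)"],
            check=True,
        )
    if request == "missing":
        # A missing executable surfaces as an OSError subclass.
        raise FileNotFoundError("sbatch")
    return f"job-{request}"


if __name__ == "__main__":
    started, failed = start_requests(["a", "bad", "missing", "b"], fake_submit)
    print([request for request, _ in started])
    print([(request, type(exc).__name__) for request, exc in failed])
```

Catching the two types explicitly, rather than a bare Exception, keeps genuine programming errors visible while still covering both a missing executable and a non-zero exit status.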