SMART-Lab / smartdispatch

An easy to use job launcher for supercomputers with PBS compatible job manager.
Do What The F*ck You Want To Public License

Advanced handling of command termination when autoresume is engaged #151

Open ddtm opened 7 years ago

ddtm commented 7 years ago

Right now, the running command is added back to the pending list unconditionally. This is not always desirable. For example, termination may trigger checkpointing, which can fail; in that case it would be better to move the command to the finished list and record the returned error code.

A snippet to illustrate the idea:

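error_code = 0  # Assumed default: with no process to wait on, re-queue as before.
if sigterm_handler.proc is not None:
    # Wait for the terminated process and collect its exit status.
    error_code = sigterm_handler.proc.wait()
if sigterm_handler.command is not None:
    if error_code == 0:  # The command was terminated successfully.
        command_manager.set_running_command_as_pending(sigterm_handler.command)
    else:  # Termination failed (e.g. checkpointing error): record the error code.
        command_manager.set_running_command_as_finished(sigterm_handler.command, error_code)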
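
To try the idea outside of a worker, here is a minimal self-contained sketch. `InMemoryCommandManager`, `handle_termination`, and `run_demo` are hypothetical names made up for illustration; the stub only mimics the two command-manager methods used in the snippet above and keeps commands in plain lists rather than on disk. Re-queueing when there is no process to wait on is an assumption that mirrors the current unconditional behaviour.

```python
import subprocess
import sys


class InMemoryCommandManager:
    """Hypothetical stand-in for the real command manager (lists instead of files)."""

    def __init__(self):
        self.pending = []
        self.finished = []  # (command, error_code) pairs

    def set_running_command_as_pending(self, command):
        self.pending.append(command)

    def set_running_command_as_finished(self, command, error_code=0):
        self.finished.append((command, error_code))


def handle_termination(proc, command, manager):
    """Re-queue the command only if the terminated process exited cleanly."""
    if command is None:
        return None
    # Assumption: with no process to wait on, keep the current behaviour (re-queue).
    error_code = proc.wait() if proc is not None else 0
    if error_code == 0:
        manager.set_running_command_as_pending(command)
    else:
        manager.set_running_command_as_finished(command, error_code)
    return error_code


def run_demo(exit_code):
    """Run a child that exits with `exit_code` and report where its command ends up."""
    manager = InMemoryCommandManager()
    command = "python train.py"  # placeholder command string
    child = [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    proc = subprocess.Popen(child)
    handle_termination(proc, command, manager)
    return manager.pending, manager.finished
```

Calling `run_demo(0)` leaves the command in the pending list, while `run_demo(3)` moves it to the finished list together with its exit code. In a real worker the child would typically be killed by a signal, in which case `Popen.wait()` on POSIX returns a negative value and the command would also end up in the finished list.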

Relevant code: smartdispatch/workers/base_worker.py:54