Runner script failure diagnosis is tedious

dchackett commented 7 years ago

When a runner script fails, it prints some output (which ends up in the taxi log file) and returns some integer error code. The error code can be interpreted by looking at the runner script, but this is annoying. We would like more accessible information to diagnose why a given run_script task failed, preferably stored in the failed task itself (so no hunting around is required).

Bad solution: Taxis have tables (dicts) of error codes for each script to write in to failed tasks.

Good solution?: Instead of calling runner scripts from the command line, runner scripts are imported as modules (classes). Instead of making an external os.system(..., shell=True) call, taxi.py will simply call the appropriate runner module. This makes it vastly easier to communicate diagnostics from the runner to the taxi.

This approach could simultaneously solve #5, eliminating taxi.py's ability to run arbitrary shell commands based only off the content of the DB.

Would require encapsulating ("classing up") the functional part of the runner scripts, which can be done in a backwards-compatible way using if __name__ == "__main__": (which also allows the runner scripts to remain useful as stand-alones). As a pleasant side effect, runner scripts would be much easier to test.

Could also slightly reduce overhead by having fewer instances of python running, and no time cost of starting up python for every runner script (although this is probably negligible to begin with). On the other hand, this approach is slightly more vulnerable to memory garbage pileup (but probably not, if encapsulation is done well).

Potential issue: currently, if the runner scripts crash, taxi.py just marks them as failed and moves on. The same functionality can be achieved with runner modules like

try:
  runner.run()
except:
  runner_failed()

...but blanket exceptions are skeezy.

etneil commented 7 years ago

I've made it into this part of the code in my redesign work (#22), so I've been thinking about this issue. I like the "runners as Python functions/objects" design because I do think it solves #5 (security issues) nicely too, and it might also solve #23.

This also improves some aethestics about the Dispatch itself, I think. Right now things are organized around a "task_type", which can be "die", "copy", or "run_script" (or "spawn" and "respawn", but I think my refactoring around the Pool database obviates those last two.) Really, this could just be "run_script", since "copy" could certainly be implemented as a special case of "call arbitrary binary with arbitrary arguments from the command line" (but that leads us back to #5.)

Instead, every task stored in the Dispatch could be a JSON payload that reconstructs a Runner object of some kind. This would accommodate arbitrary sets of arguments, and would remove the security issue since parsing JSON doesn't allow for execution of malicious code (particularly if we only allow it to reconstruct Runner objects.) At the top level we define the basic Runner class, and then more application-specific instances can be written and imported by the user as needed.

Also: I think the blanket exception for the case you've indicated is actually perfectly fine: we want to know if runner.run() fails for any reason. Blanket exceptions are only a bad thing if you use them where you really just want to catch a particular exception, and then your code starts to silently fail for an unrelated reason.

I'll add a top-level "runner.py" to my modularize refactor and modify Dispatch to use it. (I won't try to rewrite every single existing runner in the new format just yet; that can wait until I've merged and we can both look at them.)

dchackett commented 6 years ago

Implemented the runners as objects/JSON payload scheme. We can now throw custom error messages with descriptive output which can be handled by the calling taxi.

dchackett / taxi

Runner script failure diagnosis is tedious #16