Closed adamnovak closed 1 week ago
It might be that the Kubernetes batch system is the only one with legitimate failure modes (like the user not having a config file) during construction.
We think this might not be possible anymore in the soon-to-be-merged new AWS job store.
I have a workflow that imports a bunch of files and then operates on them. I tried to run it on Kubernetes, but I didn't have my config set up properly, so the Kubernetes batch system threw an error during construction.
Problem 1: The Toil command didn't actually exit. It printed this and just sat there forever:
Problem 2: When I killed it and ran the same command with `--restart`, to try to re-use the imported files, Toil thought the run was already finished, but then failed to get the return value from the root job (because it wasn't finished).
We need to better handle the case of a batch system that won't construct. Ideally we should try to construct the batch system before letting user code do any file imports, to fail fast. We definitely need to fail properly, instead of just sitting there. And it would be nice if we left the job store in some kind of consistent state, or at least a state we knew how to recognize and complain cogently about later when trying to restart.
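To make the proposed ordering concrete, here is a minimal sketch of "construct the batch system first, then import". It is not Toil's actual code: `FakeJobStore`, `BatchSystemConstructionError`, `start_workflow`, `restart_workflow`, and the `startup_failed` marker are all hypothetical names invented for illustration.

```python
class BatchSystemConstructionError(Exception):
    """Hypothetical error for a batch system that cannot be constructed."""


class FakeJobStore:
    """Hypothetical stand-in for a job store (not Toil's real class)."""

    def __init__(self):
        self.imported = []           # URLs imported so far
        self.startup_failed = False  # marker a later restart can recognize

    def import_file(self, url):
        self.imported.append(url)


def broken_batch_system_factory():
    # Simulates a misconfigured backend, e.g. a missing Kubernetes config file.
    raise BatchSystemConstructionError("no Kubernetes config found")


def start_workflow(job_store, batch_system_factory, urls):
    """Construct the batch system before any imports, so we fail fast."""
    try:
        batch_system = batch_system_factory()
    except BatchSystemConstructionError:
        # Record the failure so a restart can complain clearly, then re-raise
        # so the command exits instead of hanging.
        job_store.startup_failed = True
        raise
    for url in urls:
        job_store.import_file(url)
    return batch_system


def restart_workflow(job_store):
    """Refuse to restart a run that never got past startup."""
    if job_store.startup_failed:
        raise RuntimeError(
            "Previous run failed before the workflow started; "
            "fix the batch system configuration and start a fresh run."
        )
    return "restarted"
```

With this ordering, a broken batch system configuration surfaces before any files are imported, and the marker gives a restart something definite to report instead of treating the run as finished.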
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-450