DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Failure during batch system construction hangs Toil, corrupts job store preventing a restart #2852

Closed: adamnovak closed this issue 1 week ago

adamnovak commented 4 years ago

I have a workflow that imports a bunch of files and then operates on them. I tried to run it on Kubernetes, but I didn't have my config set up properly, so the Kubernetes batch system threw an error during construction.

Problem 1: The Toil command didn't actually exit. It printed this and just sat there forever:

Traceback (most recent call last):
  File "/venv2/bin/toil-vg", line 8, in <module>
    sys.exit(main())
  File "/venv2/local/lib/python2.7/site-packages/toil_vg/vg_toil.py", line 403, in main
    construct_main(context, args)
  File "/venv2/local/lib/python2.7/site-packages/toil_vg/vg_construct.py", line 1799, in construct_main
    toil.start(init_job)
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 778, in start
    self._batchSystem = self.createBatchSystem(self.config)
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 926, in createBatchSystem
    return batchSystemClass(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py", line 79, in __init__
    raise RuntimeError('Could not load Kubernetes configuration. Does ~/.kube/config or $KUBECONFIG exist?')
RuntimeError: Could not load Kubernetes configuration. Does ~/.kube/config or $KUBECONFIG exist?
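One plausible explanation for the hang (an assumption on my part; the traceback does not show which thread kept the process alive) is a non-daemon background thread started before the batch system was constructed. CPython waits for all non-daemon threads at interpreter shutdown, so an uncaught exception in the main thread prints its traceback and then the process just sits there. A minimal illustration:

```python
import threading
import time

def background_loop():
    # Stands in for a long-lived service thread (job-store uploads,
    # statistics collection, etc.) -- purely illustrative, not Toil's
    # actual thread.
    while True:
        time.sleep(0.1)

# If this thread were started with daemon=False (the default), an
# uncaught exception in the main thread would print its traceback and
# then the interpreter would block forever waiting for the thread.
t = threading.Thread(target=background_loop, daemon=True)
t.start()

# With daemon=True the interpreter is free to exit despite the loop.
print(t.daemon)
```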

Problem 2: When I killed it and ran the same command with --restart, to try to re-use the imported files, Toil thought the run had already finished, but then failed to fetch the return value from the root job (because the run never actually finished).

adamnovak-make-graphs-6pwfc 2019-11-12 19:29:09,125 MainThread WARNING toil.common: Requested restart but the workflow has already been completed; allowing exports to rerun.
Traceback (most recent call last):
  File "/venv2/bin/toil-vg", line 8, in <module>
    sys.exit(main())
  File "/venv2/local/lib/python2.7/site-packages/toil_vg/vg_toil.py", line 403, in main
    construct_main(context, args)
  File "/venv2/local/lib/python2.7/site-packages/toil_vg/vg_construct.py", line 1801, in construct_main
    toil.restart()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 821, in restart
    return self._jobStore.getRootJobReturnValue()
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/abstractJobStore.py", line 226, in getRootJobReturnValue
    with self.readSharedFileStream('rootJobReturnValue') as fH:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 589, in readSharedFileStream
    info = self.FileInfo.loadOrFail(jobStoreFileID, customName=sharedFileName)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 942, in loadOrFail
    raise NoSuchFileException(jobStoreFileID, customName=customName)
toil.jobStores.abstractJobStore.NoSuchFileException: File 'rootJobReturnValue' (de90cdf4-4065-5763-8ae2-b71758c6f931) does not exist.
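The corrupt state here is recognizable: the job store looks "complete" but has no rootJobReturnValue shared file. A sketch of a restart guard that detects this and complains cogently (safe_restart and the job_store method names are hypothetical, not Toil's real API):

```python
class InconsistentJobStoreError(RuntimeError):
    """Job store claims the workflow finished but the result is missing."""

def safe_restart(job_store):
    # Hypothetical guard: catch the corrupt state -- "complete" flag set
    # but no rootJobReturnValue -- before reaching for the result, so the
    # user gets a clear message instead of a NoSuchFileException raised
    # deep inside the job store.
    if (job_store.workflow_complete()
            and not job_store.shared_file_exists('rootJobReturnValue')):
        raise InconsistentJobStoreError(
            'Job store looks complete but rootJobReturnValue is missing; '
            'startup probably failed part-way. Re-run without --restart.')
    return job_store.root_job_return_value()

class CorruptStore:
    # Mimics the state left behind by the failed Kubernetes startup.
    def workflow_complete(self):
        return True
    def shared_file_exists(self, name):
        return False
    def root_job_return_value(self):
        raise AssertionError('should not be reached')
```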

We need to better handle the case of a batch system that fails to construct. Ideally we should try to construct the batch system before letting user code do any file imports, so we fail fast. We definitely need to exit cleanly instead of hanging. And it would be nice to leave the job store in some kind of consistent state, or at least a state we know how to recognize and complain about cogently when a restart is attempted later.
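The fail-fast ordering above can be sketched with stand-in classes (FakeToil and its methods are illustrative, not Toil's real API): construct the batch system first, so a broken backend config aborts before any files land in the job store.

```python
class FakeToil:
    # Minimal stand-in to demonstrate the ordering, not Toil's real API.
    def __init__(self, batch_system_ok):
        self.batch_system_ok = batch_system_ok
        self.imported = []

    def create_batch_system(self):
        if not self.batch_system_ok:
            raise RuntimeError('Could not load Kubernetes configuration. '
                               'Does ~/.kube/config or $KUBECONFIG exist?')

    def import_file(self, path):
        self.imported.append(path)
        return 'file-id:' + path

def start(toil, files):
    # Construct the batch system BEFORE running user imports, so a
    # misconfigured backend aborts immediately and never writes files
    # into the job store that a later --restart would have to reason about.
    toil.create_batch_system()
    return [toil.import_file(f) for f in files]
```

With this ordering, a failed construction leaves the job store empty rather than half-populated, which also makes the restart case easier to detect.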

Issue is synchronized with this Jira Story. Issue Number: TOIL-450

adamnovak commented 4 years ago

It might be that the Kubernetes batch system is the only one with legitimate failure modes (like the user not having a config file) during construction.
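A cheap pre-flight probe could catch the missing-config case before any work starts. This is an approximation (a file's existence does not prove it parses, and the function name is my own): the Kubernetes client consults $KUBECONFIG, then ~/.kube/config, then the in-cluster service-account token.

```python
import os

def kubernetes_config_present():
    # Approximate pre-flight check: report whether any of the places the
    # Kubernetes client looks for configuration actually exists. Presence
    # does not guarantee validity, but absence guarantees the batch system
    # construction will fail, so we can refuse to start early.
    candidates = [
        os.environ.get('KUBECONFIG'),
        os.path.expanduser('~/.kube/config'),
        '/var/run/secrets/kubernetes.io/serviceaccount/token',
    ]
    return any(path and os.path.exists(path) for path in candidates)
```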

adamnovak commented 1 week ago

We think this might not be possible anymore in the soon-to-be-merged new AWS job store.