DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Job restart does not take in any new edits made to cwl #4298

Open boyangzhao opened 1 year ago

boyangzhao commented 1 year ago

The --restart option does not seem to take into account any edits made to the CWL file; it seems some copy of the original CWL is stored somewhere in the job store and used when restarting. Is there a way to instruct Toil to use the newly edited CWL instead of the cached one in the job store?

For example, if I run something like toil-cwl-runner --clean=never --jobStore file:<path> helloworld.cwl helloworld.str.job.yaml

Say helloworld.cwl had a bug and the job fails. If I go in and fix the bug in helloworld.cwl and then run

toil-cwl-runner --restart --jobStore file:<path> helloworld.cwl test.yaml

it doesn't use the newly corrected CWL; it uses some copy of the original. I can't find exactly where (and how) this is cached in the job store. Also, the test.yaml in the above command is completely ignored (even though Toil still requires me to specify a YAML file) and the original inputs are used instead. It's also not clear where those are stored in the job store.

Further, on the AWS job store all the file names are hashed, so I cannot tell what the files refer to (or where intermediate files from sub-workflow steps are located, if I had a more complex workflow).

Version of toil used: 5.7.1

Issue is synchronized with this Jira Story. Issue Number: TOIL-1253

adamnovak commented 1 year ago

It's not that it caches the CWL file in the job store; it's that it translates the CWL workflow steps into Python objects representing the individual jobs that need to run for the workflow to run (things like running a command line tool with arguments, or running a conditional section of the workflow, or whatever) and saves those.

After that is all set up, I don't think we consult the CWL file again, although the objects we parsed out of it (like File and Directory objects, and objects representing CWL tool definitions) can end up in fields in the job objects.

So by the time we start the workflow we've already basically compiled the CWL file to jobs, and we don't really consult it anymore, so changing it won't let you change the workflow.
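To illustrate what "compiled to jobs" means, here is a deliberately simplified, hypothetical sketch (the class and method names are made up and are not Toil's real CWL internals): the parsed tool and its resolved inputs live inside the pickled Job object, so the .cwl file on disk is never read again.

```python
# Hypothetical sketch only -- not Toil's real CWLJob machinery.
from toil.job import Job

class CompiledStepJob(Job):  # illustrative name
    def __init__(self, parsed_tool, inputs):
        super().__init__(cores=1, memory="256M", disk="1G")
        self.parsed_tool = parsed_tool  # Python object built from the CWL text at submit time
        self.inputs = inputs            # resolved input values, not the input .yaml file

    def run(self, fileStore):
        # Runs whatever was parsed when the workflow was submitted; editing
        # helloworld.cwl after this job was pickled has no effect here.
        return self.parsed_tool.execute(self.inputs)  # hypothetical method
```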

We could change the restart command not to need the CWL file or the inputs file, I think (if we could manage to decide what to upload where at the end, all from the return value of the workflow). But figuring out how to update an in-progress workflow when the CWL file changes is hard. We can handle this to some extent with Python workflows: we actually interpret the Python code for each job in order to run that job, so if you edit the Python code, the new code is what gets deployed when each job starts to run, and we try to unpickle the old objects with the new code and run them, which often works.
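For comparison, here is a minimal sketch of that Python-workflow behavior, using the standard Toil quickstart pattern: because the job function's code is interpreted when the job runs, editing it between a failed run and a --restart means the edited code is what executes (as long as the stored objects still unpickle against the new code).

```python
from toil.common import Toil
from toil.job import Job

def greet(job, name):
    # Suppose this line had a bug on the first run; fixing it here and
    # rerunning with --restart deploys the corrected code to the retried job.
    job.log(f"Hello, {name}!")
    return name

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    options.clean = "never"  # keep the job store around so a restart is possible
    with Toil(options) as toil:
        if options.restart:
            toil.restart()
        else:
            toil.start(Job.wrapJobFn(greet, "world"))
```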

But with CWL we aren't shipping the CWL files themselves and consulting them to see what to run at each point.

@mr-c How feasible do you think it would be to change the design here and still use cwltool's guts? I could imagine a setup where we actually just ship the CWL workflow the same way we ship Python code, as a Toil "Resource", and then instead of loading the CWL tools and workflows and stuff into Python objects on the leader and shipping those, we instead keep references to the CWL workflow tasks or steps that each job is supposed to represent, and have the worker load them fresh from the CWL files every time it needs them.
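Purely as a sketch of that idea (the names here are hypothetical, and load_cwl_step stands in for whatever cwltool-based loader would actually be used): the job would carry only a reference to the shipped CWL and the step it represents, and resolve it on the worker at run time.

```python
from toil.job import Job

def load_cwl_step(cwl_path, step_id):
    """Placeholder for a cwltool-based loader that re-parses cwl_path on the
    worker and returns the tool/step object identified by step_id."""
    raise NotImplementedError

class StepReferenceJob(Job):  # hypothetical class, for illustration only
    def __init__(self, cwl_resource_path, step_id, inputs):
        super().__init__()
        self.cwl_resource_path = cwl_resource_path  # CWL text shipped as a Toil "Resource"
        self.step_id = step_id                      # e.g. "#main/some_step"
        self.inputs = inputs                        # resolved inputs for this step

    def run(self, fileStore):
        # Re-load the step from the shipped CWL each time it runs, so whatever
        # the workflow file says now is what actually gets interpreted.
        step = load_cwl_step(self.cwl_resource_path, self.step_id)
        return step.execute(self.inputs)            # hypothetical method
```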

mr-c commented 1 year ago

> After that is all set up, I don't think we consult the CWL file again, although the objects we parsed out of it (like File and Directory objects, and objects representing CWL tool definitions) can end up in fields in the job objects.

That's correct.

> @mr-c How feasible do you think it would be to change the design here and still use cwltool's guts?

> I could imagine a setup where we actually just ship the CWL workflow the same way we ship Python code, as a Toil "Resource", and then instead of loading the CWL tools and workflows and stuff into Python objects on the leader and shipping those, we instead keep references to the CWL workflow tasks or steps that each job is supposed to represent, and have the worker load them fresh from the CWL files every time it needs them.

I think that would only work for workflows where the shape of the workflow is the same; perhaps only if just the contents of the CommandLineTool descriptions change. That might be enough for most people to make it worthwhile, I don't know.

For example, if sub-workflows are not evaluated and turned into Toil jobs until the step that contains them is ready for execution, then that helps a lot.

Or maybe you are saying, for each step in a workflow, we record a list of Toil Jobs and a hash of the step definition, and if the hashes match on restart then the cached output is used?
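Roughly, that hashing idea could look like this (illustrative only, not existing Toil code; step_definition is assumed to be the plain dict parsed from the CWL document):

```python
import hashlib
import json

def step_digest(step_definition):
    # Canonicalize the parsed step so the digest is stable across runs.
    canonical = json.dumps(step_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def can_reuse_cached_output(stored_digest, current_step_definition):
    # Matching digests mean the step text is unchanged, so its cached
    # output from the previous run would still be valid.
    return stored_digest == step_digest(current_step_definition)
```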

adamnovak commented 9 months ago

@DailyDreaming suggests an approach where, on restart, we load all the jobs and diff them against what we get from the current workflow text, and fix up the stored jobs to match the workflow if possible.

We'd have to adjust how we hook into pickling with promises to make that work.

He also raises the point that re-parsing the workflow text for every job might be slow. But that does seem to be the approach taken by e.g. Snakemake.
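A sketch of that diff-and-fix-up idea, with the caveat that the dict-of-dicts layout here is just an assumption for illustration and not Toil's real job store schema:

```python
def reconcile_jobs(stored, reparsed):
    """stored and reparsed both map step id -> step definition (plain dicts)."""
    reconciled = {}
    for step_id, new_def in reparsed.items():
        old_def = stored.get(step_id)
        if old_def == new_def:
            reconciled[step_id] = old_def   # unchanged: keep the stored job and its progress
        else:
            reconciled[step_id] = new_def   # edited or new: take the current definition
    return reconciled
```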

unito-bot commented 9 months ago

➤ Adam Novak commented:

Another approach would be to store the workflow progress in CWL terms, and when restarting, not do a Toil-level restart but instead make CWL job orders or something for everything remaining in the CWL workflow.
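As a rough illustration of that approach (field names and helpers are hypothetical): record which steps have completed and what they produced, then on restart emit a fresh CWL job order (input object) for each step that has not finished.

```python
def remaining_job_orders(all_step_ids, completed_outputs, resolve_inputs):
    """completed_outputs maps finished step id -> its outputs; resolve_inputs
    is a stand-in for the CWL-level wiring that builds a step's input object
    from upstream outputs and the original workflow inputs."""
    return {
        step_id: resolve_inputs(step_id, completed_outputs)
        for step_id in all_step_ids
        if step_id not in completed_outputs
    }
```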

unito-bot commented 9 months ago

➤ Adam Novak commented:

A fix for this should also solve this problem for WDL. We might want to try and unify some of the interpreter code while we’re at it, and use a common representation of “run this thing with these workflow files” across the two.
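One way to picture that common representation (hypothetical, not an existing Toil type): a small record that both the CWL and WDL interpreters could emit, naming the workflow document, the task or step within it, and the inputs to run it with.

```python
from dataclasses import dataclass

@dataclass
class WorkflowTaskRef:
    document: str   # path/URI of the .cwl or .wdl file shipped with the workflow
    language: str   # "cwl" or "wdl"
    task_id: str    # step or task identifier within the document
    inputs: dict    # resolved inputs for this invocation

    def key(self):
        # A stable identity a restart could use to match stored progress
        # against the current workflow text.
        return f"{self.language}:{self.document}#{self.task_id}"
```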

mr-c commented 9 months ago

For matching previous runs, which is basically memoization, see how cwltool handles caching: https://github.com/common-workflow-language/cwltool/blob/d2059c7dba480a93e9afc51e6e740979ebe6f6e8/cwltool/command_line_tool.py#L816
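For a flavor of what that memoization looks like in general (a generic sketch, not cwltool's actual code; see the link above for the real implementation): key the cache on a digest of the tool definition plus its inputs, and reuse recorded outputs on a hit.

```python
import hashlib
import json
import os

def cache_key(tool_definition, job_order):
    # Digest of the tool plus its inputs; any change to either misses the cache.
    blob = json.dumps({"tool": tool_definition, "inputs": job_order},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def cached_outputs(cache_dir, key):
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)   # cache hit: reuse the previously recorded outputs
    return None                   # cache miss: the tool has to actually run
```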