DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Option --clean=never is not being respected by toil-wdl-runner #2600

Open pb-cdunn opened 5 years ago

pb-cdunn commented 5 years ago

I use vocab.wdl and run this:

toil-wdl-runner --jobStore=toilWorkflowRun --logDebug --clean=never --stat vocab.wdl vocab.json

I get this:

INFO:toil.leader:Finished toil run successfully.
INFO:toil.common:Successfully deleted the job store: FileJobStore(/localdisk/.../toilWorkflowRun)

How do I keep the job-store with the WDL runner? Or does restart simply not work with completed WDL workflows?

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-36

adamnovak commented 5 years ago

You can't meaningfully restart a completed workflow. But the job store still can have useful stuff in it, like statistics.

It's definitely a bug that --clean=never is not being respected by toil-wdl-runner.

pb-cdunn commented 5 years ago

> You can't meaningfully restart a completed workflow. But the job store still can have useful stuff in it, like statistics.

Well, it worked for helloWorld.py, after changing from start() to restart(). But what if I change a task? I would expect the restarted workflow to skip the tasks up to the change, and then do the new task for the change? That's not how it works?

That is very important for development, and I am looking for any workflow engine which supports that kind of restart after change.

adamnovak commented 5 years ago

It sounds like you're thinking of Toil more like make: when you change things and rerun, make will rebuild everything that depends on what changed, and keep what doesn't depend on changed things the same. (Although, make doesn't rebuild things when the rules themselves change.)

Toil doesn't work like that; it just runs jobs, which may produce other jobs that need to run. If a job succeeds, it gets removed, and all the serialized arguments passed to it are deleted, so it can't be run again unless it is re-generated by re-running the job that issued it. If a job fails, Toil keeps the job around to be rerun, and it lets you edit your code to correct mistakes and continue the workflow, rerunning the formerly-failed jobs with the new code. But if the failed jobs failed because they are fundamentally flawed (e.g. they represent function calls with incorrect arguments), there's not much you can do; you can't go back and edit the job that created the failed jobs, which itself succeeded, and rerun from there. You can only start again from the beginning.

(There's a complication here also because Toil executes WDL by generating Python code; if you change your WDL workflow enough, adding and removing steps, I think Toil might generate different function names in the Python code, and the old job store from a failed workflow might be useless with the new generated Python code.)

A restart in Toil just picks up the outstanding jobs that already exist in the job store and tries running them. It doesn't compare the code that it currently has to the code that was originally used to run the workflow that created the job store, and it doesn't try to rerun the places where they differ. If there are no outstanding jobs to run in a job store, then it doesn't make sense to "restart", because there's no work left to do; all the jobs finished and got removed. In that case, you should just start again from the beginning with a fresh job store.
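The behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Toil's implementation: `ToyJobStore`, `run_outstanding`, and `restart` are hypothetical names, jobs here never spawn child jobs, and the error raised on an empty store reflects what the comment says *should* happen rather than what Toil currently does.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyJobStore:
    """Hypothetical stand-in for a job store: it holds only jobs that still need to run."""
    outstanding: Dict[str, Callable[[], None]] = field(default_factory=dict)


def run_outstanding(store: ToyJobStore) -> List[str]:
    """Try each outstanding job once and return the names of the jobs that succeeded."""
    succeeded = []
    for name in list(store.outstanding):
        try:
            store.outstanding[name]()
        except Exception:
            continue  # a failed job stays in the store so a restart can retry it
        del store.outstanding[name]  # a succeeded job is removed and cannot be rerun
        succeeded.append(name)
    return succeeded


def restart(store: ToyJobStore) -> List[str]:
    """Rerun whatever is still outstanding; nothing is compared against the old code."""
    if not store.outstanding:
        raise RuntimeError("no outstanding jobs; start again with a fresh job store")
    return run_outstanding(store)


# Demo: one job succeeds, one fails until its code is "fixed".
state = {"fixed": False}


def align() -> None:
    pass


def call_variants() -> None:
    if not state["fixed"]:
        raise ValueError("bug in the job body")


store = ToyJobStore({"align": align, "call_variants": call_variants})
first = run_outstanding(store)  # align succeeds and is removed
state["fixed"] = True           # stands in for editing the code between runs
second = restart(store)         # only call_variants is picked up again
```

After the second run the store is empty, so a further `restart(store)` has nothing to do and raises in this toy model.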

If you did manage to use restart() on a job store from a successfully completed Python workflow and have it run the whole workflow again from the top, that's weird. I don't think that should happen; it should give an error that there's no work to do. It might be that Toil is getting a bit confused; --restart is telling it not to fail because the job store already exists, but then when it goes to actually start doing work, it sees that there are no jobs needing to be re-run, so it decides it must not be restarting and begins from the start of the workflow. I don't think that that is what we want it to do.

pb-cdunn commented 5 years ago

I'm familiar with make, but no, I'm thinking of Toil like Cromwell, another WDL-based workflow runner. When I develop in Cromwell, I can get a workflow running, then add a subsequent task, and restart. The already completed tasks are still cached in the database, so they are "skipped". Unfortunately, skipping tasks wastes 5 seconds per task, which is too slow, so I thought I'd try Toil instead.
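The "skip already completed tasks" behaviour described here can be sketched as a cache keyed on a task's definition and inputs. This is a generic illustration of content-keyed caching, not Cromwell's actual implementation; `cache_key`, `call_cached`, and the in-memory `cache` dictionary are all hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

cache: Dict[str, Any] = {}  # stands in for the database of completed calls
executed: List[str] = []    # records which tasks actually ran


def cache_key(task_source: str, inputs: Dict[str, Any]) -> str:
    """Hash the task definition together with its inputs."""
    payload = json.dumps({"task": task_source, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def call_cached(name: str, task_source: str, inputs: Dict[str, Any],
                run: Callable[[Dict[str, Any]], Any]) -> Any:
    """Run the task only if this exact definition and input set has not been seen."""
    key = cache_key(task_source, inputs)
    if key not in cache:
        executed.append(name)
        cache[key] = run(inputs)
    return cache[key]


def run_workflow(extended: bool) -> int:
    doubled = call_cached("double", "x * 2", {"x": 21}, lambda i: i["x"] * 2)
    if extended:
        # A task appended later: only this call misses the cache on the rerun.
        return call_cached("add_one", "y + 1", {"y": doubled}, lambda i: i["y"] + 1)
    return doubled


run_workflow(extended=False)          # runs "double"
result = run_workflow(extended=True)  # reuses "double", runs only "add_one"
```

Because the key includes the task definition, editing a task changes its key and forces that task to rerun, while unchanged upstream tasks are served from the cache.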

Toil does a great job of resume-after-failure. But it does not support resume-after-extending-workflow. I can probably live without the latter, as that's rare. Once a workflow has been fully implemented, changes are usually localized to a single task.

However, I always need to be able to cd into the run-directory of a successful task so that I can re-run that bit while I alter my source-code. In that case, it's also helpful to be able to run-to-completion after my change. I'm surprised that's not a more common development model.

I guess I like Cromwell's architecture and Toil's speed. Thanks for answering my questions.

adamnovak commented 5 years ago

I don't believe that Toil uses the same run-directory model that Cromwell does, at the moment. For one thing, Toil supports job stores like AWS S3, which don't really have support for the hardlinks/symlinks you would use to provide a run-directory without duplicating all the data. We're thinking about implementing better inspection tools, though, and some kind of checkpointing for restarts in completed workflows, so we might eventually be able to support these sorts of features.

davidlougheed commented 3 years ago

Any word on this? I'm running into this as well and it's quite frustrating.

davidlougheed commented 3 years ago

I noticed this line: https://github.com/DataBiosphere/toil/blob/5538d4eea8279243504c234d34709e27cfee0e2c/src/toil/wdl/wdl_synthesis.py#L209. It seems to unconditionally set clean to always for WDL jobs.

DailyDreaming commented 3 years ago

@davidlougheed Yes, that's the culprit. I'll take a look and see about a fix.
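One general way to avoid this kind of override is to apply a runner-specific default only when the user did not pass the option. The following is a minimal sketch of that pattern, not the actual Toil fix; `parse_clean` and `runner_default` are hypothetical, and the list of accepted values is an assumption about what `--clean` takes.

```python
import argparse
from typing import List

# Assumed set of values accepted by --clean.
CLEAN_CHOICES = ["always", "onError", "never", "onSuccess"]


def parse_clean(argv: List[str], runner_default: str = "always") -> str:
    """Return the user's --clean value, falling back to the runner default only if unset."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not given" apart from an explicit choice.
    parser.add_argument("--clean", choices=CLEAN_CHOICES, default=None)
    options, _unknown = parser.parse_known_args(argv)
    if options.clean is None:
        options.clean = runner_default
    return options.clean


explicit = parse_clean(["--clean=never"])  # the user's choice is kept
fallback = parse_clean([])                 # the runner default applies
```

The key design choice is using `None` as a sentinel default, so an explicit `--clean=never` is never overwritten by the runner.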