Rerun workflows

vinjana commented 1 year ago

What

Many workflow engines have a rerun feature that allows the user to restart the workflow at a safe stage after it was stopped ungracefully.

Why

Workflows may end in a terminal state for many reasons, including unrecoverable errors in the infrastructure (e.g. batch processing system, storage). Some of these conditions, such as certain infrastructure problems, may be covered by automatic restarts. Others, such as workflow bugs or an inability to deal with unexpected inputs, are best handled by cancellation or automatic termination of the workflow. One valid situation that we repeatedly observe is a workflow failing because some odd input data requires 10 times the memory or takes 10 times longer than a "normal" sample (e.g. normal cancer sample vs. chromothripsis; some optimization algorithms with occasionally bad convergence characteristics).

In principle, workflow runs may be started from the beginning, and indeed some workflow engines do not even provide a restart feature. (Restarting from the beginning, i.e. with a new RunRequest, is in any case the safer solution if one does not trust an engine's rerun feature.)

However, restarting workflows at a later stage is still interesting, because it may save resources (CPU, IO, energy/CO2) and time, which is important in some application areas, like the timely processing of patient data.

How

The request would just start exactly the same workflow again, i.e., use the same parameters. There may also be use-cases where a parameter may differ for a rerun. For instance, it may be necessary to increase the memory and time resources for an odd sample.

The rerun may be applied to a workflow in any terminal state, such as CANCELED, EXECUTOR_ERROR, or SYSTEM_ERROR, but obviously not to one in a non-terminal state, like RUNNING or QUEUED.

The correct rerunning itself should be left to the workflow engine. We should not care whether the engine or even the workflow allows for this rerunning (that's outside the responsibility of a WES, I think).

This could be implemented as a separate RerunRequest route (e.g. with the run-id in the route). The whole feature should be optional, because it does not make sense to require the implementation of a RerunRequest API endpoint for a workflow engine that cannot do reruns.
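
A minimal sketch of the checks such a route might perform, assuming a hypothetical helper `build_rerun_request` behind a hypothetical route like `POST /runs/{run_id}/rerun`; neither exists in the WES specification. The state names are WES states, but the exact set treated as terminal here, the dict representation of the request, and the `memory_gb` parameter are assumptions for illustration.

```python
from typing import Any, Dict, Optional

# States after which a run no longer changes (assumed set). COMPLETE is
# included here, although a server might prefer to reject reruns of
# successful runs.
TERMINAL_STATES = {"COMPLETE", "CANCELED", "EXECUTOR_ERROR", "SYSTEM_ERROR"}


class RerunNotAllowed(Exception):
    """Raised when a run is still active and therefore cannot be rerun."""


def build_rerun_request(
    previous_request: Dict[str, Any],
    state: str,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the request for a rerun (hypothetical helper).

    The previous request is copied unchanged, except that `overrides`
    may replace individual workflow parameters, e.g. to raise memory.
    """
    if state not in TERMINAL_STATES:
        raise RerunNotAllowed(f"run is in non-terminal state {state}")
    request = dict(previous_request)
    params = dict(previous_request.get("workflow_params", {}))
    params.update(overrides or {})
    request["workflow_params"] = params
    return request
```

For the odd-sample case, a client would call something like `build_rerun_request(previous, "EXECUTOR_ERROR", {"memory_gb": 80})`. Whether the engine then actually resumes from an intermediate stage remains the engine's business.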

uniqueg commented 1 year ago

Very reasonable use case, of course.

But is there anything stopping you from implementing this functionality without there being a specific endpoint to trigger it? Just like, e.g., Snakemake does it? From a WES client's point of view, surely it's trivial to implement re-run functionality, either with identical or updated parameters ("just re-trigger that previous run as is or give me a copy of the form so that I can modify it first"). And a WES can surely (try to) identify re-runs or partial re-runs and reuse cached artifacts, from containers to input data to intermediate data to run checkpoints, regardless of whether the client specifically asks for it or not. And this could be restricted to a particular user's previous runs. Or not. Possibly depending on the tradeoff between security/privacy considerations and costs/performance. Or we could add a boolean field for do_not_reuse_artifacts for cases where we want to force a fresh run.
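
One way a server could recognize a re-run without a dedicated endpoint is to fingerprint incoming requests. The following is only a sketch under assumptions: `request_fingerprint`, `find_reusable_run`, and the `cache_index` mapping are hypothetical names, the request is modelled as a plain dict, and only exact matches are detected (partial re-runs would need engine support).

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def request_fingerprint(run_request: Dict[str, Any]) -> str:
    """Hash a run request so that identical requests map to the same key."""
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(run_request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_reusable_run(
    run_request: Dict[str, Any],
    user: str,
    cache_index: Dict[Tuple[str, str], str],
    do_not_reuse_artifacts: bool = False,
) -> Optional[str]:
    """Return the ID of an earlier identical run by the same user, if any.

    `cache_index` maps (user, fingerprint) to a run ID; keying on the user
    restricts reuse to that user's own previous runs.
    """
    if do_not_reuse_artifacts:
        return None  # client asked for a fresh run
    return cache_index.get((user, request_fingerprint(run_request)))
```

Dropping the user from the key would give the global variant, with the security and privacy tradeoff mentioned above.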

A couple of other thoughts:

OpenAPI 3 does not formally provide for optional paths or methods. On the other hand, the choice to implement any endpoint is entirely optional, of course. Maybe the question is, then, at which point do we start/stop calling an implementation a WES implementation? Is GET /tasks essential? Or how about a read-only WES?
Actually, any WES has the ability to re-run a workflow, independent of the engine. Only the ability to reuse artifacts of a previous run requires support by the engine (and the system, which needs to be able and willing to preserve such artifacts). So if we decide to go for a specific route for this use case, in principle we do not really need to make it optional.
The functionality could also be provided through an (optional) parameter reuse_from in the form, which takes the ID of a previous run. The engine could then try to reuse whatever it can from that workflow run (provided the user has access, else 403).
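
A rough sketch of how a server might resolve reuse_from, assuming a hypothetical `StoredRun` record and `resolve_reuse_from` helper. Only the 403 case comes from the suggestion above; answering 404 for an unknown run ID and using ownership as the access rule are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StoredRun:
    run_id: str
    owner: str
    work_dir: str  # where the engine left intermediate data (hypothetical)


def resolve_reuse_from(
    reuse_from: Optional[str], user: str, runs: Dict[str, StoredRun]
) -> Tuple[int, Optional[StoredRun]]:
    """Map a `reuse_from` form value to an HTTP status and the run to reuse."""
    if reuse_from is None:
        return 200, None  # ordinary run, nothing to reuse
    run = runs.get(reuse_from)
    if run is None:
        return 404, None  # assumption: unknown run IDs are reported as 404
    if run.owner != user:
        return 403, None  # user has no access to the previous run
    return 200, run
```

What the engine does with the returned run (checkpoints, intermediate files, containers) is left open here, as in the proposal.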

vsmalladi commented 1 year ago

I like the reuse_from as a field that can provide a previous ID for re-run parameters.

uniqueg commented 1 year ago

> I like the reuse_from as a field that can provide a previous ID for re-run parameters.

Thinking about it again, I'm not sure how it would work though. And isn't there a danger that it may raise expectations on a given WES to support caching for re-runs? I mean, if it's just about getting the form parameters, a client could also look at Log to get them.

What I do think would be good to have to address this issue:

A caching parameter in a capabilities section in the service info that a server instance could use to broadcast whether it is able to reuse old runs or not. This could either be a boolean, an enum of pre-defined values (e.g., none, user, global to define that an implementation may be able/willing to check for cached artifacts only in the user or in the global space) or even a free enum where implementations would then provide descriptions of what the caching values mean (and which MUST at the very least include none to allow users to disable caching).
A use_cache parameter in the RunRequest form to allow the user to pick one of the supported caching strategies or explicitly disable the use of caching.
Possibly another parameter, e.g., cache_run to enable the client to indicate whether it is okay for the server to cache artifacts of the run to be used by, e.g., the same user in the future, or all users of that WES.

With these fields, instances could broadcast hints as to their ability to use caches for re-runs (and their limitations). It would also enable the client to opt out of caching for security reasons or in case re-runs fail because of broken caches, etc.
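
To make the interplay of these fields concrete, here is a small sketch of how a server might validate use_cache against what it advertises. It assumes the enum variant of caching, a service info shaped like `{"capabilities": {"caching": [...]}}`, and a hypothetical helper `select_cache_strategy`; treating a missing use_cache as none is also an assumption.

```python
from typing import Any, Dict


def select_cache_strategy(service_info: Dict[str, Any], use_cache: str = "none") -> str:
    """Validate the client's `use_cache` choice against the advertised values."""
    # A server that advertises nothing is treated as offering no caching.
    supported = service_info.get("capabilities", {}).get("caching", ["none"])
    if "none" not in supported:
        # The proposal requires that caching can always be disabled.
        raise ValueError("service info must always offer 'none'")
    if use_cache not in supported:
        raise ValueError(
            f"unsupported caching strategy {use_cache!r}; "
            f"supported: {sorted(supported)}"
        )
    return use_cache
```

A server advertising `["none", "user"]` would accept `use_cache="user"` and reject `use_cache="global"`, so clients learn from the service info alone what they can ask for.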

And I think @vinjana's use case could be solved by extending the documentation for implementers as to how we imagine that caches could be used via the regular POST /runs request when a sufficiently similar request comes in.

ga4gh / workflow-execution-service-schemas

Rerun workflows #203
