Open AlexTate opened 8 months ago
This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:
https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868/5
Attention: Patch coverage is 57.25191% with 56 lines in your changes missing coverage. Please review.
Project coverage is 77.06%. Comparing base (73b742f) to head (105fee9).
There is a different number of reports uploaded between BASE (73b742f) and HEAD (105fee9).
HEAD has 5 uploads less than BASE

| Flag | BASE (73b742f) | HEAD (105fee9) |
|------|------|------|
| | 17 | 12 |
Summary
This pull request introduces a new choice, `kill`, for the `--on-error` parameter.
Motivation
There is currently no way to make cwltool stop parallel jobs immediately when one of them fails. One might expect `--on-error stop` to accomplish this, but its help string is specific and accurate: "do not submit any more steps". Because scatter and subworkflow are each treated as a single "step" within the parent workflow, cwltool is not wrong to wait for the rest of that step's parallel jobs to finish under `--on-error stop`. However, individual scatter jobs can take a long time to complete, so if one of them fails early, cwltool may wait a long time for the other scatter jobs to finish before terminating the workflow. With `--on-error kill`, all running jobs are quickly notified and self-terminate upon one job's failure.
Demonstration of the Issue
When running the following workflow with `cwltool --parallel --on-error stop`, the total runtime is ~33 seconds despite one of the scatterstep tasks terminating unexpectedly. Ideally the workflow would terminate immediately. `--on-error kill` accomplishes that.
Forum Post
https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868
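The fail-fast behavior described under Motivation can be illustrated with a small, self-contained sketch. This is not cwltool's implementation: `run_job` and `run_scatter` are hypothetical names, the jobs are simulated with sleeps rather than real subprocesses, and the shared `threading.Event` is only assumed to play the role that `kill_switch` plays in this pull request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(duration, fail_at, kill_switch, poll=0.01):
    """Simulate one scatter job; return 'ok', 'failed', or 'killed'.

    fail_at is the number of seconds after which this job fails,
    or None if it should run to completion.
    """
    start = time.monotonic()
    while time.monotonic() - start < duration:
        # Another job failed: stop early instead of running to completion.
        if kill_switch.is_set():
            return "killed"
        # This job fails: notify every other running job.
        if fail_at is not None and time.monotonic() - start >= fail_at:
            kill_switch.set()
            return "failed"
        time.sleep(poll)
    return "ok"


def run_scatter(jobs):
    """Run all jobs in parallel; jobs maps name -> (duration, fail_at)."""
    kill_switch = threading.Event()  # shared by every job in the scatter
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = {
            name: pool.submit(run_job, duration, fail_at, kill_switch)
            for name, (duration, fail_at) in jobs.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

With three simulated 5-second jobs where one fails after 0.05 seconds, `run_scatter` returns almost immediately, with the failing job reported as `failed` and the others as `killed`, rather than waiting for the full 5 seconds.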
Concerns
`workflow_eval_lock.release()` had to be moved to the `finally` block in `MultithreadedJobExecutor.run_jobs()`
`JobBase._execute()` due to `if runtimeContext.kill_switch.is_set(): return`? For that matter, shouldn't there be a `finally` block to contain some of these steps, such as deleting runtime-generated files containing secrets?
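The second concern can be made concrete with a minimal sketch of the pattern it asks about. Everything here is an assumption for illustration: `execute_job` is a hypothetical function, not `JobBase._execute()`, the temporary file merely stands in for a runtime-generated secrets file, and a plain `threading.Lock` stands in for `workflow_eval_lock`. The point is only that cleanup placed in a `finally` block still runs when the function returns early because the kill switch is set.

```python
import os
import tempfile
import threading


def execute_job(kill_switch, eval_lock, work):
    """Run work() unless the kill switch is set; always clean up.

    Returns (status, secrets_path) so callers can verify the cleanup.
    """
    # Stand-in for a runtime-generated file containing secrets.
    fd, secrets_path = tempfile.mkstemp(prefix="secrets-")
    os.close(fd)
    try:
        # The with-statement releases the lock on every exit path,
        # exactly as a finally block around release() would.
        with eval_lock:
            if kill_switch.is_set():
                return "killed", secrets_path  # early return
            work()
            return "ok", secrets_path
    finally:
        # Runs on normal return, early return, and exceptions alike.
        os.remove(secrets_path)
```

Without the `finally` block, the early return on the kill switch would skip the removal and leave the secrets file on disk.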