Adding new choice to --on-error

AlexTate commented 8 months ago

Summary

This pull request introduces a new choice, kill, for the --on-error parameter.

Motivation

There is currently no way to make cwltool stop parallel jobs immediately when one of them fails. One might expect --on-error stop to accomplish this, but its help string is specific and accurate: "do not submit any more steps". Because a scatter or subworkflow is treated as a single "step" within the parent workflow, cwltool is not wrong to wait for the rest of that step's parallel jobs to finish under --on-error stop. However, individual scatter jobs sometimes take a long time to complete, so if one of them fails early, cwltool may wait a long time for the others to finish before terminating the workflow. With --on-error kill, all running jobs are notified quickly and terminate themselves when one job fails.
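To illustrate the intended behaviour, here is a minimal, standalone sketch of the kill-switch pattern. It is not the code in this PR: it assumes the switch behaves like a threading.Event shared by all parallel jobs, and run_job and run_scatter are hypothetical names used only for this example.

```python
import subprocess
import sys
import threading
import time


def run_job(cmd, kill_switch, poll_interval=0.05):
    """Run cmd as a subprocess (hypothetical helper, not cwltool's API).

    Sets kill_switch if the process fails, and terminates the process
    early if another job has already set the switch.
    """
    proc = subprocess.Popen(cmd)
    while True:
        returncode = proc.poll()
        if returncode is not None:
            if returncode != 0:
                kill_switch.set()  # tell every other running job to stop
                return "failed"
            return "success"
        if kill_switch.is_set():
            proc.terminate()  # another job failed, so stop this one now
            proc.wait()
            return "killed"
        time.sleep(poll_interval)


def run_scatter(cmds):
    """Run all commands in parallel threads that share one kill switch."""
    kill_switch = threading.Event()
    results = [None] * len(cmds)

    def worker(index):
        results[index] = run_job(cmds[index], kill_switch)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(cmds))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    failer = [sys.executable, "-c", "import sys, time; time.sleep(0.5); sys.exit(1)"]
    start = time.monotonic()
    print(run_scatter([sleeper, sleeper, failer]))
    print(f"elapsed: {time.monotonic() - start:.1f}s")
```

In this sketch the two long sleepers are terminated shortly after the third job fails, instead of running to completion as they would under a "stop submitting new steps" policy.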

Demonstration of the Issue

When the following workflow is run with cwltool --parallel --on-error stop, the total runtime is ~33 seconds even though one of the scatterstep jobs is killed about one second in. Ideally the workflow would terminate immediately; --on-error kill accomplishes that.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

inputs:
  sleeptime:
    type: int[]
    default: [ 33, 33, 33, 33, 33 ]
outputs: { }
requirements:
  - class: ScatterFeatureRequirement

steps:
  scatterstep:
    in: { sleeptime: sleeptime }
    out: [ ]
    scatter: sleeptime
    run:
      class: CommandLineTool
      baseCommand: sleep
      inputs:
        sleeptime: { type: int, inputBinding: { position: 1 } }
      outputs: { }
  kill:
    in: { }
    out: [ ]
    run:
      class: CommandLineTool
      baseCommand: [ 'bash', '-c' ]
      arguments:
        - |
          # Wait 1 second for scatter to spin up then select a random sleep process to kill
          sleep 1
          ps -ef | grep 'sleep 33' | grep -v grep | awk '{print $2}' | shuf | head -n 1 | xargs kill -9
      inputs: { }
      outputs: { }

Forum Post

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868

Concerns

workflow_eval_lock.release() had to be moved to the finally block in MultithreadedJobExecutor.run_jobs()
Are any important steps skipped in JobBase._execute() because of the early exit on if runtimeContext.kill_switch.is_set(): return? For that matter, shouldn't some of these steps, such as deleting runtime-generated files that contain secrets, be moved into a finally block?
The kill switch response in TaskQueue is fairly loose. Since the response is handled primarily at the job level, any tasks that start after the kill switch is activated will take care of themselves and self-terminate.
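The first two concerns can be sketched in isolation. The example below is not cwltool code: execute_job, its arguments, and the temporary "secret" file are hypothetical stand-ins, and the kill switch is assumed to be a threading.Event. It only shows that a finally block still releases the lock and removes the file when the function returns early or raises.

```python
import os
import tempfile
import threading


def execute_job(kill_switch, lock, work):
    """Hypothetical job wrapper showing cleanup on every exit path."""
    # Stand-in for a runtime-generated file that holds secrets.
    fd, secret_path = tempfile.mkstemp(prefix="secret-")
    os.close(fd)
    lock.acquire()
    try:
        if kill_switch.is_set():
            # Early exit: the finally block below still runs.
            return "killed", secret_path
        work()
        return "done", secret_path
    finally:
        lock.release()
        os.remove(secret_path)


if __name__ == "__main__":
    switch = threading.Event()
    switch.set()
    status, path = execute_job(switch, threading.Lock(), lambda: None)
    print(status, os.path.exists(path))
```

Whether every step in JobBase._execute() is safe to run after the kill switch fires is a separate question that this sketch does not answer.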

cwl-bot commented 8 months ago

This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868/5

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 57.25191% with 56 lines in your changes missing coverage. Please review.

Project coverage is 77.06%. Comparing base (73b742f) to head (105fee9).

Files	Patch %	Lines
cwltool/job.py	53.65%	29 Missing and 9 partials :warning:
cwltool/task_queue.py	41.66%	5 Missing and 2 partials :warning:
cwltool/executors.py	37.50%	4 Missing and 1 partial :warning:
cwltool/errors.py	57.14%	3 Missing :warning:
cwltool/workflow_job.py	84.61%	0 Missing and 2 partials :warning:
cwltool/workflow.py	83.33%	0 Missing and 1 partial :warning:

:exclamation: There is a different number of reports uploaded between BASE (73b742f) and HEAD (105fee9).

HEAD has 5 uploads less than BASE

| Flag | BASE (73b742f) | HEAD (105fee9) |
|------|------|------|
||17|12|

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1974      +/-   ##
==========================================
- Coverage   83.81%   77.06%   -6.76%
==========================================
  Files          46       46
  Lines        8262     8333      +71
  Branches     2199     2120      -79
==========================================
- Hits         6925     6422     -503
- Misses        856     1350     +494
- Partials      481      561      +80
```


common-workflow-language / cwltool