argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.83k stars 3.17k forks source link

Separate Continuation after Task Failure and Success Determination #12530

Open ryancurrah opened 7 months ago

ryancurrah commented 7 months ago

Description

I've encountered a situation in Argo Workflows where, when using a Directed Acyclic Graph (DAG) with FailFast set to true, there's a need to allow a workflow to continue even after a task fails, but still have the workflow recognize the failure at the end. Currently, by using depends: "B.Succeeded || B.Failed" in a task, we can ensure that a subsequent task (like task C) will run even if task B fails. However, this approach also leads to the workflow being marked as successful, despite the failure of task B (assuming no other tasks fail).

This behavior is not explicitly documented in the enhanced depends logic documentation, and we need to distinguish between continuing a workflow after a task failure and the final success/failure determination of the workflow itself.

Related Slack conversation.

Suggested Enhancement

I propose an enhancement where there's a clear separation between allowing a workflow to continue after a task failure and determining the overall success or failure of the workflow. This could potentially be an additional attribute or a modification in the depends logic.

Use Case

This enhancement would be beneficial in scenarios where certain tasks are critical but not blockers for the continuation of the workflow. Teams would have the flexibility to allow the workflow to proceed with subsequent tasks, while still being able to flag the workflow as failed if any critical task doesn't succeed.

Expected Behavior

With this enhancement, users should be able to:


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.

agilgur5 commented 6 months ago

This could also apply to hooks (and exit handlers), which have a bit of a similar confusion among them.

agilgur5 commented 6 months ago

Perhaps this could be something like a ignoreForFinalStatus field (or a simpler name). Defaults to false for backward-compat, but if you set it to true, the node's exit code / status has no impact on the overall Workflow status.

ryancurrah commented 5 months ago

That seems like an acceptable solution if we change the depends logic to no longer affect the outcome of workflow results.

  1. Adjusting depends Logic: The depends logic would continue to dictate the execution order of tasks based on their success or failure but would not directly influence the final workflow status. This change ensures that workflow execution is still flexible, allowing subsequent tasks to run even if previous tasks fail, based on the defined dependencies.
  2. Introducing ignoreFailure Field: A new field (e.g., ignoreFailure) would be added to each task definition. This field allows workflow designers to explicitly mark certain tasks as non-critical for the overall workflow success. If ignoreFailure is set to true for a task, its failure will not prevent the workflow from being marked as successful, provided all other tasks deemed critical are successful.
tasks:
  - name: optional-setup
    template: setup-template
    ignoreFailure: true  # This task's failure won't affect the workflow's final status

  - name: critical-analysis
    template: analysis-template
    ignoreFailure: false  # This task's failure will cause the workflow to fail

  - name: cleanup
    template: cleanup-template
    depends: "optional-setup.Succeeded || optional-setup.Failed"
    ignoreFailure: true  # Cleanup runs regardless of previous task success but also doesn't affect final status

I think this would be a breaking change though if people expected depends to affect the final results of a workflow.

ryancurrah commented 5 months ago

Also for others here is a workaround:

Provide the task step that has the depends logic depends: "B.Succeeded || B.Failed" with parameters of the previous task step B status and if it failed return a non-zero exit code.

- name: reportquality
  arguments:
    parameters:
      - name: b_status
        value: "{{tasks.B.status}}"
[ "{{inputs.parameters.b_status}}" = "Failed" ] && { echo "B task step failed. exiting with non-zero status."; exit 1; } || true
agilgur5 commented 5 months ago

Wait a minute, isn't this already possible with the continueOn field?

ryancurrah commented 5 months ago

Wait a minute, isn't this already possible with the continueOn field?

My understanding is that it doesn't work with depends and I tried continueOn without depends and using dependencies and it still did not work.

agilgur5 commented 5 months ago

However, this approach also leads to the workflow being marked as successful, despite the failure of task B (assuming no other tasks fail).

Oh right, you're trying to do the opposite of continueOn.

I think this would be a breaking change though if people expected depends to affect the final results of a workflow.

That makes more sense why it would need a breaking change. This is unlikely to happen though, and I'm not sure that that behavior is more intuitive; arguably it is less intuitive.

A non-breaking way to do this would be to introduce a new top-level Workflow flag, similar to failFast, that affects the processing of the whole DAG. But I don't think we've ironed out how to do this intuitively yet.

My understanding is that it doesn't work with depends and I tried continueOn without depends and using dependencies and it still did not work.

You might be able to do this by adding steps inside of your dag.

But I think this still wouldn't achieve your goal, since you need the opposite of continueOn. Something like "continueOn but still fail". Or a custom ignoreStatus field that lets you choose which statuses to ignore (then you could ignore Succeeded as well). This is getting pretty complicated 😅

ryancurrah commented 5 months ago

Yeah it seems like a simple ask at first. This is coming from a need to upload reports with the results of previous tasks (If they failed or not) and fail the entire workflow if the reports contain issues.