argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.08k stars, 3.2k forks

Retrying specific failed node does not work #12543

Open mio4kon opened 10 months ago

mio4kon commented 10 months ago

Suggested Enhancement

Allow users to selectively retry specific failed nodes instead of retrying all failed nodes at once.

Use Cases

I'm using Argo Workflows, and at times I would like the option to retry a specific failed node instead of retrying all failed nodes (similar to GitLab CI's capability). Even if the overall pipeline still ends up failing, there are specific tasks that I'd prefer to retry without consuming excessive resources retrying other nodes that may inevitably fail. I believe providing users with this level of flexibility is important.

For example, in the following pipeline, I might prefer to only rerun the failed nodes of BB, rather than retrying both B and C nodes.

[image: pipeline graph]

--- updates by agilgur5 below---

jswxstw commented 10 months ago

I would also like the argo-cli to support a Skip operation, which would support skipping specific failed nodes and modifying their outputs.

agilgur5 commented 10 months ago

Retrying specific nodes is already possible; see #12005 for an example.

agilgur5 commented 10 months ago

> I would also like the argo-cli to support a Skip operation, which would support skipping specific failed nodes and modifying their outputs.

Please keep feature requests on-topic, with one per issue.

The behavior you're asking for would not be supported, though: nodes are only considered skipped if they have a conditional that skips them. As far as I know, a "skip" operation doesn't exist in other DAG orchestrators either.

mio4kon commented 10 months ago

> Retrying specific nodes is already possible; see #12005 for an example.

There is a slight difference compared to what was mentioned earlier.

In #12005, retrying specific nodes only supports specifying successful nodes; it is not possible to specify failed nodes. --node-field-selector and --restart-successful must be used together.

mio4kon commented 10 months ago

>> I would also like the argo-cli to support a Skip operation, which would support skipping specific failed nodes and modifying their outputs.

> Please keep feature requests on-topic, with one per issue.

> The behavior you're asking for would not be supported, though: nodes are only considered skipped if they have a conditional that skips them. As far as I know, a "skip" operation doesn't exist in other DAG orchestrators either.

@jswxstw The issue can be partially resolved by using [ A || A.failed ], but it is overly automated. @agilgur5 In some scenarios, after a failure, manual confirmation may be needed to proceed with the subsequent steps. However, currently, it is unclear how to implement this workflow.

jswxstw commented 10 months ago

You have set continueOn Failed on the B and C nodes, so you do not want to retry these steps when retrying manually. Am I understanding correctly?

mio4kon commented 10 months ago

> You have set continueOn Failed on the B and C nodes, so you do not want to retry these steps when retrying manually. Am I understanding correctly?

No, my reply was not about the issue I raised. Instead, it was about the suggestion you mentioned to add a button for skipping failures. I believe your suggestion would also be very useful for me, as we encounter similar scenarios on our end.

agilgur5 commented 10 months ago

> In #12005, retrying specific nodes only supports specifying successful nodes; it is not possible to specify failed nodes.

The default behavior of retry is to only retry failed nodes.

> --node-field-selector and --restart-successful must be used together.

If you need to use them together, you can.

> There is a slight difference compared to what was mentioned earlier.

Sorry, it's not clear to me what the difference is. Retrying specific nodes is possible (whether succeeded or failed). Is there something else missing that you would like?

agilgur5 commented 10 months ago

> @jswxstw The issue can be partially resolved by using [ A || A.failed ], but it is overly automated.

Yes, this would be one way of implementing it.
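For readers unfamiliar with the pattern, here is a minimal sketch of a DAG task that proceeds whether its upstream task succeeded or failed, assuming the `depends` field is used for the `A || A.Failed` expression. The workflow, task names, template names, and image are hypothetical and are not taken from the workflow discussed in this issue.

```yaml
# Hypothetical example; all names and the image are illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: continue-after-failure-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: A
            template: may-fail
          - name: B
            # Scheduled once A has finished, whether A succeeded or failed.
            depends: "A || A.Failed"
            template: echo
    - name: may-fail
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["exit 1"]   # always fails, to exercise the A.Failed branch
    - name: echo
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo continuing after A"]
```

As noted in the quoted comment, this continues automatically; it does not wait for anyone to confirm.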

> about the suggestion you mentioned to add a button for skipping failures

Skipping a step is a semantically different Workflow. The currently available operations do not change the intent or conditionals of the Workflow, and they should not. Operations only modify the Workflow's status; they do not modify the Workflow spec itself, and should not do so. So a "Skip" operation as proposed violates semantic intent and would not be accepted as a feature as such.

> @agilgur5 In some scenarios, after a failure, manual confirmation may be needed to proceed with the subsequent steps. However, currently, it is unclear how to implement this workflow.

That sounds like a 3rd, separate question from the other two; I'm not sure how this is related to "skipping". There are really too many topics here...

You can use a suspend template to require manual confirmation, so that too is already possible.
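As a rough illustration of that suggestion, the sketch below places a suspend template between a task that may fail and the task that follows it, so the workflow pauses until someone resumes it. All names and the image are hypothetical, and combining the suspend step with `depends: "A || A.Failed"` is an assumption about how the scenario might be modeled, not something specified in this thread.

```yaml
# Hypothetical example; all names and the image are illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: manual-confirmation-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: A
            template: may-fail
          - name: confirm
            # Reached whether A succeeded or failed; pauses the workflow here.
            depends: "A || A.Failed"
            template: wait-for-confirmation
          - name: B
            # Runs only after the suspended node has been resumed.
            depends: "confirm"
            template: echo
    - name: wait-for-confirmation
      # No duration is set, so the node stays suspended until resumed manually.
      suspend: {}
    - name: may-fail
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["exit 1"]
    - name: echo
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo confirmed, continuing"]
```

A workflow paused this way can be continued from the UI or with `argo resume <workflow-name>`.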

mio4kon commented 10 months ago

>> In #12005, retrying specific nodes only supports specifying successful nodes; it is not possible to specify failed nodes.

> The default behavior of retry is to only retry failed nodes.

>> --node-field-selector and --restart-successful must be used together.

> If you need to use them together, you can.

>> There is a slight difference compared to what was mentioned earlier.

> Sorry, it's not clear to me what the difference is. Retrying specific nodes is possible (whether succeeded or failed). Is there something else missing that you would like?

[image: pipeline graph]

The pipeline in the image above does not currently allow for selective retries of specific nodes, such as the BB node. Using --node-field-selector alone does not actually take effect; it still retries all failed nodes.

@agilgur5 Please help me review this PR 🙏 https://github.com/argoproj/argo-workflows/pull/12553

agilgur5 commented 10 months ago

> Using --node-field-selector alone does not actually take effect; it still retries all failed nodes.

Yeah, that sounds like a bug. AFAIK, --node-field-selector is supposed to be usable without --restart-successful. Strange that it hasn't been noticed earlier though; I wonder if there was a regression 🤔

This issue was filed as a feature request though; a reproducible Workflow and set of commands / instructions would be helpful to test with.
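As a hypothetical starting point for such a reproduction (not the reporter's actual workflow, which is only shown in the images), a minimal DAG with two independent failing tasks might look like the sketch below. The task names, template name, and image are assumptions, as is the use of `displayName` as the selector field in the commented command.

```yaml
# Hypothetical reproduction sketch; names and image are illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-single-node-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          # B and C have no dependencies, so both start and both fail.
          - name: B
            template: always-fail
          - name: C
            template: always-fail
    - name: always-fail
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["exit 1"]

# After the workflow has failed, attempt to retry only B (assumed selector):
#   argo retry <workflow-name> --node-field-selector displayName=B
# Behavior reported in this issue: all failed nodes are retried, not just B.
```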

> @agilgur5 Please help me review this PR 🙏 #12553

I'll take a look, thanks for checking the code.

mio4kon commented 9 months ago

>> @agilgur5 Please help me review this PR 🙏 #12553

> I'll take a look, thanks for checking the code.

Hi, is there still a problem with the PR corresponding to this issue? Can you reopen this issue? I think it's still a bit of a hassle not to be able to retry a single failed node 😭

agilgur5 commented 9 months ago

> Hi, is there still a problem with the PR corresponding to this issue?

Yes. You can use the "request a review" function on GitHub when you've made iterations.

Also, please do not expect immediate responses from open source maintainers, who are largely volunteers.

> Can you reopen this issue? I think it's still a bit of a hassle not to be able to retry a single failed node 😭

Sure, but as I wrote above, the issue is written as a feature request, not a reproducible bug report, which is confusing and missing information as a result.

mio4kon commented 8 months ago

Hi, is there any follow-up plan for fixing this issue? @agilgur5

isubasinghe commented 1 month ago

Retry rewrite here: https://github.com/argoproj/argo-workflows/pull/13734