argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.12k stars 3.21k forks source link

Deprecate retrying of failed nodes by default. #13692

Open isubasinghe opened 1 month ago

isubasinghe commented 1 month ago

Summary

Currently a user might have a continueOn: failed or depends: $TASKNAME.Failed and they might consider this node to have succeeded from their point of view (In the sense that the failure is no big deal or that it was expected). Currently the retry logic attempts to reset these nodes, which is an introduction of policy that is forced upon the user.

We should in the very least be able to opt out of this behaviour.

Use Cases

This allows one to precisely retry a single node without worrying about a failed node also retrying.

agilgur5 commented 1 month ago

Deprecate retrying of failed nodes by default.

Currently a user might have a continueOn: failed or depends: $TASKNAME.Failed and they might consider this node to have succeeded

A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?

The "all" scenario raises the question of "then what should the default behavior of retry be?"

This allows one to precisely retry a single node without worrying about a failed node also retrying.

This is already possible with --node-field-selector

isubasinghe commented 1 month ago

A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?

The current retry code attempts to retry all failed nodes. This happens even when you specify with node-field-selector a completely different node.

agilgur5 commented 1 month ago

--node-field-selector not working correctly is a bug, not a feature 😅

If that's the main thing you were referencing, this sounds like a duplicate of https://github.com/argoproj/argo-workflows/issues/12543. You actually reviewed the PR for it there back in Feb when I asked someone with more knowledge of the retry logic to take a look, particularly because of potential breakage

isubasinghe commented 1 month ago

Ah yes I remember this issue now, I don't think the existing behaviour is a bug by looking at the code anyway. It seems very intentional that the author intended the failed nodes to be retried.

I rewrote the retry logic from scratch (because it had the wrong approach) and kept this behaviour.

I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.

I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.

agilgur5 commented 1 month ago

I rewrote the retry logic from scratch (because it had the wrong approach)

The author illuminated this to me in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1493314678 too and asked about a refactor in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1496958246.

Well that and the other bugs with it made me take a look and get really confused by the approach too

It seems very intentional that the author intended the failed nodes to be retried.

It seemed like --node-field-selector was only intended to be used with --restart-successful per your read, the diagrams linked above, and the original PR: https://github.com/argoproj/argo-workflows/pull/2431

I agree with other folks that the current behavior feels pretty confusing though.

I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.

I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.

Yea that's more or less what I suggested in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1495461222 but minus a deprecation; i.e. a breaking fix in the same spirit as https://github.com/argoproj/argo-workflows/pull/11005

agilgur5 commented 1 month ago

I would say we should close this out as duplicate of #12543 and make a breaking fix to --node-field-selector if that's the main thing you were referencing (you didn't mention that in the issue description, so I'm not sure)

isubasinghe commented 1 month ago

I disagree with a breaking fix here, we should do a deprecation.

I am sure there are some users that have now grown used to the retry button as a way to restart all failed nodes.

agilgur5 commented 1 month ago

To be clear, the breaking fix I'm advocating for would only impact --node-field-selector, so it would be quite similar to #11005 in that respect as well: in both cases, the breakage is only when a certain option is used, making that option's behavior "saner"