Open isubasinghe opened 1 month ago
Deprecate retrying of failed nodes by default.
Currently a user might have a
continueOn: failed
ordepends: $TASKNAME.Failed
and they might consider this node to have succeeded
A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?
The "all" scenario raises the question of "then what should the default behavior of retry be?"
This allows one to precisely retry a single node without worrying about a failed node also retrying.
This is already possible with --node-field-selector
A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?
The current retry code attempts to retry all failed nodes. This happens even when you specify with node-field-selector a completely different node.
--node-field-selector
not working correctly is a bug, not a feature 😅
If that's the main thing you were referencing, this sounds like a duplicate of https://github.com/argoproj/argo-workflows/issues/12543. You actually reviewed the PR for it there back in Feb when I asked someone with more knowledge of the retry logic to take a look, particularly because of potential breakage
Ah yes I remember this issue now, I don't think the existing behaviour is a bug by looking at the code anyway. It seems very intentional that the author intended the failed nodes to be retried.
I rewrote the retry logic from scratch (because it had the wrong approach) and kept this behaviour.
I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.
I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.
I rewrote the retry logic from scratch (because it had the wrong approach)
The author illuminated this to me in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1493314678 too and asked about a refactor in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1496958246.
Well that and the other bugs with it made me take a look and get really confused by the approach too
It seems very intentional that the author intended the failed nodes to be retried.
It seemed like --node-field-selector
was only intended to be used with --restart-successful
per your read, the diagrams linked above, and the original PR: https://github.com/argoproj/argo-workflows/pull/2431
I agree with other folks that the current behavior feels pretty confusing though.
I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.
I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.
Yea that's more or less what I suggested in https://github.com/argoproj/argo-workflows/pull/12553#discussion_r1495461222 but minus a deprecation; i.e. a breaking fix in the same spirit as https://github.com/argoproj/argo-workflows/pull/11005
I would say we should close this out as duplicate of #12543 and make a breaking fix to --node-field-selector
if that's the main thing you were referencing (you didn't mention that in the issue description, so I'm not sure)
I disagree with a breaking fix here, we should do a deprecation.
I am sure there are some users that have now grown used to the retry button as a way to restart all failed nodes.
To be clear, the breaking fix I'm advocating for would only impact --node-field-selector
, so it would be quite similar to #11005 in that respect as well: in both cases, the breakage is only when a certain option is used, making that option's behavior "saner"
Summary
Currently a user might have a
continueOn: failed
ordepends: $TASKNAME.Failed
and they might consider this node to have succeeded from their point of view (In the sense that the failure is no big deal or that it was expected). Currently the retry logic attempts to reset these nodes, which is an introduction of policy that is forced upon the user.We should in the very least be able to opt out of this behaviour.
Use Cases
This allows one to precisely retry a single node without worrying about a failed node also retrying.