Open starwarsfan opened 3 weeks ago
This would be a fairly radical departure from how manual retries are implemented. Perhaps one way to satisfy the need would be to archive the failed workflow before the retry, but that is not simple either, as the key is the name of the workflow, which doesn't change during a manual retry.
See also https://github.com/argoproj/argo-workflows/issues/12324#issuecomment-2364753512 / https://github.com/argoproj/argo-workflows/pull/9141#issuecomment-2077864002.
The current way manual retries are implemented is itself different from the rest of the workflow operations, which try to record things on the Workflow resource more immutably / append-only (neither of those terms is quite accurate, but conceptually similar). Perhaps it should be done more like a resubmit that retains partial state.
For example, manual re-runs on GitHub Actions work similarly. A re-run creates a new state without deleting the old state, and the new state omits whatever is to be re-run.
Logs and pods (including labels) etc. would be partially linked between the two, which is a bit confusing and may create some race conditions, but is possibly solvable.
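To make the "new state without deleting the old state" idea concrete, here is a minimal sketch in Python. It is not how Argo Workflows or GitHub Actions store anything; `Attempt`, `RunHistory`, and the phase strings are hypothetical names chosen for illustration. A retry appends a new attempt that carries over only the succeeded nodes, and the failed attempt stays untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One run attempt (hypothetical model, not an Argo type)."""
    number: int
    nodes: dict[str, str]  # node name -> phase, e.g. "Succeeded" / "Failed"


@dataclass
class RunHistory:
    """Append-only list of attempts; older attempts are never mutated."""
    attempts: list[Attempt] = field(default_factory=list)

    def retry(self) -> Attempt:
        previous = self.attempts[-1]
        # Carry over only what succeeded; everything else is left out of
        # the new attempt so it can be re-run.
        carried = {
            name: phase
            for name, phase in previous.nodes.items()
            if phase == "Succeeded"
        }
        new_attempt = Attempt(number=previous.number + 1, nodes=carried)
        self.attempts.append(new_attempt)
        return new_attempt
```

With this shape, the failed attempt remains available for analysis after the retry, at the cost of having to decide how logs and pods are shared between attempts.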
Agreed. But copying and restarting the failed workflow is too tough: it will lose the state. I think we need to find a way to preserve the failed nodes/steps and start the new nodes/steps, for example by injecting a retry flag or something similar.
I believe another uid row is stored in public.argo_archived_workflows, or is that just for resubmit?
Agreed. But copying and restarting the failed workflow is too tough: it will lose the state.
Correct me if I'm wrong, but I would think you could just copy the whole completed Workflow (note that a workflow must be completed before you can retry it), give it a new uid + name, then run the existing retry logic on the new copy.
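A rough sketch of the "copy, new uid + name" step, written in Python against a plain manifest dict rather than the controller's Go types. `clone_for_retry` is a hypothetical helper, not an existing Argo function. It assumes that dropping the server-assigned `metadata` fields lets the API server issue a fresh uid on create, and it deliberately leaves `status` in place so that retry logic would have something to work from.

```python
import copy


def clone_for_retry(workflow: dict, new_name: str) -> dict:
    """Return a deep copy of a completed Workflow manifest under a new name.

    Hypothetical helper for illustration only.
    """
    clone = copy.deepcopy(workflow)
    metadata = clone.setdefault("metadata", {})
    metadata["name"] = new_name
    # Drop server-assigned fields so a new uid is generated on create.
    for key in ("uid", "resourceVersion", "creationTimestamp"):
        metadata.pop(key, None)
    # status is kept as-is; anything inside it that embeds the old
    # workflow name is NOT rewritten here.
    return clone
```

The open question is exactly the part this sketch skips: whatever inside `status` is derived from the old workflow name would still point at the original.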
We need to validate it. But to my knowledge all node ID/pod/artifact/param names are tied to the workflow name. We need to test with the controller how it behaves if the workflow name changes while the status is kept.
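To illustrate why a rename might matter, here is a small Python sketch. It assumes, without having verified it against the controller source, that a node ID is the workflow name plus a 32-bit FNV-1a hash of the node name, and that node names are themselves prefixed with the workflow name. `node_id` and the example names are illustrative only.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return h


def node_id(workflow_name: str, node_name: str) -> str:
    """Assumed ID scheme: '<workflow name>-<fnv1a32(node name)>'.

    The root node is assumed to reuse the workflow name as its ID.
    """
    if node_name == workflow_name:
        return workflow_name
    return f"{workflow_name}-{fnv1a_32(node_name.encode())}"


# Same logical step under two workflow names (hypothetical names).
old_id = node_id("wf-a", "wf-a.build")
new_id = node_id("wf-b", "wf-b.build")
```

If the assumption holds, `old_id` and `new_id` differ, so a copied status would contain IDs that the controller could not match to nodes of the renamed workflow without a rewrite step.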
I believe another uid row is stored in public.argo_archived_workflows, or is that just for resubmit?
Resubmit will start the whole workflow from the beginning.
Summary
If a workflow failed and then finished successfully after a retry, the failed workflow should remain available for analysis rather than simply being replaced by the successful one.
Use Cases
Imagine a complex DAG workflow with dozens of steps. At some point within the DAG, the workflow fails and you're under time pressure. So you just use the "Retry" button, one of the cool ArgoWF features. But afterwards it's impossible to hand the issue over to a developer to analyze the error, because the failed workflow is no longer available within the UI.
So it should be possible to keep all workflows available for further investigation, especially the failed ones.
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.