argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Failed workflow should be available for analysis even after successful retry #13839

Open · starwarsfan opened this issue 3 weeks ago

starwarsfan commented 3 weeks ago

Summary

If a workflow failed and then finished successfully after a retry, the failed workflow should remain available for analysis rather than simply being replaced by the successful one.

Use Cases

Imagine a complex DAG workflow with dozens of steps. At some point within the DAG, the workflow fails and you're under time pressure. So you just use the "Retry" button, one of the cool ArgoWF features. But afterwards it's impossible to hand the issue over to a developer for analysis, because the failed workflow is no longer available in the UI.

So all workflows should remain available for further investigation, especially failed ones.


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

Joibel commented 3 weeks ago

This would be a fairly radical departure from how manual retries are implemented. Perhaps a way to satisfy this need would be to archive the failed workflow before the retry, but that is not simple either, because the key is the workflow's name, which doesn't change during a manual retry.

agilgur5 commented 3 weeks ago

See also https://github.com/argoproj/argo-workflows/issues/12324#issuecomment-2364753512 / https://github.com/argoproj/argo-workflows/pull/9141#issuecomment-2077864002.

The current way manual retries are implemented is itself different from the rest of workflow operations, which try to record things on the Workflow resource more immutably / append-only (neither of those terms is quite accurate, but conceptually similar). Perhaps it should be done more like a resubmit that retains partial state.

For example, manual re-runs on GitHub Actions work similarly: a new state is created without deleting the old one, and only the state that is to be re-run is removed.

Logs and pods (including labels) etc. would be partially linked between the two, which is a bit confusing and may create some race conditions, but it is possibly solvable.

sarabala1979 commented 3 weeks ago

Agreed. But it is quite difficult to copy and restart the failed workflow; it would lose the state. I think we need to find a way to preserve the failed nodes/steps and start the new nodes/steps, e.g. by injecting a retry flag or something.

tooptoop4 commented 3 weeks ago

I believe another uid row is stored in public.argo_archived_workflows, or is that just for resubmit?

agilgur5 commented 3 weeks ago

Agreed. But it is quite difficult to copy and restart the failed workflow; it would lose the state.

Correct me if I'm wrong, but I would think you could just copy the whole completed Workflow (note that a workflow must be completed before you can retry it), give it a new uid + name, then run the existing retry logic on the new copy.
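For illustration, a rough sketch of that idea, assuming the generated Go clientset from this repo; the helper name retryAsCopy and the "-retry-N" naming scheme are made up, and the existing retry logic is only referenced in a comment because its exact signature varies between versions:

```go
package retrysketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// retryAsCopy (hypothetical) keeps the failed Workflow untouched and performs the
// retry on a fresh copy with a new name, so the failure stays visible in the UI.
func retryAsCopy(ctx context.Context, c wfclientset.Interface, failed *wfv1.Workflow, attempt int) (*wfv1.Workflow, error) {
	clone := failed.DeepCopy()

	// Give the copy a new identity: a new name and no server-assigned fields,
	// so the API server creates it as a separate object with its own uid.
	clone.ObjectMeta.Name = fmt.Sprintf("%s-retry-%d", failed.Name, attempt) // made-up naming scheme
	clone.ObjectMeta.UID = ""
	clone.ObjectMeta.ResourceVersion = ""
	clone.ObjectMeta.CreationTimestamp = metav1.Time{}

	// The existing manual-retry logic (workflow/util's FormulateRetryWorkflow, which
	// resets failed nodes and keeps succeeded ones) would then run against `clone`
	// instead of the original; the call is omitted here because its signature
	// differs between versions. Note that the node IDs in clone.Status.Nodes still
	// embed the old workflow name, which is the open question discussed below.

	return c.ArgoprojV1alpha1().Workflows(failed.Namespace).Create(ctx, clone, metav1.CreateOptions{})
}
```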

sarabala1979 commented 3 weeks ago

We need to validate it. But to my knowledge, every node ID, pod name, artifact, and parameter name is tied to the workflow name. We need to test how the controller behaves if the workflow name changes while the status is carried over.
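As a small, hypothetical illustration of that coupling (the function and variable names are placeholders): the node IDs and node names recorded in Status.Nodes are built from the original workflow name, so a renamed copy would still reference the old name, and pod names and default artifact keys derive from those in turn.

```go
package retrysketch

import (
	"fmt"
	"strings"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// reportStaleNodeNames (hypothetical) prints every node of a renamed copy whose
// ID or name is still built from the original workflow name, illustrating how
// much of the recorded state is coupled to that name.
func reportStaleNodeNames(clone *wfv1.Workflow, originalName string) {
	for id, node := range clone.Status.Nodes {
		if strings.HasPrefix(id, originalName) || strings.HasPrefix(node.Name, originalName) {
			fmt.Printf("node %q (template %q) is still keyed to %q\n", id, node.TemplateName, originalName)
		}
	}
}
```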

sarabala1979 commented 3 weeks ago

I believe another uid row is stored in public.argo_archived_workflows, or is that just for resubmit?

Resubmit will start the whole workflow from the beginning.