argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Retry a single stuck node in Running Workflow #13749

Closed shuangkun closed 1 month ago

shuangkun commented 1 month ago

Summary

Sometimes a node in the workflow gets stuck (perhaps because the init container failed to execute successfully, due to a disk problem or some other reason), which causes the workflow to keep running without making progress. At that point, recreating the pod lets the workflow succeed. However, my workflow is integrated into an automation system, and other systems monitor this workflow. I don't want to retry the workflow after it fails; I want to restart this single task directly while the workflow is still running.

Use Cases

Some special situations cause a node to get stuck, and in those cases we need to retry just that node.


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

Joibel commented 1 month ago

I don't think we should add support for retrying running workflows due to complexity. We don't currently support it.

We currently support two retry mechanisms. I'll use the term "manual retry" for the action invoked on the argo server via the UI, the argo CLI, or a REST call. This requires a completed workflow (the docs for the CLI only mention a failed workflow, but it works for any completed workflow). One workaround is therefore to stop and then retry the workflow; this achieves the stated goal but is slower.
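For reference, a minimal sketch of that stop-then-retry workaround using the existing CLI (the workflow and node names here are placeholders):

```
# Stop the running workflow; it completes once any exit handlers have run.
argo stop my-workflow

# After the workflow has completed, retry it. The optional --node-field-selector
# flag restricts which nodes are reset; check `argo retry --help` on your
# version for the exact selectors it supports.
argo retry my-workflow --node-field-selector displayName=stuck-step
```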

We also support automated retry via the retryStrategy mechanism, which I would argue is the correct mechanism in your case. I'm unsure why it isn't suitable, apart from being something you need to plan for rather than react to. I would rather improve automated retry until it works for you than add a new (and necessarily complex, and therefore bug-ridden) mechanism to the code.
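As an example, here is a minimal sketch of the retryStrategy approach on a plain container template (the names, limit, and backoff values are illustrative, not a recommendation):

```
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"
        retryPolicy: OnError   # retries controller errors and init/wait container failures
        backoff:
          duration: "1m"
          factor: "2"
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello"]
EOF
```

With something like this in place, a node that errors out (e.g. because its init container failed) is retried automatically, without anyone having to touch the running workflow.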

What you are proposing is driving the manual retry from automation. Since you already have automation in place that has (or you're proposing it have) sufficient privileges to rewrite a workflow, you could have that automation perform the necessary manipulations of the workflow object (and the other manipulations that happen during a manual retry, such as pod deletion), effectively acting as the controller in this scenario.
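Purely as an illustration of the kind of manipulation that automation would have to do (this is not a supported flow; the pod name and node ID below are placeholders):

```
# Delete the stuck pod (a manual retry also deletes the pods of the nodes it resets).
kubectl delete pod my-workflow-stuck-step-1234567890

# Remove the stuck node's entry from the workflow status so the controller
# rebuilds it on the next reconciliation. A real manual retry also fixes up
# parent DAG/steps nodes, outputs, and the overall workflow phase, which is
# where the complexity (and the race conditions mentioned later in this
# thread) comes from.
kubectl patch workflow my-workflow --type=json \
  -p='[{"op": "remove", "path": "/status/nodes/my-workflow-1234567890"}]'
```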

I believe retries are already complex enough and have enough bugs that introducing another mechanism when there are workarounds would be the wrong thing to do.

isubasinghe commented 1 month ago

Yeah, I agree with @Joibel here: the retry logic is complex and, frankly, hard to get right without effectively re-implementing a large chunk of the execute${TEMPLATE_TYPE} functions.

I strongly recommend not adding any more features to the retry logic.

shuangkun commented 1 month ago

My current idea is to suspend first, then retry a specific node (clean up the node status), and then resume (rebuild the node).

Because my workflow is sometimes referenced by other teams or automation systems (such as Alibaba Cloud Cloudflow or AWS Step Functions), I don't want this workflow to stop or fail, as that would cause problems for my entire Cloudflow.

Some scenarios require manual intervention, which may be better.

Anyway, this is a discussion to see if anyone is interested.

jswxstw commented 1 month ago

Sometimes a node in the workflow gets stuck (perhaps because the init container failed to execute successfully, due to a disk problem or some other reason), which causes the workflow to keep running without making progress. At that point, recreating the pod lets the workflow succeed.

Do you have a specific example? If the init container does not execute successfully, shouldn't the node be in a failed state?

agilgur5 commented 1 month ago

Do you have a specific example? If the init container does not execute successfully, shouldn't the node be in a failed state?

Yea this sounds like a bug, not a feature, same as I wrote in https://github.com/argoproj/argo-workflows/issues/13579#issuecomment-2339355171. The underlying bug should be fixed rather than adding a very hacky workaround (per Alan's comment above) that would create significant race conditions for the Controller (as I wrote in the linked comment).

I think with 3 approvers against this, we can solidly close this out as "not planned".

shuangkun commented 1 month ago

Sometimes a node in the workflow gets stuck (perhaps because the init container failed to execute successfully, due to a disk problem or some other reason), which causes the workflow to keep running without making progress. At that point, recreating the pod lets the workflow succeed.

Do you have a specific example? If the init container does not execute successfully, shouldn't the node be in a failed state?

I really don't. It's a problem one of our users ran into; we have encountered it three times this year, but I haven't been able to find out why it gets stuck. All I can tell is that it hangs during the init container's log download phase. It only happens in production; I haven't reproduced it in the test environment.