Closed: sarialalem1 closed this 19 hours ago
I tested this purely on 3.5.6, and it fails if you attempt to retry this deliberately broken DAG diamond:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]
```
A link to the slack discussion: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1714641906410049
Another piece of info: after rolling back to 3.5.5, retrying runs the workflow properly, but at the end it gets stuck without changing status to Finished.
For me, the workflow above that reproduces the issue on 3.5.6 doesn't reproduce it on 3.5.5. It may be that retrying workflows which have been touched by 3.5.6 is part of the problem, so recreate them fresh instead.
Ignore that last comment, it doesn't go wrong for me in a really basic workflows installation at all. 3.5.6 will retry happily there. I'll try and determine what the difference is with our production 3.5.6 and why it only fails there.
Our production has a workflowDefaults which enables retryStrategy for everything.

The following reproduces the error - note the retryStrategy at the top-level template. Remove that or place retries elsewhere and it will work.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]
```
This works correctly with 3.5.5.
This was broken by #12817.
I feel like this has got to be related to the root cause I mentioned in https://github.com/argoproj/argo-workflows/pull/12817#pullrequestreview-1995821336, although the PR itself did not touch (automated) retry nodes. The manual retry logic needs a refactor in general.
We should also add all these failing test cases.
I do think the retry node needs to be skipped when checking whether the descendants have success nodes, since it is virtual.
> The following reproduces the error - note the retryStrategy at the top-level template.
To clarify, this will happen even when no retry was needed, correct? Or does it only occur if a retry is triggered? As in, the existence of a retryStrategy (with any configuration) on a template invocator (i.e. DAG or steps) causes it, not whether it was actually retried based on its retryPolicy or expression.
> To clarify, this will happen even when no retry was needed, correct? Or does it only occur if a retry is triggered? As in, the existence of a retryStrategy (with any configuration) on a template invocator (i.e. DAG or steps) causes it, not whether it was actually retried based on its retryPolicy or expression.
This requires both a retryStrategy and a manual retry attempt, but the retryStrategy does not need to have been used; we just need the retry virtual node to be present. I don't believe the actual retryStrategy matters at all.
Pre-requisites

- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened/what did you expect to happen?

Retried some of the workflows. Result: reproducible on any workflow.

Version

v3.5.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container