argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Workflows that failed before upgrade to 3.5.6 fail to retry #13003

Closed: sarialalem1 closed this issue 19 hours ago

sarialalem1 commented 2 weeks ago

Pre-requisites

What happened/what did you expect to happen?

  1. We were on 3.5.5
  2. Some workflows failed
  3. Upgraded to 3.5.6
  4. Retried some of the workflows

Result:

    • The failed tasks within the workflows disappeared, keeping only the successful ones
    • The workflows got stuck in a running state
    • The workflows are still shown in the failed list
    • Unable to retry them again, because they are running
    • Unable to stop them

Reproducible on any workflow

Version

v3.5.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow reproduces it.

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
Joibel commented 2 weeks ago

I tested this purely on 3.5.6, and it fails if you attempt to retry this deliberately broken DAG diamond.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
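  # "eacho" below is a deliberate typo: task D will fail, so the workflow can be retried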
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

A link to the slack discussion: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1714641906410049

sarialalem1 commented 2 weeks ago

Another piece of info: after rolling back to 3.5.5, retrying runs the workflow properly, but at the end it gets stuck without changing status to Finished.

Joibel commented 2 weeks ago

For me, the workflow above that reproduces the issue on 3.5.6 doesn't reproduce it on 3.5.5. It may be that retrying workflows which have been touched by 3.5.6 is part of the problem, so recreate them fresh instead.

Joibel commented 2 weeks ago

Ignore that last comment; it doesn't go wrong for me at all in a really basic Workflows installation. 3.5.6 will retry happily there. I'll try to determine what is different about our production 3.5.6 and why it only fails there.

Joibel commented 2 weeks ago

Our production has a workflowDefaults entry which enables a retryStrategy for everything. The following reproduces the error - note the retryStrategy on the top-level template. Remove that, or place the retries elsewhere, and it will work.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

This works correctly with 3.5.5.
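
For reference, the production setup mentioned above applies a retryStrategy to every template via workflowDefaults in the workflow-controller-configmap. A minimal sketch of such a configmap, assuming templateDefaults is the mechanism used (the exact production values are not known from this thread):

# Sketch only: cluster-wide default retryStrategy applied to all templates.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      templateDefaults:
        retryStrategy:
          limit: 2
          retryPolicy: OnError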

Joibel commented 2 weeks ago

This was broken by #12817.

agilgur5 commented 2 weeks ago

I feel like this has got to be related to the root cause I mentioned in https://github.com/argoproj/argo-workflows/pull/12817#pullrequestreview-1995821336, although the PR itself did not touch (automated) retry nodes. The manual retry logic needs a refactor in general.

We should also add all these failing test cases

shuangkun commented 2 weeks ago

I do think the retry node needs to be skipped when checking whether the descendants have successful nodes, since it is virtual.

agilgur5 commented 2 weeks ago

The following reproduces the error - note the retryStrategy at the top level template.

To clarify, this will happen even when no retry was needed, correct? Or does it only occur if a retry is triggered? As in, the existence of a retryStrategy (with any configuration) on a template invocator (i.e. a DAG or steps template) causes it, not whether it was actually retried based on its retryPolicy or expression.

Joibel commented 2 weeks ago

This requires both a retryStrategy and a manual retry attempt, but the retryStrategy does not need to have been used; we just need the retry virtual node to be present. I don't believe the actual retryStrategy matters at all.
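
As a workaround in the meantime, moving the retryStrategy off the template invocator and onto the leaf templates, as suggested above, should avoid the broken retry path. A minimal sketch of that shape (reduced example; names and values are illustrative, not a confirmed fix):

# Sketch only: retries on the leaf templates, none on the DAG template itself.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-leaf-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    # no retryStrategy here, on the template invocator
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: D
        depends: "A"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    # "eacho" is still a deliberate typo so the workflow fails and can be retried
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]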