Summary

The documented pattern of setting a Workflow-level retryStrategy to tolerate pod deletion has an unexpected side effect. With retryPolicy: OnError, if a step errors and exhausts the top-level limit, the entire workflow runs again. Is this desirable or intended? https://argoproj.github.io/argo-workflows/tolerating-pod-deletion/

I would suggest either adding a warning about this behavior to the documentation, or changing the behavior so that the retry strategy only cascades to child steps and does not retry the entire workflow.

Use Cases

https://argoproj.github.io/argo-workflows/tolerating-pod-deletion/

Reproducing

Argo v2.12.3

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-deletion-
spec:
  entrypoint: start
  retryStrategy:
    limit: 1
    backoff:
      duration: "15"
    retryPolicy: OnError
  templates:
  - name: start
    steps:
    - - name: first-sleep-1
        template: sleeper
      - name: first-sleep-2
        template: failer
      - name: first-sleep-3
        template: sleeper
    - - name: second-sleep-1
        template: sleeper
    - - name: third-sleep-1
        template: sleeper
  - name: sleeper
    container:
      image: alpine:latest
      command: ["sleep", "20"]
  - name: failer
    script:
      image: alpine:latest
      command: ["sh"]
      source: |
        sleep 30 && exit 1

Submit this workflow and delete the pod for first-sleep-2 twice. The entire workflow will run again.
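
For comparison, here is a minimal sketch of the same workflow with the retry strategy moved from the Workflow spec onto the individual pod templates. The assumption is that a template-level retryStrategy applies only to the nodes created from that template, so the `start` steps template is not itself retried. This has not been verified against v2.12.3. The generateName and the shortened step list are illustrative only.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-deletion-scoped-   # hypothetical name for this sketch
spec:
  entrypoint: start
  # No spec-level retryStrategy here, so (by assumption) the `start`
  # steps template does not inherit one and is not re-run as a whole.
  templates:
  - name: start
    steps:
    - - name: first-sleep-1
        template: sleeper
      - name: first-sleep-2
        template: failer
    - - name: second-sleep-1
        template: sleeper
  - name: sleeper
    # Retry only this pod when it errors (for example, when it is deleted).
    retryStrategy:
      limit: 1
      retryPolicy: OnError
      backoff:
        duration: "15"
    container:
      image: alpine:latest
      command: ["sleep", "20"]
  - name: failer
    # Same retry settings as in the original report, scoped to this template.
    retryStrategy:
      limit: 1
      retryPolicy: OnError
      backoff:
        duration: "15"
    script:
      image: alpine:latest
      command: ["sh"]
      source: |
        sleep 30 && exit 1
```

If the assumption holds, deleting the first-sleep-2 pod twice with this variant should fail that step after its single retry without restarting first-sleep-1.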

argoproj / argo-workflows

Misleading documentation for OnError at Workflow level #5073
