argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Misleading documentation for OnError at Workflow level #5073

Open endzyme opened 3 years ago

endzyme commented 3 years ago

Summary

The documented pattern of setting a Workflow-level retryStrategy to tolerate pod deletion has an unexpected side effect. If a step fails with an error under the OnError policy and exhausts the limit set at the top level, the entire workflow is run again. Is this desirable or intended? https://argoproj.github.io/argo-workflows/tolerating-pod-deletion/

I suggest either adding a warning about this behavior to the documentation, or changing the behavior so that a Workflow-level retryStrategy only cascades to child steps and does not retry the entire workflow.

Use Cases

https://argoproj.github.io/argo-workflows/tolerating-pod-deletion/

Reproducing

Argo v2.12.3

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-deletion-
spec:
  entrypoint: start
  retryStrategy:
    limit: 1
    backoff:
      duration: "15"
    retryPolicy: OnError
  templates:
  - name: start
    steps:
    - - name: first-sleep-1
        template: sleeper
      - name: first-sleep-2
        template: failer
      - name: first-sleep-3
        template: sleeper
    - - name: second-sleep-1
        template: sleeper
    - - name: third-sleep-1
        template: sleeper
  - name: sleeper
    container:
      image: alpine:latest
      command: ["sleep", "20"]
  - name: failer
    script:
      image: alpine:latest
      command: ["sh"]
      source: |
        sleep 30 && exit 1

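Submit this workflow and delete the first-sleep-2 pod twice: once during the original attempt and once during its retry, so that the limit of 1 is exhausted. The entire workflow will then run again.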
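
For comparison, here is a sketch of a possible workaround. It is untested against v2.12.3 and is not confirmed anywhere in this thread. It trims the steps from the reproduction above and moves the retryStrategy from the Workflow spec onto the leaf templates. The assumption is that the "start" steps template, which then has no retry policy of its own, will not be re-run as a whole when a child exhausts its retries.

```yaml
# Sketch only: retryStrategy is set per leaf template instead of on spec,
# so the "start" steps template carries no retry policy.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-deletion-
spec:
  entrypoint: start
  templates:
  - name: start
    steps:
    - - name: first-sleep-1
        template: sleeper
      - name: first-sleep-2
        template: failer
    - - name: second-sleep-1
        template: sleeper
  - name: sleeper
    # Retry this pod once if it errors (for example, the pod is deleted).
    retryStrategy:
      limit: 1
      retryPolicy: OnError
    container:
      image: alpine:latest
      command: ["sleep", "20"]
  - name: failer
    retryStrategy:
      limit: 1
      retryPolicy: OnError
    script:
      image: alpine:latest
      command: ["sh"]
      source: |
        sleep 30 && exit 1
```

The trade-off is repetition: every template that should tolerate pod deletion needs its own retryStrategy block, which is the boilerplate the Workflow-level setting was presumably meant to avoid.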

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

mahi072 commented 1 year ago

I would like to work on this issue.