argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Retry does not work correctly with a DAG when there are steps which have dependencies #10425

Open RoryDoherty opened 1 year ago

RoryDoherty commented 1 year ago

What happened/what you expected to happen?

Consider an Argo Workflows DAG with the following structure:

flowchart TD
A[A - Setup test infrastructure] --> B[B - Run Tests 1]
A --> C[C - Run Tests 2]
A --> D[D - Run Tests 2]
A --> E[E - Run Tests 2]
B --> F[OnExit - Cleanup test Infrastructure]
C --> F
D --> F
E --> F

If everything passes except step C, I may want to retry only that step. However, if I click Retry in the UI, only step C is re-run, and it will obviously fail again because the infrastructure that step A sets up no longer exists, even though the DAG specifies `depends: a` for C.

Also, the onExit steps do not run again after a retry. Is there a way to specify that they should always run, even during a retry?

More info can be found here: https://github.com/argoproj/argo-workflows/discussions/7534
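The behavior expected above can be sketched as a small graph computation: the set to re-run is every failed task plus everything it transitively depends on. This only illustrates the request and is not Argo's retry logic. `DEPENDS`, `tasks_to_rerun`, and the task names are hypothetical and simply mirror the demo DAG.

```python
from typing import Dict, List, Set

# Hypothetical model of the demo DAG: task name -> tasks it depends on.
DEPENDS: Dict[str, List[str]] = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
    "e": ["a"],
}


def tasks_to_rerun(depends: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return the failed tasks plus every task they transitively depend on."""
    rerun: Set[str] = set()
    stack = list(failed)
    while stack:
        task = stack.pop()
        if task in rerun:
            continue
        rerun.add(task)
        # Walk upstream so that prerequisites such as setup are included.
        stack.extend(depends.get(task, []))
    return rerun


# Only c failed, but a (setup) has to run again before c can pass.
rerun = tasks_to_rerun(DEPENDS, {"c"})
print(sorted(rerun))  # prints ['a', 'c']
```

Under this model b, d, and e keep their earlier results. Whether the exit handler should also run again after the retry is the second question raised above.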

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: demo-
spec:
  entrypoint: demo
  onExit: cleanup
  templates:
    - name: demo
      dag:
        tasks:
          - name: a
            template: setup
          - name: b
            depends: a
            template: run-test
          - name: c
            depends: a
            template: fail-test
          - name: d
            depends: a
            template: run-test
          - name: e
            depends: a
            template: run-test

    - name: setup
      script:
        image: ubuntu:18.04
        imagePullPolicy: Always
        command: [bash]
        workingDir: "/src"
        source: |
          sleep 5
          echo Running setup step
          date
          exit 0

    - name: run-test
      script:
        image: ubuntu:18.04
        imagePullPolicy: Always
        command: [bash]
        workingDir: "/src"
        source: |
          sleep 5
          echo Running test success
          date
          exit 0

    - name: fail-test
      script:
        image: ubuntu:18.04
        imagePullPolicy: Always
        command: [bash]
        workingDir: "/src"
        source: |
          sleep 5
          echo Running test failure
          date
          exit 1

    - name: cleanup
      script:
        image: ubuntu:18.04
        imagePullPolicy: Always
        command: [bash]
        workingDir: "/src"
        source: |
          sleep 5
          echo Running cleanup step
          date
          exit 0

Logs from the workflow controller

time="2023-01-30T10:26:49.919Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.926Z" level=info msg="Updated phase  -> Running" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.926Z" level=info msg="DAG node demo-hsqz4 initialized Running" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.926Z" level=info msg="All of node demo-hsqz4.a dependencies [] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.926Z" level=info msg="Pod node demo-hsqz4-1639301744 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.946Z" level=info msg="Created pod: demo-hsqz4.a (demo-hsqz4-setup-1639301744)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.946Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.946Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:49.960Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862605919 workflow=demo-hsqz4
time="2023-01-30T10:26:59.947Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:59.948Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=0 workflow=demo-hsqz4
time="2023-01-30T10:26:59.948Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1639301744 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:26:59.949Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:59.949Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:26:59.954Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-setup-1639301744/terminateContainers
time="2023-01-30T10:26:59.956Z" level=info msg="https://10.164.16.1:443/api/v1/namespaces/argoci/pods/demo-hsqz4-setup-1639301744/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2023-01-30T10:26:59.963Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862606089 workflow=demo-hsqz4
time="2023-01-30T10:27:00.239Z" level=info msg="signaled container" container=wait error="Internal error occurred: error executing command in container: failed to exec in container: failed to start exec \"3ecd4703ab024b084c15d9cb94eb4e8eec1254ba6fa2363c6968417757b157ba\": OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown" namespace=argoci pod=demo-hsqz4-setup-1639301744 stderr="<nil>" stdout="<nil>"
time="2023-01-30T10:27:09.964Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.965Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=1 workflow=demo-hsqz4
time="2023-01-30T10:27:09.965Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1639301744 workflow=demo-hsqz4
time="2023-01-30T10:27:09.965Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Succeeded new.progress=0/1 nodeID=demo-hsqz4-1639301744 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:09.965Z" level=info msg="All of node demo-hsqz4.b dependencies [a] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.965Z" level=info msg="Pod node demo-hsqz4-1689634601 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.979Z" level=info msg="Created pod: demo-hsqz4.b (demo-hsqz4-run-test-1689634601)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.980Z" level=info msg="All of node demo-hsqz4.c dependencies [a] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.980Z" level=info msg="Pod node demo-hsqz4-1672856982 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.994Z" level=info msg="Created pod: demo-hsqz4.c (demo-hsqz4-fail-test-1672856982)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.995Z" level=info msg="All of node demo-hsqz4.d dependencies [a] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:09.995Z" level=info msg="Pod node demo-hsqz4-1723189839 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.008Z" level=info msg="Created pod: demo-hsqz4.d (demo-hsqz4-run-test-1723189839)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.008Z" level=info msg="All of node demo-hsqz4.e dependencies [a] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.008Z" level=info msg="Pod node demo-hsqz4-1706412220 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.024Z" level=info msg="Created pod: demo-hsqz4.e (demo-hsqz4-run-test-1706412220)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.024Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.024Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:10.043Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862606251 workflow=demo-hsqz4
time="2023-01-30T10:27:15.045Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-setup-1639301744/deletePod
time="2023-01-30T10:27:19.981Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:19.982Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=2 workflow=demo-hsqz4
time="2023-01-30T10:27:19.983Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1672856982 workflow=demo-hsqz4
time="2023-01-30T10:27:19.983Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1706412220 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:19.983Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1723189839 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:19.983Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1672856982 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:19.983Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1689634601 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:19.984Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:19.984Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:19.988Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-run-test-1706412220/terminateContainers
time="2023-01-30T10:27:19.988Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-run-test-1723189839/terminateContainers
time="2023-01-30T10:27:19.988Z" level=info msg="https://10.164.16.1:443/api/v1/namespaces/argoci/pods/demo-hsqz4-run-test-1706412220/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2023-01-30T10:27:19.989Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-run-test-1689634601/terminateContainers
time="2023-01-30T10:27:19.989Z" level=info msg="https://10.164.16.1:443/api/v1/namespaces/argoci/pods/demo-hsqz4-run-test-1723189839/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2023-01-30T10:27:19.988Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-fail-test-1672856982/terminateContainers
time="2023-01-30T10:27:19.992Z" level=info msg="https://10.164.16.1:443/api/v1/namespaces/argoci/pods/demo-hsqz4-run-test-1689634601/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2023-01-30T10:27:19.997Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862606435 workflow=demo-hsqz4
time="2023-01-30T10:27:20.141Z" level=info msg="signaled container" container=wait error="Internal error occurred: error executing command in container: failed to exec in container: failed to start exec \"1eed666386f2d2cd6a2213706483922cb6b4aa3412e9ac74c660eb80d85b6ba3\": OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown" namespace=argoci pod=demo-hsqz4-run-test-1689634601 stderr="<nil>" stdout="<nil>"
time="2023-01-30T10:27:20.167Z" level=info msg="signaled container" container=wait error="Internal error occurred: error executing command in container: failed to exec in container: failed to start exec \"ac9f0a0d61ac99d727d135a3bd718a88f6d027749e653a20bef29cdcc9bf9f86\": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: read init-p: connection reset by peer: unknown" namespace=argoci pod=demo-hsqz4-run-test-1706412220 stderr="<nil>" stdout="<nil>"
time="2023-01-30T10:27:20.179Z" level=info msg="signaled container" container=wait error="Internal error occurred: error executing command in container: failed to exec in container: failed to start exec \"f37a06b6328784cfdf863f338ad96bbbf29f57d2a5abeecce960045342cbe6a8\": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown" namespace=argoci pod=demo-hsqz4-run-test-1723189839 stderr="<nil>" stdout="<nil>"
time="2023-01-30T10:27:29.997Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=5 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1723189839 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1706412220 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1689634601 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Succeeded new.progress=0/1 nodeID=demo-hsqz4-1706412220 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Succeeded new.progress=0/1 nodeID=demo-hsqz4-1723189839 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Succeeded new.progress=0/1 nodeID=demo-hsqz4-1689634601 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:29.999Z" level=info msg="node changed" namespace=argoci new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=demo-hsqz4-1672856982 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:30.000Z" level=info msg="Outbound nodes of demo-hsqz4 set to [demo-hsqz4-1689634601 demo-hsqz4-1672856982 demo-hsqz4-1723189839 demo-hsqz4-1706412220]" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.000Z" level=info msg="node demo-hsqz4 phase Running -> Failed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.000Z" level=info msg="node demo-hsqz4 finished: 2023-01-30 10:27:30.000828103 +0000 UTC" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.000Z" level=info msg="Checking daemoned children of demo-hsqz4" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.000Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.001Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.001Z" level=info msg="Running OnExit handler: cleanup" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.001Z" level=info msg="Pod node demo-hsqz4-1813653132 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.022Z" level=info msg="Created pod: demo-hsqz4.onExit (demo-hsqz4-cleanup-1813653132)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:30.036Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862606609 workflow=demo-hsqz4
time="2023-01-30T10:27:30.239Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-setup-1639301744/killContainers
time="2023-01-30T10:27:35.038Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-fail-test-1672856982/deletePod
time="2023-01-30T10:27:35.038Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-run-test-1689634601/deletePod
time="2023-01-30T10:27:35.038Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-run-test-1723189839/deletePod
time="2023-01-30T10:27:35.038Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-run-test-1706412220/deletePod
time="2023-01-30T10:27:40.023Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:40.024Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=6 workflow=demo-hsqz4
time="2023-01-30T10:27:40.025Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1813653132 workflow=demo-hsqz4
time="2023-01-30T10:27:40.025Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1813653132 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:40.025Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:40.025Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:40.025Z" level=info msg="Running OnExit handler: cleanup" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:40.030Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-cleanup-1813653132/terminateContainers
time="2023-01-30T10:27:40.038Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862606761 workflow=demo-hsqz4
time="2023-01-30T10:27:49.990Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-fail-test-1672856982/killContainers
time="2023-01-30T10:27:50.040Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=6 workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Succeeded new.progress=0/1 nodeID=demo-hsqz4-1813653132 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Running OnExit handler: cleanup" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Updated phase Running -> Failed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Marking workflow completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Marking workflow as pending archiving" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.042Z" level=info msg="Checking daemoned children of " namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:27:50.048Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-1340600742-agent/deletePod
time="2023-01-30T10:27:50.058Z" level=info msg="Workflow update successful" namespace=argoci phase=Failed resourceVersion=862606889 workflow=demo-hsqz4
time="2023-01-30T10:27:50.142Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-run-test-1689634601/killContainers
time="2023-01-30T10:27:50.145Z" level=info msg="archiving workflow" namespace=argoci uid=3719303d-fee6-4419-8d06-a068474027e9 workflow=demo-hsqz4
time="2023-01-30T10:27:50.167Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-run-test-1706412220/killContainers
time="2023-01-30T10:27:50.180Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-run-test-1723189839/killContainers
time="2023-01-30T10:27:50.203Z" level=info msg="Queueing Failed workflow argoci/demo-hsqz4 for delete in 168h0m0s due to TTL"
time="2023-01-30T10:27:55.144Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-cleanup-1813653132/deletePod
time="2023-01-30T10:28:10.031Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-cleanup-1813653132/killContainers
time="2023-01-30T10:28:41.285Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.286Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=0 workflow=demo-hsqz4
time="2023-01-30T10:28:41.286Z" level=info msg="All of node demo-hsqz4.c dependencies [a] completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.287Z" level=info msg="Pod node demo-hsqz4-1672856982 initialized Pending" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.306Z" level=info msg="Created pod: demo-hsqz4.c (demo-hsqz4-fail-test-1672856982)" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.306Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.306Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:41.324Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862607629 workflow=demo-hsqz4
time="2023-01-30T10:28:51.308Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:51.310Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=1 workflow=demo-hsqz4
time="2023-01-30T10:28:51.310Z" level=info msg="task-result changed" namespace=argoci nodeID=demo-hsqz4-1672856982 workflow=demo-hsqz4
time="2023-01-30T10:28:51.310Z" level=info msg="node changed" namespace=argoci new.message= new.phase=Running new.progress=0/1 nodeID=demo-hsqz4-1672856982 old.message= old.phase=Pending old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:28:51.310Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:51.310Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:28:51.315Z" level=info msg="cleaning up pod" action=terminateContainers key=argoci/demo-hsqz4-fail-test-1672856982/terminateContainers
time="2023-01-30T10:28:51.324Z" level=info msg="Workflow update successful" namespace=argoci phase=Running resourceVersion=862607766 workflow=demo-hsqz4
time="2023-01-30T10:29:02.867Z" level=info msg="Processing workflow" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.869Z" level=info msg="Task-result reconciliation" namespace=argoci numObjs=1 workflow=demo-hsqz4
time="2023-01-30T10:29:02.869Z" level=info msg="node changed" namespace=argoci new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=demo-hsqz4-1672856982 old.message= old.phase=Running old.progress=0/1 workflow=demo-hsqz4
time="2023-01-30T10:29:02.869Z" level=info msg="Outbound nodes of demo-hsqz4 set to [demo-hsqz4-1689634601 demo-hsqz4-1672856982 demo-hsqz4-1723189839 demo-hsqz4-1706412220]" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.869Z" level=info msg="node demo-hsqz4 phase Running -> Failed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.869Z" level=info msg="node demo-hsqz4 finished: 2023-01-30 10:29:02.869989182 +0000 UTC" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Checking daemoned children of demo-hsqz4" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="TaskSet Reconciliation" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg=reconcileAgentPod namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Running OnExit handler: cleanup" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Updated phase Running -> Failed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Marking workflow completed" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Marking workflow as pending archiving" namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.870Z" level=info msg="Checking daemoned children of " namespace=argoci workflow=demo-hsqz4
time="2023-01-30T10:29:02.876Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-1340600742-agent/deletePod
time="2023-01-30T10:29:02.886Z" level=info msg="Workflow update successful" namespace=argoci phase=Failed resourceVersion=862607942 workflow=demo-hsqz4
time="2023-01-30T10:29:02.933Z" level=info msg="archiving workflow" namespace=argoci uid=3719303d-fee6-4419-8d06-a068474027e9 workflow=demo-hsqz4
time="2023-01-30T10:29:02.971Z" level=info msg="Queueing Failed workflow argoci/demo-hsqz4 for delete in 168h0m0s due to TTL"
time="2023-01-30T10:29:07.934Z" level=info msg="cleaning up pod" action=deletePod key=argoci/demo-hsqz4-fail-test-1672856982/deletePod
time="2023-01-30T10:29:21.316Z" level=info msg="cleaning up pod" action=killContainers key=argoci/demo-hsqz4-fail-test-1672856982/killContainers

Logs from your workflow's wait container

N/A

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

RoryDoherty commented 1 year ago

This is still an issue

terrytangyuan commented 1 year ago

The relevant code is in https://github.com/argoproj/argo-workflows/blob/master/workflow/util/util.go#L804

Would anyone like to submit a PR to fix this?

RoryDoherty commented 1 year ago

Thanks for pointing me in the right direction. I'll take a stab at this next week if I get a chance :+1:

RoryDoherty commented 1 year ago

@terrytangyuan I've managed to update the code in FormulateRetryWorkflow to return a workflow with successful dependencies removed from the DAG, and this works well with unit tests.

However, I'm now running into an issue in buildLocalScopeFromTask, specifically on this line: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/dag.go#L588. Because some of the subsequent tasks have completed and the initial task they depend on is not available yet, the workflow fails to be submitted. Would you have any ideas on how to get around this? In FormulateRetryWorkflow, is there a way of telling the step to run again without removing it?
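The failure described above can be pictured with a toy model. This is not the controller's code: `build_scope`, `ANCESTORS`, and the node dictionaries are hypothetical. The only assumption taken from the comment is that evaluating a task requires a node record for each task it depends on. Under that assumption, deleting the record for `a` breaks the lookup for tasks that have already completed, whereas keeping the record and resetting its phase does not.

```python
from typing import Dict, List

# Hypothetical ancestry for the demo DAG: task -> tasks it depends on.
ANCESTORS: Dict[str, List[str]] = {"a": [], "b": ["a"], "c": ["a"]}


def build_scope(task: str, nodes: Dict[str, dict]) -> Dict[str, str]:
    """Collect the phase of each ancestor; fail if an ancestor has no node record."""
    scope: Dict[str, str] = {}
    for ancestor in ANCESTORS[task]:
        if ancestor not in nodes:
            raise LookupError(f"ancestor task node {ancestor} not found")
        scope[f"tasks.{ancestor}.status"] = nodes[ancestor]["phase"]
    return scope


# Option 1: drop a's node record so that it runs again.
removed = {"b": {"phase": "Succeeded"}}
try:
    build_scope("b", removed)
    removed_ok = True
except LookupError:
    removed_ok = False

# Option 2: keep a's node record but reset its phase.
reset = {"a": {"phase": "Pending"}, "b": {"phase": "Succeeded"}}
reset_scope = build_scope("b", reset)

print(removed_ok, reset_scope)  # prints False {'tasks.a.status': 'Pending'}
```

Whether the real controller can treat a reset node this way is exactly the open question in the comment.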

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

RoryDoherty commented 1 year ago

This is not stale, I just need some guidance

stale[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

RoryDoherty commented 10 months ago

Not stale, need help in how to proceed

wesleyscholl commented 10 months ago

We are also experiencing a similar issue using steps.

wesleyscholl commented 10 months ago

> @terrytangyuan I've managed to update the code in FormulateRetryWorkflow to return a workflow with successful dependencies removed from the DAG, and this works well with unit tests.
>
> However, I'm now running into an issue in buildLocalScopeFromTask, specifically on this line: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/dag.go#L588. Because some of the subsequent tasks have completed and the initial task they depend on is not available yet, the workflow fails to be submitted. Would you have any ideas on how to get around this? In FormulateRetryWorkflow, is there a way of telling the step to run again without removing it?

Can you send a link to your code? I'd like to take a look and help if I can. Thanks

RoryDoherty commented 9 months ago

@wesleyscholl I've just pushed my local code here: https://github.com/RoryDoherty/argo-workflows/tree/fix-retry-dag

The code requires a rebase, and there are conflicts that I don't have time to resolve right now, but it should give you an idea of what I was attempting.