argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
templateReferencing: Secure causing workflows to become stuck when workflow templates change #13850

Open coreyhinkle opened 2 weeks ago

coreyhinkle commented 2 weeks ago

What happened? What did you expect to happen?

When a workflow template is changed while a workflow that uses it is running, and the controller is set to templateReferencing: Secure, I expect the workflow to fail. Instead, if the workflow is on its last step when the template is edited, it stays stuck in the Running state indefinitely after the script completes.

I was able to reproduce this with the workflow template below: start a workflow from it, wait for it to reach the sleep, and then edit the template to add echo "test" after the sleep.

Version(s)

v3.5.10, v3.5.12, c702ab72433eb8cd26db07f0025dceba91e5e994c8071b0df89b27b63a73f0d2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test
  namespace: argo
spec:
  entrypoint: bash-script-example
  templates:
  - name: bash-script-example
    steps:
    - - name: print
        template: print-message
  - name: print-message
    script:
      image: busybox
      command: ["sh"]
      source: |
        echo "about to sleep"
        sleep 60
        echo "done sleeping"

Logs from the workflow controller

time="2024-11-01T15:21:02.866Z" level=info msg="Processing workflow" Phase= ResourceVersion=2198451 namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="Steps node test-f9gk9 initialized Running" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="StepGroup node test-f9gk9-2828767984 initialized Running" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.880Z" level=info msg="Pod node test-f9gk9-1379899069 initialized Pending" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.905Z" level=info msg="Created pod: test-f9gk9[0].print (test-f9gk9-print-message-1379899069)" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.905Z" level=info msg="Workflow step group node test-f9gk9-2828767984 not yet completed" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.905Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.905Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:02.914Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2198454 workflow=test-f9gk9
time="2024-11-01T15:21:12.867Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2198454 namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:12.867Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=test-f9gk9
time="2024-11-01T15:21:12.867Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=test-f9gk9-1379899069 old.message= old.phase=Pending old.progress=0/1 workflow=test-f9gk9
time="2024-11-01T15:21:12.868Z" level=info msg="Workflow step group node test-f9gk9-2828767984 not yet completed" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:12.868Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:12.868Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-f9gk9
time="2024-11-01T15:21:12.874Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2198616 workflow=test-f9gk9
time="2024-11-01T15:22:16.126Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2198616 namespace=argo workflow=test-f9gk9
time="2024-11-01T15:22:16.127Z" level=error msg="Unable to set ExecWorkflow" error="WorkflowSpec may not change during execution when the controller is set `templateReferencing: Secure`" namespace=argo workflow=test-f9gk9

Logs from your workflow's wait container

time="2024-11-01T15:22:06.452Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-11-01T15:22:06.452Z" level=info msg="No output parameters"
time="2024-11-01T15:22:06.452Z" level=info msg="No output artifacts"
time="2024-11-01T15:22:06.453Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: test-f9gk9/test-f9gk9-print-message-1379899069/main.log"
time="2024-11-01T15:22:06.453Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-11-01T15:22:06.453Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=test-f9gk9/test-f9gk9-print-message-1379899069/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-11-01T15:22:06.462Z" level=info msg="Save artifact" artifactName=main-logs duration=8.99804ms error="<nil>" key=test-f9gk9/test-f9gk9-print-message-1379899069/main.log
time="2024-11-01T15:22:06.462Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-11-01T15:22:06.462Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-11-01T15:22:06.476Z" level=info msg="Alloc=9151 TotalAlloc=17202 Sys=24149 NumGC=5 Goroutines=10"

shuangkun commented 2 weeks ago

It seems to be related to the WorkflowTaskResult not being completed. After encountering "WorkflowSpec may not change during execution when the controller is set `templateReferencing: Secure`", I saw that the workflow-controller executed:

        err := woc.setStoredWfSpec()
        if err != nil {
            woc.markWorkflowError(ctx, err)
            return err
        }
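
To make the expected behavior concrete, here is a minimal, self-contained sketch of the control flow in the snippet above. It is a toy model and not the controller's actual code: the `spec` and `workflow` types and the `setStoredSpec`, `markError`, and `operate` functions are hypothetical stand-ins, and the assumption that marking an error moves the workflow to the Error phase is the reporter's expectation from this issue, not something the logs confirm.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical, drastically simplified stand-in for a workflow spec.
type spec struct {
	Source string
}

// workflow is a hypothetical stand-in holding only what this sketch needs.
type workflow struct {
	Phase      string
	Message    string
	StoredSpec spec // the spec captured when the workflow started
}

// Error text copied from the controller log in this issue.
var errSpecChanged = errors.New("WorkflowSpec may not change during execution when the controller is set `templateReferencing: Secure`")

// setStoredSpec models the check: reject any spec that differs from the stored one.
func setStoredSpec(wf *workflow, current spec) error {
	if wf.StoredSpec != current {
		return errSpecChanged
	}
	return nil
}

// markError models the assumed outcome: the workflow leaves Running with phase Error.
func markError(wf *workflow, err error) {
	wf.Phase = "Error"
	wf.Message = err.Error()
}

// operate mirrors the shape of the quoted snippet: check, mark error, return.
func operate(wf *workflow, current spec) error {
	if err := setStoredSpec(wf, current); err != nil {
		markError(wf, err)
		return err
	}
	return nil
}

func main() {
	wf := &workflow{Phase: "Running", StoredSpec: spec{Source: "sleep 60"}}
	// Simulate the template being edited while the workflow is running.
	err := operate(wf, spec{Source: "sleep 60\necho \"test\""})
	fmt.Println(wf.Phase, err != nil)
}
```

In this model the workflow ends in a terminal phase as soon as the spec change is detected. The report above describes the real workflow staying in Running instead, which is the discrepancy under discussion.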