argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.43k stars 3.09k forks source link

Retrying workflows have short deadline ignoring activeDeadlineSeconds #13044

Closed leesungbin closed 1 week ago

leesungbin commented 2 weeks ago

Pre-requisites

What happened/what did you expect to happen?

I expect that the activeDeadlineSeconds parameter should also apply to retrying workflow pod containers. However, it appears that the retrying container has a short deadline of approximately 1 minute.

Additionally, when I set the activeDeadlineSeconds at the template level, the deadline for the first workflow's wait container is set to 0001-01-01 00:00:00 +0000 UTC, while the second workflow's wait container has a deadline of about 1 minute.

Is this behavior intentional?

Below are the logs resulting from setting the workflow level activeDeadlineSeconds.

Name:                retry-sample-knssl
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             Max duration limit exceeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Mon May 13 11:42:44 +0900 (3 minutes ago)
Started:             Mon May 13 11:42:44 +0900 (3 minutes ago)
Finished:            Mon May 13 11:45:55 +0900 (now)
Duration:            3 minutes 11 seconds
Progress:            0/2
ResourcesDuration:   17s*(1 cpu),4m29s*(100Mi memory)

STEP                        TEMPLATE       PODNAME                                      DURATION  MESSAGE
 ✖ retry-sample-knssl       retry-example                                                         Max duration limit exceeded
 ├─✖ retry-sample-knssl(0)  retry-example  retry-sample-knssl-retry-example-2352189541  2m        Error (exit code 10)
 └─✖ retry-sample-knssl(1)  retry-example  retry-sample-knssl-retry-example-2151005208  47s       Error (exit code 143)

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-sample-
spec:
  activeDeadlineSeconds: 3600
  entrypoint: retry-example
  templates:
    - name: retry-example
      retryStrategy:
        limit: 2
        backoff:
          duration: 10s
          factor: 2
          maxDuration: 3m
      container:
        image: alpine
        command: [sh, -c]
        args: ["sleep 120; exit 10"]

Logs from the workflow controller

time="2024-05-13T02:42:44.653Z" level=info msg="Processing workflow" Phase= ResourceVersion=1899 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="Retry node retry-sample-knssl initialized Running" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="Pod node retry-sample-knssl-2352189541 initialized Pending" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg="Created pod: retry-sample-knssl(0) (retry-sample-knssl-retry-example-2352189541)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.694Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1902 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.659Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1902 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.660Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.660Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-sample-knssl-2352189541 old.message= old.phase=Pending old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.687Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1928 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.695Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1928 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.697Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.697Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.698Z" level=info msg="node unchanged" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.701Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.701Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.728Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1935 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.542Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1935 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.544Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.544Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.545Z" level=info msg="node unchanged" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.552Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.552Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.578Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1935 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.808Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1935 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.809Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.809Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.810Z" level=info msg="Pod failed: Error (exit code 10)" displayName="retry-sample-knssl(0)" namespace=argo pod=retry-sample-knssl-retry-example-2352189541 templateName=retry-example workflow=retry-sample-knssl
time="2024-05-13T02:44:58.810Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 10)" new.phase=Failed new.progress=0/1 nodeID=retry-sample-knssl-2352189541 old.message= old.phase=Running old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.811Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon May 13 02:45:44 +0000 (45 seconds from now)" namespace=argo node=retry-sample-knssl workflow=retry-sample-knssl
time="2024-05-13T02:44:58.811Z" level=info msg="Retry Policy: OnFailure (onFailed: true, onError false)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="1 child nodes of retry-sample-knssl failed. Trying again..." namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="Pod node retry-sample-knssl-2151005208 initialized Pending" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.856Z" level=info msg="Created pod: retry-sample-knssl(1) (retry-sample-knssl-retry-example-2151005208)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.856Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.857Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1999 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.897Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-sample-knssl-retry-example-2352189541/labelPodCompleted
time="2024-05-13T02:45:08.863Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1999 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.864Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.865Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.865Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2151005208 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.866Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-sample-knssl-2151005208 old.message= old.phase=Pending old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.868Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.868Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.892Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2022 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.935Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2022 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.938Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.938Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.939Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2151005208 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.940Z" level=info msg="Pod failed: Error (exit code 143)" displayName="retry-sample-knssl(1)" namespace=argo pod=retry-sample-knssl-retry-example-2151005208 templateName=retry-example workflow=retry-sample-knssl
time="2024-05-13T02:45:55.940Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 143)" new.phase=Failed new.progress=0/1 nodeID=retry-sample-knssl-2151005208 old.message= old.phase=Running old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="Max duration limit exceeded. Failing..." namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl phase Running -> Failed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl message: Max duration limit exceeded" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl finished: 2024-05-13 02:45:55.941690803 +0000 UTC" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Updated message  -> Max duration limit exceeded" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Marking workflow completed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.948Z" level=info msg="cleaning up pod" action=deletePod key=argo/retry-sample-knssl-1340600742-agent/deletePod
time="2024-05-13T02:45:55.966Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=2054 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.994Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-sample-knssl-retry-example-2151005208/labelPodCompleted

Logs from in your workflow's wait container

* first workflow (has deadline with 1 hour)
time="2024-05-13T02:44:48.780Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-05-13T02:44:48.780Z" level=info msg="No output parameters"
time="2024-05-13T02:44:48.780Z" level=info msg="No output artifacts"
time="2024-05-13T02:44:48.781Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log"
time="2024-05-13T02:44:48.781Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-05-13T02:44:48.781Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="Save artifact" artifactName=main-logs duration=10.050083ms error="<nil>" key=retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-05-13T02:44:48.803Z" level=info msg="Alloc=7730 TotalAlloc=13965 Sys=23909 NumGC=5 Goroutines=10"

* second workflow(has short deadline less than 1 minute)
time="2024-05-13T02:45:00.792Z" level=info msg="Starting Workflow Executor" version=v3.5.6
time="2024-05-13T02:45:00.795Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-05-13T02:45:00.795Z" level=info msg="Executor initialized" deadline="2024-05-13 02:45:44 +0000 UTC" includeScriptOutput=false namespace=argo podName=retry-sample-knssl-retry-example-2151005208 templateName=retry-example version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:56:53Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/arm64,}"
time="2024-05-13T02:45:00.806Z" level=info msg="Starting deadline monitor"
time="2024-05-13T02:45:44.001Z" level=info msg="Step exceeded its deadline"
time="2024-05-13T02:45:44.004Z" level=info msg="Killing containers"
time="2024-05-13T02:45:45.866Z" level=info msg="Main container completed" error="<nil>"
time="2024-05-13T02:45:45.866Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-05-13T02:45:45.866Z" level=info msg="No output parameters"
time="2024-05-13T02:45:45.866Z" level=info msg="No output artifacts"
time="2024-05-13T02:45:45.867Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-sample-knssl/retry-sample-knssl-retry-example-2151005208/main.log"
leesungbin commented 2 weeks ago

I want to confirm if I am misunderstanding the maxDuration parameter.

In the documentation, it states, "maxDuration is the maximum amount of time allowed for a workflow in the backoff strategy." Based on this, I thought that maxDuration only affects the backoff process.

However, in the code, I noticed that opts.executionDeadline is overridden by maxDurationDeadline. https://github.com/argoproj/argo-workflows/blob/9cacef302bca869da1700b72bede1afad0e9526d/workflow/controller/operator.go#L1054

Is this intentional? If so, I don't think this is a bug. If not, I believe some tests should also be updated. (ex, func TestPropagateMaxDurationProcess in workflow/controller/operator_test.go)

agilgur5 commented 5 days ago

We may want to reconsider this later within the context of #10341, i.e. if pending and active time are split in two. The backoff maxDuration may make more sense for the pending state / pendingTimeout only, while activeDeadlineSeconds may make more sense to only apply to the active state.