argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.91k stars 3.18k forks source link

DAG exit-handler stuck when workflow with image pull failure is stopped. #13412

Open svenetjok opened 2 months ago

svenetjok commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

Stopping a workflow with its step in ErrImagePull/ErrImageNeverPull state causes a DAG format exit handler to be stuck. The exit-handler step may or may not be invoked (as in whether the main-container is run or not), in the end the exit-handler is left either in Pending or in Running state. Garbage collecting of Pods does not happen either as workflow does not seem to end its execution, although its state is indicated as Failed.

Invoking a plain template or a steps format template as exit-handler seems to not have this issue, problem is reproducible when a DAG format exit handler is used. To be noted is that similar symptoms have been occasionally observed with DAG exit-handler when stopping a workflow where steps have been normally running with container images available, where stopping may lead to exit-handler also be stuck in Running state. That has not been reproducible consistently with a test workflow but the symptoms have been similar to this case. Reporting this also in hoping that maybe such issues get also resolved if there is a possible fix for this problem.

Attached minimal workflow demonstrates the problem with DAG exit handler template. The "say-hello" step invokes a template with a non-existing image tag parameter and the template has imagePullPolicy: Never to create ErrImageNeverPull status for the main-container. Stopping the workflow at this state produces the described situation. Having "alpine:latest" image in container cache should be sufficient to run this workflow.

The workflow example has also other options made available for invoking different strategy exit-handlers that can demonstrate that both plain template and a steps template work OK as exit-handler in described situation. In the real-world use case the DAG format is needed due to the complexity of the exit-handler logic needed.

Expectation is that when a workflow is stopped by user, a DAG exit-handler is run and it can be completed successfully. This should happen regardless at which state the workflow steps are at that moment.

From workflow instance data it can be seen that exit-handler steps are in Failed state, and the last step is stuck as Running when expectation is that all exit-handler steps should complete with Succeeded status.

  nodes:
    test-exit-handler-workflow:
      children:
      - test-exit-handler-workflow-2006132045
      displayName: test-exit-handler-workflow
      finishedAt: "2024-07-29T15:35:39Z"
      id: test-exit-handler-workflow
      name: test-exit-handler-workflow
      outboundNodes:
      - test-exit-handler-workflow-2006132045
      phase: Failed
      progress: 0/1
      startedAt: "2024-07-29T15:35:30Z"
      templateName: workflow-startup
      templateScope: local/test-exit-handler-workflow
      type: DAG
    test-exit-handler-workflow-316667194:
      boundaryID: test-exit-handler-workflow-3293460608
      children:
      - test-exit-handler-workflow-4122663166
      displayName: exit-handler-step
      finishedAt: "2024-07-29T15:35:39Z"
      id: test-exit-handler-workflow-316667194
      inputs:
        parameters:
        - name: time-value
          value: "1"
      name: test-exit-handler-workflow.onExit[0].exit-handler-step
      outboundNodes:
      - test-exit-handler-workflow-4122663166
      phase: Failed
      progress: 0/1
      startedAt: "2024-07-29T15:35:39Z"
      templateName: handle-exit-dag
      templateScope: local/test-exit-handler-workflow
      type: DAG
    test-exit-handler-workflow-1323598490:
      boundaryID: test-exit-handler-workflow-3293460608
      children:
      - test-exit-handler-workflow-316667194
      displayName: '[0]'
      finishedAt: "2024-07-29T15:35:39Z"
      id: test-exit-handler-workflow-1323598490
      message: child 'test-exit-handler-workflow-316667194' failed
      name: test-exit-handler-workflow.onExit[0]
      nodeFlag: {}
      phase: Failed
      progress: 0/1
      startedAt: "2024-07-29T15:35:39Z"
      templateScope: local/test-exit-handler-workflow
      type: StepGroup
    test-exit-handler-workflow-2006132045:
      boundaryID: test-exit-handler-workflow
      displayName: say-hello
      finishedAt: "2024-07-29T15:35:39Z"
      hostNodeName: node1
      id: test-exit-handler-workflow-2006132045
      inputs:
        parameters:
        - name: message
          value: Hi!
        - default: alpine:latest
          name: image
          value: alpine:notexist
      message: 'workflow shutdown with strategy:  Stop'
      name: test-exit-handler-workflow.say-hello
      phase: Failed
      progress: 0/1
      startedAt: "2024-07-29T15:35:30Z"
      templateName: echo-template
      templateScope: local/test-exit-handler-workflow
      type: Pod
    test-exit-handler-workflow-3293460608:
      children:
      - test-exit-handler-workflow-1323598490
      displayName: test-exit-handler-workflow.onExit
      finishedAt: "2024-07-29T15:35:39Z"
      id: test-exit-handler-workflow-3293460608
      message: child 'test-exit-handler-workflow-316667194' failed
      name: test-exit-handler-workflow.onExit
      nodeFlag:
        hooked: true
      outboundNodes:
      - test-exit-handler-workflow-4122663166
      phase: Failed
      progress: 0/1
      startedAt: "2024-07-29T15:35:39Z"
      templateName: exit-handler
      templateScope: local/test-exit-handler-workflow
      type: Steps
    test-exit-handler-workflow-4122663166:
      boundaryID: test-exit-handler-workflow-316667194
      displayName: sleep-and-exit
      finishedAt: null
      hostNodeName: node1
      id: test-exit-handler-workflow-4122663166
      inputs:
        parameters:
        - name: time-value
          value: "1"
      name: test-exit-handler-workflow.onExit[0].exit-handler-step.sleep-and-exit
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: test-exit-handler-workflow/test-exit-handler-workflow-handle-exit-template-4122663166/main.log
        exitCode: "0"
      phase: Running
      progress: 0/1
      startedAt: "2024-07-29T15:35:39Z"
      templateName: handle-exit-template
      templateScope: local/test-exit-handler-workflow
      type: Pod
  phase: Failed
  progress: 0/2
  startedAt: "2024-07-29T15:35:30Z"
  taskResultsCompletionStatus:
    test-exit-handler-workflow-2006132045: true
    test-exit-handler-workflow-4122663166: true

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-exit-handler-workflow
  annotations:
    description: 'Test workflow'
spec:
  podGC:
    strategy: OnPodCompletion
  entrypoint: workflow-startup
  onExit: exit-handler

  templates:
  - name: workflow-startup
    dag:
      tasks:
      - name: say-hello
        template: echo-template
        arguments:
          parameters:
            - name: message
              value: Hi!
            - name: image
              value: "alpine:notexist"

  - name: handle-exit-dag
    inputs:
      parameters:
        - name: time-value
    dag:
      tasks:
      - name: sleep-and-exit
        template: handle-exit-template
        arguments:
          parameters:
            - name: time-value
              value: "{{inputs.parameters.time-value}}"

  - name: handle-exit-steps
    inputs:
      parameters:
        - name: time-value
    steps:
      - - name: sleep-and-exit
          template: handle-exit-template
          arguments:
            parameters:
              - name: time-value
                value: "{{inputs.parameters.time-value}}"

  - name: handle-exit-template
    inputs:
      parameters:
        - name: time-value
    container:
      image: alpine:latest
      imagePullPolicy: Never
      command: [sh, -c]
      args: ["sleep {{inputs.parameters.time-value}}; echo Done!"]

  - name: echo-template
    inputs:
      parameters:
      - name: message
      - name: image
        default: "alpine:latest"
    script:
      image: "{{inputs.parameters.image}}"
      imagePullPolicy: Never
      command: [sh, -c]
      args: ["echo {{inputs.parameters.message}}"]

  - name: exit-handler
    steps:
      - - arguments:
            parameters:
              - name: time-value
                value: "1"
          name: exit-handler-step
          template: handle-exit-dag
          # template: handle-exit-template
          # template: handle-exit-steps

Logs from the workflow controller

time="2024-07-29T15:35:30.384Z" level=info msg="Processing workflow" Phase= ResourceVersion=124371 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2024-07-29T15:35:30.389Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2024-07-29T15:35:30.389Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=0 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=info msg="Updated phase  -> Running" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=info msg="DAG node test-exit-handler-workflow initialized Running" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=warning msg="was unable to obtain the node for test-exit-handler-workflow-2006132045, taskName say-hello"
time="2024-07-29T15:35:30.389Z" level=warning msg="was unable to obtain the node for test-exit-handler-workflow-2006132045, taskName say-hello"
time="2024-07-29T15:35:30.389Z" level=info msg="All of node test-exit-handler-workflow.say-hello dependencies [] completed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.389Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.390Z" level=info msg="Pod node test-exit-handler-workflow-2006132045 initialized Pending" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.390Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.400Z" level=info msg="Created pod: test-exit-handler-workflow.say-hello (test-exit-handler-workflow-echo-template-2006132045)" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.400Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.400Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:30.400Z" level=info msg="Workflow to be dehydrated" Workflow Size=3915
time="2024-07-29T15:35:30.411Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124376 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:31.479Z" level=info msg="cleaning up pod" action=killContainers key=neat-workflows/test-exit-handler-workflow-handle-exit-template-3165937232/killContainers
time="2024-07-29T15:35:32.400Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124376 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.401Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.401Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.401Z" level=info msg="node changed" namespace=neat-workflows new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-exit-handler-workflow-2006132045 old.message= old.phase=Pending old.progress=0/1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.402Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.402Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.402Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:32.402Z" level=info msg="Workflow to be dehydrated" Workflow Size=4270
time="2024-07-29T15:35:32.409Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124394 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.411Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124394 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.411Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=info msg="node changed" namespace=neat-workflows new.message="ErrImageNeverPull: Container image \"alpine:notexist\" is not present with pull policy of Never" new.phase=Pending new.progress=0/1 nodeID=test-exit-handler-workflow-2006132045 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:34.412Z" level=info msg="Workflow to be dehydrated" Workflow Size=4348
time="2024-07-29T15:35:34.419Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124405 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.421Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124405 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.422Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.422Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.422Z" level=info msg="node unchanged" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.423Z" level=warning msg="'script.source' is empty, suggest change template into 'container'" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.423Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.423Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:36.423Z" level=info msg="Workflow to be dehydrated" Workflow Size=4348
time="2024-07-29T15:35:36.429Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124405 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.843Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124411 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="node unchanged" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=neat-workflows podName=test-exit-handler-workflow-echo-template-2006132045 shutdownStrategy=Stop workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="node test-exit-handler-workflow-2006132045 phase Pending -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="node test-exit-handler-workflow-2006132045 message: workflow shutdown with strategy:  Stop" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.844Z" level=info msg="node test-exit-handler-workflow-2006132045 finished: 2024-07-29 15:35:39.844819925 +0000 UTC" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="Outbound nodes of test-exit-handler-workflow set to [test-exit-handler-workflow-2006132045]" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="node test-exit-handler-workflow phase Running -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="node test-exit-handler-workflow finished: 2024-07-29 15:35:39.84532146 +0000 UTC" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="Running OnExit handler: exit-handler" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="Steps node test-exit-handler-workflow-3293460608 initialized Running" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="StepGroup node test-exit-handler-workflow-1323598490 initialized Running" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=info msg="DAG node test-exit-handler-workflow-316667194 initialized Running" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.845Z" level=warning msg="was unable to obtain the node for test-exit-handler-workflow-4122663166, taskName sleep-and-exit"
time="2024-07-29T15:35:39.845Z" level=warning msg="was unable to obtain the node for test-exit-handler-workflow-4122663166, taskName sleep-and-exit"
time="2024-07-29T15:35:39.845Z" level=info msg="All of node test-exit-handler-workflow.onExit[0].exit-handler-step.sleep-and-exit dependencies [] completed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.846Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.846Z" level=info msg="Pod node test-exit-handler-workflow-4122663166 initialized Pending" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.850Z" level=info msg="cleaning up pod" action=terminateContainers key=neat-workflows/test-exit-handler-workflow-echo-template-2006132045/terminateContainers
time="2024-07-29T15:35:39.855Z" level=info msg="Created pod: test-exit-handler-workflow.onExit[0].exit-handler-step.sleep-and-exit (test-exit-handler-workflow-handle-exit-template-4122663166)" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="Outbound nodes of test-exit-handler-workflow-316667194 set to [test-exit-handler-workflow-4122663166]" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-316667194 phase Running -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-316667194 finished: 2024-07-29 15:35:39.85553501 +0000 UTC" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="Step group node test-exit-handler-workflow-1323598490 deemed failed: child 'test-exit-handler-workflow-316667194' failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-1323598490 phase Running -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-1323598490 message: child 'test-exit-handler-workflow-316667194' failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-1323598490 finished: 2024-07-29 15:35:39.855610079 +0000 UTC" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="step group test-exit-handler-workflow-1323598490 was unsuccessful: child 'test-exit-handler-workflow-316667194' failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="Outbound nodes of test-exit-handler-workflow-316667194 is [test-exit-handler-workflow-4122663166]" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="Outbound nodes of test-exit-handler-workflow-3293460608 is [test-exit-handler-workflow-4122663166]" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-3293460608 phase Running -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-3293460608 message: child 'test-exit-handler-workflow-316667194' failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="node test-exit-handler-workflow-3293460608 finished: 2024-07-29 15:35:39.855692882 +0000 UTC" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:39.855Z" level=info msg="cleaning up pod" action=deletePod key=neat-workflows/test-exit-handler-workflow-echo-template-2006132045/deletePod
time="2024-07-29T15:35:39.856Z" level=info msg="Workflow to be dehydrated" Workflow Size=5893
time="2024-07-29T15:35:39.866Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124418 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.860Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124418 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.861Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=2 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.861Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.861Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-4122663166 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.862Z" level=info msg="node changed" namespace=neat-workflows new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-exit-handler-workflow-4122663166 old.message= old.phase=Pending old.progress=0/1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.862Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.862Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.862Z" level=info msg="Running OnExit handler: exit-handler" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:41.862Z" level=info msg="Workflow to be dehydrated" Workflow Size=5957
time="2024-07-29T15:35:41.870Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124460 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.648Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124460 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.649Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=2 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-4122663166 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="node changed" namespace=neat-workflows new.message= new.phase=Running new.progress=0/1 nodeID=test-exit-handler-workflow-4122663166 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="Running OnExit handler: exit-handler" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.650Z" level=info msg="Workflow to be dehydrated" Workflow Size=5941
time="2024-07-29T15:35:44.658Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Running resourceVersion=124470 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:44.868Z" level=info msg="cleaning up pod" action=deletePod key=neat-workflows/test-exit-handler-workflow-echo-template-2006132045/deletePod
time="2024-07-29T15:35:46.665Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=124470 namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="Task-result reconciliation" namespace=neat-workflows numObjs=2 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-2006132045 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="task-result changed" namespace=neat-workflows nodeID=test-exit-handler-workflow-4122663166 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="node changed" namespace=neat-workflows new.message= new.phase=Running new.progress=0/1 nodeID=test-exit-handler-workflow-4122663166 old.message= old.phase=Running old.progress=0/1 workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="TaskSet Reconciliation" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg=reconcileAgentPod namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.666Z" level=info msg="Running OnExit handler: exit-handler" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.667Z" level=info msg="Updated phase Running -> Failed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.667Z" level=info msg="Updated message  -> Stopped with strategy 'Stop'" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.667Z" level=info msg="Marking workflow completed" namespace=neat-workflows workflow=test-exit-handler-workflow
time="2024-07-29T15:35:46.667Z" level=info msg="Workflow to be dehydrated" Workflow Size=6143
time="2024-07-29T15:35:46.671Z" level=info msg="cleaning up pod" action=terminateContainers key=neat-workflows/test-exit-handler-workflow-handle-exit-template-4122663166/terminateContainers
time="2024-07-29T15:35:46.672Z" level=info msg="cleaning up pod" action=deletePod key=neat-workflows/test-exit-handler-workflow-1340600742-agent/deletePod
time="2024-07-29T15:35:46.675Z" level=info msg="Workflow update successful" namespace=neat-workflows phase=Failed resourceVersion=124480 workflow=test-exit-handler-workflow
time="2024-07-29T15:36:16.672Z" level=info msg="cleaning up pod" action=killContainers key=neat-workflows/test-exit-handler-workflow-handle-exit-template-4122663166/killContainers

Logs from in your workflow's wait container

time="2024-07-29T15:35:44.790Z" level=info msg="No output parameters"
time="2024-07-29T15:35:44.790Z" level=info msg="No output artifacts"
time="2024-07-29T15:35:44.791Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: test-exit-handler-workflow/test-exit-handler-workflow-handle-exit-template-4122663166/main.log"
time="2024-07-29T15:35:44.791Z" level=info msg="Creating minio client using static credentials" endpoint="neat-minio:9000"
time="2024-07-29T15:35:44.791Z" level=info msg="Saving file to s3" bucket=argo-artifacts endpoint="neat-minio:9000" key=test-exit-handler-workflow/test-exit-handler-workflow-handle-exit-template-4122663166/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-07-29T15:35:44.797Z" level=info msg="Save artifact" artifactName=main-logs duration=6.595055ms error="<nil>" key=test-exit-handler-workflow/test-exit-handler-workflow-handle-exit-template-4122663166/main.log
time="2024-07-29T15:35:44.797Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-07-29T15:35:44.797Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-07-29T15:35:44.807Z" level=info msg="Alloc=8107 TotalAlloc=14020 Sys=25189 NumGC=4 Goroutines=10"
time="2024-07-29T15:35:44.812Z" level=info msg="Deadline monitor stopped"
shuangkun commented 10 hours ago

I have reproduced this issue and will try to find the root cause later.