argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Daemon pods keep running after DAG fails when `failFast: true` #10313

Open igorcalabria opened 1 year ago

igorcalabria commented 1 year ago

What happened/what you expected to happen?

I expect the daemon pod to be terminated when the workflow fails, but it is not. The workflow is correctly marked as failed, but the daemon pod keeps running until the workflow is deleted. I think the controller tries to delete the daemon but gets a 404 response (from the controller logs):

time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:43.915Z" level=info msg="Delete pods 404"

Some other notes:

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-nginx-
  namespace: argo
spec:
  entrypoint: daemon-nginx-example
  templates:
  - name: daemon-nginx-example
    failFast: true
    parallelism: 2
    dag:
      tasks:
      - name: nginx-server
        template: nginx-server
      - name: nginx-client
        template: nginx-client
        depends: "nginx-server"
        withParam: |
          ["one", "two"]
        arguments:
          parameters:
          - name: server-ip
            value: "{{tasks.nginx-server.ip}}"
  - name: nginx-server
    daemon: true
    container:
      image: nginx:1.13
      readinessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 2
        timeoutSeconds: 1
  - name: nginx-client
    inputs:
      parameters:
      - name: server-ip
    container:
      image: appropriate/curl:latest
      command: ["/bin/sh", "-c"]
      # Fail on purpose: no such command exists
      args: ["aaaaaaaaaa"]
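
For anyone reproducing this without a cluster, the client's failure mode can be checked locally. This is a minimal sketch that assumes a POSIX shell at `/bin/sh` and that no command named `aaaaaaaaaa` is on the `PATH`; it runs the same `command` and `args` as the `nginx-client` container and prints the exit code.

```python
import subprocess

# Same command and args as the nginx-client container in the workflow above.
# Assumption: /bin/sh is a POSIX shell and "aaaaaaaaaa" is not an installed command.
result = subprocess.run(
    ["/bin/sh", "-c", "aaaaaaaaaa"],
    capture_output=True,
    text=True,
)

# A POSIX shell reports "command not found" with exit status 127.
print(result.returncode)
```

That status matches the `Error (exit code 127)` message the controller records for the client node, which is what triggers the `failFast` path.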

Logs from the workflow controller

time="2023-01-05T18:17:03.870Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="DAG node daemon-nginx-7m8fc initialized Running" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-server dependencies [] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.880Z" level=info msg="Pod node daemon-nginx-7m8fc-1217350964 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-server (daemon-nginx-7m8fc-nginx-server-1217350964)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.886Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:03.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807268 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.886Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Node became daemoned" namespace=argo nodeId=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=daemon-nginx-7m8fc-1217350964 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="TaskGroup node daemon-nginx-7m8fc-3902071824 initialized Running (message: )" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(0:one) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.887Z" level=info msg="Pod node daemon-nginx-7m8fc-3898481205 initialized Pending" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="Created pod: daemon-nginx-7m8fc.nginx-client(0:one) (daemon-nginx-7m8fc-nginx-client-3898481205)" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="All of node daemon-nginx-7m8fc.nginx-client(1:two) dependencies [nginx-server] completed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="template (node daemon-nginx-7m8fc) active children parallelism exceeded 2" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.890Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:13.899Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807303 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 127)" new.phase=Failed new.progress=0/1 nodeID=daemon-nginx-7m8fc-3898481205 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc message: template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=info msg="node daemon-nginx-7m8fc finished: 2023-01-05 18:17:23.891758459 +0000 UTC" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.891Z" level=error msg="error in entry template execution" error="Max parallelism reached" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.895Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=807339 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:23.900Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-nginx-7m8fc-nginx-client-3898481205/labelPodCompleted
time="2023-01-05T18:17:33.895Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Updated message  -> template has failed or errored children and failFast enabled" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.895Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:33.901Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
time="2023-01-05T18:17:33.908Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=807359 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-nginx-7m8fc-1217350964 workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.909Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc
time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod
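
The cleanup entries above can be read mechanically. The following is a rough sketch, not controller code: it parses a few lines copied from this log as logfmt and lists which pods the `cleaning up pod` actions targeted. The helper names (`parse_logfmt`, `cleanup_targets`) are made up for illustration, and the `<namespace>/<pod-name>/<action>` layout of `key` is inferred from the log lines themselves.

```python
import shlex


def parse_logfmt(line):
    """Split one logfmt line (key=value, values optionally double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def cleanup_targets(lines):
    """Return (action, pod_name) for every 'cleaning up pod' entry."""
    targets = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "cleaning up pod":
            # Assumed layout of key: <namespace>/<pod-name>/<action>
            _, pod, action = fields["key"].split("/")
            targets.append((action, pod))
    return targets


# Lines copied from the controller log in this issue.
LOG = [
    'time="2023-01-05T18:17:23.900Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-nginx-7m8fc-nginx-client-3898481205/labelPodCompleted',
    'time="2023-01-05T18:17:33.895Z" level=info msg="Checking daemoned children of " namespace=argo workflow=daemon-nginx-7m8fc',
    'time="2023-01-05T18:17:33.901Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod',
    'time="2023-01-05T18:17:43.914Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-nginx-7m8fc-1340600742-agent/deletePod',
]

# Pod name taken from the "Created pod" entry for nginx-server.
DAEMON_POD = "daemon-nginx-7m8fc-nginx-server-1217350964"

targets = cleanup_targets(LOG)
deleted = {pod for action, pod in targets if action == "deletePod"}
print(sorted(deleted))
print(DAEMON_POD in deleted)
```

On this excerpt, the only `deletePod` target is the `-agent` pod, and the daemon pod never appears. That only reflects what was logged at info level here; it does not rule out the controller trying to stop the daemon through some other path.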

Logs from your workflow's wait container

time="2023-01-05T18:17:17.185Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:17.188Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:17.188Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-client-3898481205 template="{\"name\":\"nginx-client\",\"inputs\":{\"parameters\":[{\"name\":\"server-ip\",\"value\":\"10.244.0.13\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"appropriate/curl:latest\",\"command\":[\"/bin/sh\",\"-c\"],\"args\":[\"aaaaaaaaaa\"],\"resources\":{}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:17.188Z" level=info msg="Starting deadline monitor"
time="2023-01-05T18:17:20.190Z" level=info msg="Main container completed" error="<nil>"
time="2023-01-05T18:17:20.190Z" level=info msg="Deadline monitor stopped"
time="2023-01-05T18:17:20.190Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-01-05T18:17:20.190Z" level=info msg="No output parameters"
time="2023-01-05T18:17:20.190Z" level=info msg="No output artifacts"
time="2023-01-05T18:17:20.190Z" level=info msg="Alloc=6340 TotalAlloc=12280 Sys=19666 NumGC=4 Goroutines=5"
time="2023-01-05T18:17:06.942Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-01-05T18:17:06.944Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-05T18:17:06.944Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-nginx-7m8fc-nginx-server-1217350964 template="{\"name\":\"nginx-server\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"daemon\":true,\"container\":{\"name\":\"\",\"image\":\"nginx:1.13\",\"resources\":{},\"readinessProbe\":{\"httpGet\":{\"path\":\"/\",\"port\":80},\"initialDelaySeconds\":2,\"timeoutSeconds\":1}}}" version="&Version{Version:untagged,BuildDate:2023-01-05T16:21:00Z,GitCommit:0f58387c79728b84037aa96221d1c97a974402a4,GitTag:untagged,GitTreeState:clean,GoVersion:go1.18.9,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-05T18:17:06.944Z" level=info msg="Starting deadline monitor"

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

tooptoop4 commented 4 weeks ago

https://github.com/argoproj/argo-workflows/pull/10430