argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

A daemon pod that is triggered through templateRef keeps running after the workflow ends #11734

Open shiraOvadia opened 1 year ago

shiraOvadia commented 1 year ago

What happened/what you expected to happen?

The workflow's basic structure: the workflow consists of two steps. The first is a (Redis) server pod configured as a daemon pod; the second is a consumer pod that gets the IP address of the daemon pod and does something with it.

What I expect to happen: when the workflow ends, the daemon pod stops as well.

What actually happens: if I call the daemon pod via a "regular" template, the daemon pod ends as expected when the workflow ends. But if I call the daemon pod via templateRef, the daemon pod does not stop when the workflow ends and keeps running.

Version

3.4.8, 3.5.0-rc1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daemon-bug
spec:
  entrypoint: engine
  templates:
    - name: engine
      steps:
        - - name: set-daemon-srv
            templateRef:
              name: general-demo-templates
              template: daemon-template
        - - name: use-daemon-srv
            template: use-daemon-srv
            arguments:
              parameters:
                - name: srv-ip
                  value: "{{steps.set-daemon-srv.ip}}"
    - name: use-daemon-srv
      inputs:
        parameters:
          - name: srv-ip
      container:
        image: docker/whalesay
        command: [ sh, -c ]
        args: [ "echo server ip is: {{inputs.parameters.srv-ip}}" ]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: general-demo-templates
spec:
  templates:
    - name: daemon-template
      daemon: true
      retryStrategy:
        limit: 3
      container:
        image: redis
        command: [ docker-entrypoint.sh ]
        args: [ "redis-server" ]
        readinessProbe:
          exec:
            command:
              - redis-cli
              - ping
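
For contrast, a minimal sketch of the variant described above as working, where the daemon step calls a template defined inline in the same WorkflowTemplate instead of going through templateRef (the name daemon-ok is illustrative; the daemon template is adapted from general-demo-templates above):

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daemon-ok
spec:
  entrypoint: engine
  templates:
    - name: engine
      steps:
        - - name: set-daemon-srv
            # Direct template call (no templateRef); in this case the
            # daemon pod is stopped when the workflow ends.
            template: daemon-template
        - - name: use-daemon-srv
            template: use-daemon-srv
            arguments:
              parameters:
                - name: srv-ip
                  value: "{{steps.set-daemon-srv.ip}}"
    - name: daemon-template
      daemon: true
      retryStrategy:
        limit: 3
      container:
        image: redis
        command: [ docker-entrypoint.sh ]
        args: [ "redis-server" ]
        readinessProbe:
          exec:
            command:
              - redis-cli
              - ping
    - name: use-daemon-srv
      inputs:
        parameters:
          - name: srv-ip
      container:
        image: docker/whalesay
        command: [ sh, -c ]
        args: [ "echo server ip is: {{inputs.parameters.srv-ip}}" ]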

Logs from the workflow controller

time="2023-09-03T13:32:02.297Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="Node became daemoned" namespace=argo nodeId=daemon-bug-kx56t-4049627799 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=daemon-bug-kx56t-4049627799 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-2849824988 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-2849824988 finished: 2023-09-03 13:32:02.302944725 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="Step group node daemon-bug-kx56t-1751689535 successful" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-1751689535 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.303Z" level=info msg="node daemon-bug-kx56t-1751689535 finished: 2023-09-03 13:32:02.303003172 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.303Z" level=info msg="StepGroup node daemon-bug-kx56t-745282658 initialized Running" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=info msg="SG Outbound nodes of daemon-bug-kx56t-2849824988 are [daemon-bug-kx56t-4049627799]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=info msg="Pod node daemon-bug-kx56t-285748714 initialized Pending" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="Created pod: daemon-bug-kx56t[1].use-daemon-srv (daemon-bug-kx56t-use-daemon-srv-285748714)" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="Workflow step group node daemon-bug-kx56t-745282658 not yet completed" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.341Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10229907 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.329Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.338Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.338Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.339Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=daemon-bug-kx56t-285748714 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.339Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-bug-kx56t-4049627799 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.340Z" level=info msg="SG Outbound nodes of daemon-bug-kx56t-2849824988 are [daemon-bug-kx56t-4049627799]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.341Z" level=info msg="Step group node daemon-bug-kx56t-745282658 successful" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.342Z" level=info msg="node daemon-bug-kx56t-745282658 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.342Z" level=info msg="node daemon-bug-kx56t-745282658 finished: 2023-09-03 13:32:12.342955747 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="Outbound nodes of daemon-bug-kx56t-285748714 is [daemon-bug-kx56t-285748714]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="Outbound nodes of daemon-bug-kx56t is [daemon-bug-kx56t-285748714]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="node daemon-bug-kx56t phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="node daemon-bug-kx56t finished: 2023-09-03 13:32:12.343117518 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Marking workflow completed" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.349Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-bug-kx56t-1340600742-agent/deletePod
time="2023-09-03T13:32:12.349Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/daemon-bug-kx56t--4049627799/terminateContainers
time="2023-09-03T13:32:12.358Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=10229990 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.364Z" level=info msg="archiving workflow" namespace=argo uid=4bf4eadb-7668-4d52-982a-bc015f678798 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.369Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-bug-kx56t-daemon-template-4049627799/labelPodCompleted

Logs from your workflow's wait container

time="2023-09-03T13:31:54.298Z" level=info msg="Starting Workflow Executor" version=v3.5.0-rc1
time="2023-09-03T13:31:54.300Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-03T13:31:54.300Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-bug-kx56t-daemon-template-4049627799 template="{\"name\":\"daemon-template\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"daemon\":true,\"container\":{\"name\":\"\",\"image\":\"344864304324.dkr.ecr.us-east-1.amazonaws.com/nss_argo:redis_org\",\"command\":[\"docker-entrypoint.sh\"],\"args\":[\"redis-server\"],\"resources\":{},\"readinessProbe\":{\"exec\":{\"command\":[\"redis-cli\",\"ping\"]}}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"s3.amazonaws.com\",\"bucket\":\"mobileye-nss-temp\",\"region\":\"us-east-1\",\"insecure\":true,\"key\":\"argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-daemon-template-4049627799\"}},\"retryStrategy\":{\"limit\":3}}" version="&Version{Version:v3.5.0-rc1,BuildDate:2023-08-15T18:44:17Z,GitCommit:6cd3e810e9c953e0a78070f41ab8f7d799fe51d3,GitTag:v3.5.0-rc1,GitTreeState:clean,GoVersion:go1.20.7,Compiler:gc,Platform:linux/amd64,}"
time="2023-09-03T13:31:54.301Z" level=info msg="Starting deadline monitor"
time="2023-09-03T13:32:06.324Z" level=info msg="No output artifacts"
time="2023-09-03T13:32:06.324Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log"
time="2023-09-03T13:32:06.324Z" level=info msg="Creating minio client using IAM role"
time="2023-09-03T13:32:06.324Z" level=info msg="Saving file to s3" bucket=mobileye-nss-temp endpoint=s3.amazonaws.com key=argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="Save artifact" artifactName=main-logs duration=166.673472ms error="<nil>" key=argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-09-03T13:32:06.505Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:sandbox-s3\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-09-03T13:32:06.535Z" level=info msg="Alloc=8376 TotalAlloc=17335 Sys=43389 NumGC=5 Goroutines=12"
time="2023-09-03T13:32:06.535Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
OllieRushton commented 8 months ago

Is there any workaround for this? I've set a large value for ttlStrategy.secondsAfterFailure on the workflow to keep failed pods around for longer, but this causes the daemon pods to run for that length of time as well, which is becoming quite costly.
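
For reference, a minimal sketch of the kind of ttlStrategy setting described above (the field names are real, but the one-day value is illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-bug-
spec:
  entrypoint: engine
  # A large TTL keeps the failed Workflow object around; without a podGC
  # strategy its pods (including the stuck daemon pods) stay as well.
  ttlStrategy:
    secondsAfterFailure: 86400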

OllieRushton commented 7 months ago

For anyone else struggling with this: I realised that I hadn't set a podGC.strategy for my root workflow template, which references the others. Now that I've set this to OnWorkflowCompletion, the daemon pods are being removed as expected.
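
For anyone applying the same workaround, a minimal sketch of that setting on the root template (the rest of the spec is unchanged):

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daemon-bug
spec:
  entrypoint: engine
  # Delete all of the workflow's pods, including daemon pods,
  # once the workflow completes.
  podGC:
    strategy: OnWorkflowCompletion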

agilgur5 commented 7 months ago

#12533 duplicated this, but had some code analysis that might be useful:

I looked into the logs of the workflow-controller and found some clues: the template name is not correctly set for the terminateContainers action.

time="2024-01-17T09:32:09.931Z" level=info msg="cleaning up pod" action=terminateContainers key=default/daemon-nginx-s5ggl--452970571/terminateContainers

I then dug into the code and located the function responsible. It seems the pod name "default/daemon-nginx-s5ggl--452970571" generated by util.GeneratePodName is missing the templateName part, which in this case is nil (hence the double dash in the name).

I'm not sure whether this bug is related to initializeNode, where the TemplateName is set from the orgTmpl; for a step referencing a workflowTemplate, that value is nil.

I wrote a small workflow that reproduces the bug. Everything works fine if I use a template directly instead of referencing a workflowTemplate.

Any thoughts on how to fix this? Maybe resolve the template to get the actual podName at runtime like here?

Paritosh-Anand commented 3 months ago

Is the workaround still working with version 3.5.6? I can confirm that we are still affected even after trying the workaround.