argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.11k stars 3.21k forks source link

UI doesn't show archived logs for containerSet steps #13785

Open static-moonlight opened 1 month ago

static-moonlight commented 1 month ago

Pre-requisites

What happened? What did you expect to happen?

I am currently looking into ways to optimize the duration of my workflows. And I am looking at containerSet as well. However, I noticed that accessing logs doesn't work well here. Except for a few seconds after the task was finished, the logs are always empty.

image

But the logs are there, sitting in the artifact repository, waiting to be looked at.

multi-step-wt7dp/
multi-step-wt7dp/multi-step-wt7dp-multi-step-2322966654/
multi-step-wt7dp/multi-step-wt7dp-multi-step-2322966654/step-1.log
multi-step-wt7dp/multi-step-wt7dp-multi-step-2322966654/step-2.log
multi-step-wt7dp/multi-step-wt7dp-multi-step-2322966654/step-3.log

If I click on the link, below the log area (in the UI), I will see them

image

And yes, I know that the UI specifically states "Your pod GC settings will delete pods and their logs immediately on completion. Logs may not appear for pods that are deleted.".

However, this doesn't seem to be an issue for any other workflow, and using the podGC strategy OnPodCompletion is pretty much standard for all my workflows. Simply put: despite what the UI says, it's usually working anyway, except for containerSet tasks.

Also consider this: Other people (like operators) are supposed to work with this system as well. If they have difficulties to read the logs, or if there is some extra step necessary, which also breaks the UI experience (the logs are not opened in a new tab or something, they will be opened in the same tab the Argo server UI is currently running in, basically meaning, they will leave the Argo UI to see the logs), this will decrease their general acceptance when I put them in front of a new system like Argo workflows.

Steps to reproduce:

  1. Deploy the workflow template (artifact repository required)
  2. Run the workflow
  3. Wait a minute
  4. Try to access the logs in the UI

TLDR: The logs are there, Argo can usually show them, even if the pods was already removed, so why does that not work with containerSet?

Version(s)

v.3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: multi-step
  namespace: test
spec:
  entrypoint: entrypoint
  serviceAccountName: test
  artifactGC:
    serviceAccountName: test
    strategy: OnWorkflowDeletion
    forceFinalizerRemoval: true
  podGC:
    strategy: OnPodCompletion
  archiveLogs: true
  activeDeadlineSeconds: 30
  ttlStrategy:
    secondsAfterSuccess: 300
    secondsAfterFailure: 300
  templates:
    - name: entrypoint
      dag:
        tasks:
          - name: multi-step
            template: multi-step
    - name: multi-step
      containerSet:
        containers:
          - name: step-1
            image: docker.io/busybox:1.36
            command: [echo,"step 1"]
          - name: step-2
            dependencies: [step-1]
            image: docker.io/busybox:1.36
            command: [echo,"step 2"]
          - name: step-3
            dependencies: [step-2]
            image: docker.io/busybox:1.36
            command: [echo,"step 3"]

Logs from the workflow controller

time="2024-10-18T11:03:08.780Z" level=info msg="Processing workflow" Phase= ResourceVersion=459976086 namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=info msg="Task-result reconciliation" namespace=test numObjs=0 workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=info msg="Updated phase  -> Running" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=info msg="DAG node multi-step-wt7dp initialized Running" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=warning msg="was unable to obtain the node for multi-step-wt7dp-2322966654, taskName multi-step"
time="2024-10-18T11:03:08.793Z" level=warning msg="was unable to obtain the node for multi-step-wt7dp-2322966654, taskName multi-step"
time="2024-10-18T11:03:08.793Z" level=info msg="All of node multi-step-wt7dp.multi-step dependencies [] completed" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.793Z" level=info msg="Pod node multi-step-wt7dp-2322966654 initialized Pending" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg="Created pod: multi-step-wt7dp.multi-step (multi-step-wt7dp-multi-step-2322966654)" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg="Container node multi-step-wt7dp-4212253844 initialized Pending" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg="Container node multi-step-wt7dp-4262586701 initialized Pending" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg="Container node multi-step-wt7dp-4245809082 initialized Pending" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg="TaskSet Reconciliation" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.801Z" level=info msg=reconcileAgentPod namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:08.812Z" level=info msg="Workflow update successful" namespace=test phase=Running resourceVersion=459976091 workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.802Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=459976091 namespace=test workflow=multi-step-wt7d
time="2024-10-18T11:03:14.803Z" level=info msg="Task-result reconciliation" namespace=test numObjs=1 workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="task-result changed" namespace=test nodeID=multi-step-wt7dp-2322966654 workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4212253844 phase Pending -> Succeeded" namespace=test workflow=multi-step-wt7d
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4212253844 finished: 2024-10-18 11:03:14.803299142 +0000 UTC" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4262586701 phase Pending -> Succeeded" namespace=test workflow=multi-step-wt7d
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4262586701 finished: 2024-10-18 11:03:14.803319422 +0000 UTC" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4245809082 phase Pending -> Succeeded" namespace=test workflow=multi-step-wt7d
time="2024-10-18T11:03:14.803Z" level=info msg="node multi-step-wt7dp-4245809082 finished: 2024-10-18 11:03:14.803335402 +0000 UTC" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="node changed" namespace=test new.message= new.phase=Running new.progress=0/1 nodeID=multi-step-wt7dp-2322966654 old.message= old.phase=Pending old.progress=0/1 workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg="TaskSet Reconciliation" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.803Z" level=info msg=reconcileAgentPod namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.809Z" level=info msg="cleaning up pod" action=terminateContainers key=test/multi-step-wt7dp-multi-step-2322966654/terminateContainers
time="2024-10-18T11:03:14.809Z" level=info msg="https://[...]:443/api/v1/namespaces/test/pods/multi-step-wt7dp-multi-step-2322966654/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-10-18T11:03:14.814Z" level=info msg="Workflow update successful" namespace=test phase=Running resourceVersion=459976476 workflow=multi-step-wt7dp
time="2024-10-18T11:03:14.827Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=test pod=multi-step-wt7dp-multi-step-2322966654 stderr="<nil>" stdout="<nil>"
time="2024-10-18T11:03:20.816Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=459976476 namespace=test workflow=multi-step-wt7d
time="2024-10-18T11:03:20.816Z" level=info msg="Task-result reconciliation" namespace=test numObjs=1 workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.816Z" level=info msg="node changed" namespace=test new.message= new.phase=Succeeded new.progress=0/1 nodeID=multi-step-wt7dp-2322966654 old.message= old.phase=Running old.progress=0/1 workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="Outbound nodes of multi-step-wt7dp set to [multi-step-wt7dp-4245809082]" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="node multi-step-wt7dp phase Running -> Succeeded" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="node multi-step-wt7dp finished: 2024-10-18 11:03:20.817160127 +0000 UTC" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="TaskSet Reconciliation" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg=reconcileAgentPod namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="Updated phase Running -> Succeeded" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.817Z" level=info msg="Marking workflow completed" namespace=test workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.825Z" level=info msg="Workflow update successful" namespace=test phase=Succeeded resourceVersion=459976763 workflow=multi-step-wt7dp
time="2024-10-18T11:03:20.826Z" level=info msg="Queueing Succeeded workflow test/multi-step-wt7dp for delete in 5m0s due to TTL"
time="2024-10-18T11:03:25.833Z" level=info msg="cleaning up pod" action=deletePod key=test/multi-step-wt7dp-multi-step-2322966654/deletePod
time="2024-10-18T11:03:44.827Z" level=info msg="cleaning up pod" action=killContainers key=test/multi-step-wt7dp-multi-step-2322966654/killContainers

Logs from in your workflow's wait container

No resources found in [...] namespace.
shuangkun commented 1 month ago

I tried it and found that this problem does exist. Log access is normal when the pod is not deleted. After cleaning up the pod, the log on minio should be accessed. I will locate the cause when I have time.

shuangkun commented 2 weeks ago

Will fixed by https://github.com/argoproj/argo-workflows/pull/13597

shuangkun commented 2 weeks ago

The root cause is that the archived log file name of ContainerSet is different from that of other containers. Containerset uses container name, while others use main.