argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

`no artifact logs are available` when workflow is archived but still live #12948

Open liudongqing opened 5 months ago

liudongqing commented 5 months ago

What happened/what did you expect to happen?

We just upgraded Argo Workflows from 3.4.4 to 3.5.5. We have the workflow archive enabled:

persistence:
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
    # save the entire workflow into etcd and DB
    nodeStatusOffLoad: true
    # enable archiving of old workflows
    archive: true
    postgresql:

but we didn't enable log archiving:

artifactRepository:
  # -- Archive the main container logs as an artifact
  archiveLogs: false

Before the upgrade, we could see the logs of finished workflows (whether succeeded or failed) from the UI (the server gets the logs from the pod, I guess), but after the upgrade, the UI complains "no artifact logs are available" and no logs are returned.

Is this the expected result, or is there a configuration option that controls this behavior?

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow.

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

agilgur5 commented 5 months ago

from the UI (the server gets the logs from the pod, I guess)

Correct, it retrieves Pod logs.

but after the upgrade, the UI complains "no artifact logs are available" and no logs are returned.

I'm not sure this is related to the upgrade. Did you change your configuration before or after it?

An Archived Workflow is typically a deleted Workflow, therefore there are no Pods for it to retrieve logs from. So if you want logs for deleted Pods, you can either link to a log provider or use artifact logs. You don't have artifact logs, so the error message certainly sounds correct.

liudongqing commented 5 months ago

An Archived Workflow is typically a deleted Workflow, therefore there are no Pods for it to retrieve logs from. So if you want logs for deleted Pods, you can either link to a log provider or use artifact logs. You don't have artifact logs, so the error message certainly sounds correct.

We didn't change any configuration during the upgrade; the only change was the image tag, from "v3.4.4" to "v3.5.5". The problem is that a workflow is archived as soon as it finishes, so we have no chance to check the logs even if it failed just 1 minute earlier. After enabling artifact logs, we can see the logs now.

Is it correct for a finished workflow to be archived immediately?

agilgur5 commented 5 months ago

Is it correct for a finished workflow to be archived immediately?

A Workflow is labeled for archiving when it completes, and when that label is detected, archiving is kicked off.

That is generally independent of deletion, however, which is based on your TTL or retentionPolicy. It sounds like you have a longer TTL potentially, and so you have Workflows that are simultaneously in the archive and still in the cluster? In that case, the pod logs should still be retrievable.

I think I see the issue here: it's probably not falling back to Pod logs properly in 3.5.

3.5 unified the Archived + Live UI into one page (#11121) so there is no distinction now in the UI. In particular, this line would previously only be triggered if you were navigating archived workflows specifically, but now it can be triggered on a live workflow that is also archived. The comment above that line is not quite correct in your case
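
To make the suspected behavior concrete, here is a minimal sketch of the two selection strategies discussed above. It is an illustrative model only, not the actual Argo Workflows code: `WorkflowState`, `LogSource`, `pick_log_source_archived_first`, and `pick_log_source_with_fallback` are hypothetical names, and the "archived first" function only models the behavior as described in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class LogSource(Enum):
    POD = "pod"
    ARTIFACT = "artifact"
    NONE = "none"  # corresponds to "no artifact logs are available"


@dataclass
class WorkflowState:
    """Hypothetical summary of a completed workflow (not an Argo type)."""
    archived: bool           # a copy exists in the workflow archive
    pods_exist: bool         # the workflow's pods have not been deleted yet
    has_artifact_logs: bool  # logs were archived (archiveLogs: true)


def pick_log_source_archived_first(state: WorkflowState) -> LogSource:
    """Models the behavior described in this thread: once a workflow is
    archived, only artifact logs are considered, even if pods still exist."""
    if state.archived:
        return LogSource.ARTIFACT if state.has_artifact_logs else LogSource.NONE
    return LogSource.POD if state.pods_exist else LogSource.NONE


def pick_log_source_with_fallback(state: WorkflowState) -> LogSource:
    """One possible fix (an assumption): use pod logs while the pods exist,
    and fall back to artifact logs only after that."""
    if state.pods_exist:
        return LogSource.POD
    if state.has_artifact_logs:
        return LogSource.ARTIFACT
    return LogSource.NONE


# The reporter's case: archived, still live, archiveLogs disabled.
reported = WorkflowState(archived=True, pods_exist=True, has_artifact_logs=False)
print(pick_log_source_archived_first(reported).value)
print(pick_log_source_with_fallback(reported).value)
```

In this model, the two functions differ only for workflows that are archived while their pods still exist, which is the case reported here.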

y-elip commented 3 months ago

@agilgur5 Hello, any idea when this degradation will be fixed? It is preventing us from upgrading to a newer version of Argo Workflows, because having access to the logs of completed or failed workflows is an important part of our daily routine.

agilgur5 commented 3 months ago

Hello, any idea when this degradation will be fixed?

No, any updates would be in the thread. PRs welcome.

having access to the logs of completed or failed workflows is an important part of our daily routine

To be clear, this only affects users of Archived Workflows with long Workflow or podGC TTLs. If you're not using Archived Workflows or have short TTLs, this doesn't affect you.

miltalex commented 3 months ago

I will have a look and check whether I can prepare a PR with a fix.

miltalex commented 2 months ago

Could I ask for some example configuration or a way to reproduce the issue consistently? I tried using archived workflows with different TTL values and strategies without much success and I feel some of my settings might be different from the ones that produce the above bug.

y-elip commented 2 months ago

Sure. We are using Helm chart 0.41.11 for argo-wf version 3.5.8:

persistence:
  archive: true
  postgresql:
    <postgresql related block>
controller:
    workflowDefaults:
      spec:
        ttlStrategy:
          secondsAfterSuccess: 432000
          secondsAfterFailure: 864000
          secondsAfterCompletion: 432000

y-elip commented 2 months ago

I also forgot to mention this important part of the configuration:

artifactRepository:
    archiveLogs: false
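
For anyone trying to reproduce this, it may help to see how long these TTLs keep a finished workflow in the cluster. The snippet below is only a unit conversion of the `ttlStrategy` values quoted above; the helper name `ttl_in_days` is made up for illustration, and it makes no claim about which of the three fields the controller applies in a given case.

```python
from datetime import timedelta

# Values copied from the ttlStrategy block posted above.
ttl_strategy = {
    "secondsAfterSuccess": 432000,
    "secondsAfterFailure": 864000,
    "secondsAfterCompletion": 432000,
}


def ttl_in_days(seconds: int) -> float:
    """Convert a TTL given in seconds to days."""
    return timedelta(seconds=seconds) / timedelta(days=1)


for key, seconds in ttl_strategy.items():
    print(f"{key}: {ttl_in_days(seconds):g} days")
```

With TTLs of several days, a completed workflow stays in the cluster long after it has been archived. Together with `archiveLogs: false`, that matches the "archived but still live" condition described earlier in the thread.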