argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

ArtifactGC not executing on `archiveLogs` when specified in Workflow level #13421

Open segues opened 3 months ago

segues commented 3 months ago

What happened? What did you expect to happen?

The artifact GC pod is not created after a workflow is deleted. We also tested this with CronWorkflows. We deploy argo-workflows with Helm, using the following artifact repository configuration:

artifactRepositoryRef:
  artifact-repositories:
    annotations:
      workflows.argoproj.io/default-artifact-repository: default
    default:
      archiveLogs: true
      s3:
        bucket: bucket-name
        endpoint: s3.amazonaws.com
        keyFormat: workflows-logs/{{workflow.name}}/{{workflow.creationTimestamp}}
        useSDKCreds: true

In our use case we set archiveLogs: true and use AWS IRSA to authenticate from the service accounts (workflow and Argo Server). In the workflow defaults we use the following:

workflowDefaults:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 8737
      artifactRepositoryRef:
        configMap: artifact-repositories
        key: default
      serviceAccountName: argo-workflows
      artifactGC:
        strategy: OnWorkflowDeletion
        serviceAccountName: argo-workflows
        forceFinalizerRemoval: true

With the workflow below, both the archived logs and the files saved as artifacts are correctly removed from the S3 bucket.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion # the overall strategy, which can be overridden
    serviceAccountName: argo-workflows
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "hello world" > /tmp/on-completion.txt
            echo "hello world" > /tmp/on-deletion.txt
      outputs:
        artifacts:
          - name: on-completion
            path: /tmp/on-completion.txt
            s3:
              key: on-completion.txt
            artifactGC:
              strategy: OnWorkflowCompletion # overriding the default strategy for this artifact
          - name: on-deletion
            path: /tmp/on-deletion.txt
            s3:
              key: on-deletion.txt

But when we comment out the output artifacts in the workflow template, so that the archived logs are the only thing stored, the logs are not removed from the bucket.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: artifact-gc
  generateName: artifact-gc-
  namespace: workflows
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion # the overall strategy, which can be overridden
    serviceAccountName: argo-workflows
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "hello world" > /tmp/on-completion.txt
            echo "hello world" > /tmp/on-deletion.txt
      # outputs:
        # artifacts:
          # - name: on-completion
          #   path: /tmp/on-completion.txt
          #   s3:
          #     key: workflows-logs/on-completion.txt
          #   artifactGC:
          #     strategy: OnWorkflowCompletion # overriding the default strategy for this artifact
          #     serviceAccountName: argo-workflows
          # - name: on-deletion
          #   path: /tmp/on-deletion.txt
          #   s3:
          #     key: workflows-logs/on-deletion.txt

After this change, the workflow status reports:

artifactGCStatus:
    notSpecified: true

We have found a workaround, which is to add the following metadata to both Workflows and CronWorkflows:

metadata:
    finalizers:
      - workflows.argoproj.io/artifact-gc

After adding it, the logs are deleted from the S3 bucket, but this should not be necessary and the underlying behavior still needs a fix.
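For reference, here is a minimal sketch of how the workaround above might look when applied to the workflow without output artifacts. It assumes the same namespace and service account names used elsewhere in this report, and the placement of the finalizer is simply the metadata snippet above merged into the manifest. We have only confirmed this on v3.5.8.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
  namespace: workflows
  finalizers:
    # Workaround from this report: set the artifact GC finalizer explicitly,
    # because the workflow declares no output artifacts of its own.
    - workflows.argoproj.io/artifact-gc
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion
    serviceAccountName: argo-workflows
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "hello world"
```

With this manifest the only object written to the bucket should be the archived main.log, which is the case that is otherwise left behind.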

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: artifact-gc
  generateName: artifact-gc-
  namespace: workflows
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion # the overall strategy, which can be overridden
    serviceAccountName: argo-workflows
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "hello world" > /tmp/on-completion.txt
            echo "hello world" > /tmp/on-deletion.txt
      outputs:
        artifacts:
          - name: on-completion
            path: /tmp/on-completion.txt
            s3:
              key: workflows-logs/on-completion.txt
            artifactGC:
              strategy: OnWorkflowCompletion # overriding the default strategy for this artifact
              serviceAccountName: argo-workflows
          - name: on-deletion
            path: /tmp/on-deletion.txt
            s3:
              key: workflows-logs/on-deletion.txt

Logs from the workflow controller

kubectl logs -n argo deploy/argo-workflows-workflow-controller | grep artifact-gc
time="2024-07-31T12:48:46.359Z" level=info msg="Processing workflow" Phase= ResourceVersion=1459135652 namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.369Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=0 workflow=artifact-gc
time="2024-07-31T12:48:46.369Z" level=info msg="Updated phase  -> Running" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.369Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.369Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.369Z" level=info msg="Pod node artifact-gc initialized Pending" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.453Z" level=info msg="Created pod: artifact-gc (artifact-gc)" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.453Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.453Z" level=info msg=reconcileAgentPod namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:46.464Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=1459135656 workflow=artifact-gc
time="2024-07-31T12:48:56.436Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1459135656 namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=1 workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg="task-result changed" namespace=workflows nodeID=artifact-gc workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg="node changed" namespace=workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=artifact-gc old.message= old.phase=Pending old.progress=0/1 workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg=reconcileAgentPod namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:56.437Z" level=info msg="Updated phase Running -> Succeeded" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:56.438Z" level=info msg="Marking workflow completed" namespace=workflows workflow=artifact-gc
time="2024-07-31T12:48:56.444Z" level=info msg="cleaning up pod" action=deletePod key=workflows/artifact-gc-1340600742-agent/deletePod
time="2024-07-31T12:48:56.447Z" level=info msg="Workflow update successful" namespace=workflows phase=Succeeded resourceVersion=1459135849 workflow=artifact-gc
time="2024-07-31T12:48:56.464Z" level=info msg="cleaning up pod" action=labelPodCompleted key=workflows/artifact-gc/labelPodCompleted
time="2024-07-31T12:50:32.699Z" level=info msg="reconciling artifact-gc pod" message= namespace=workflows phase=Succeeded pod=hello-world-1722430080-artgc-wfdel-3857365535 workflow=hello-world-1722430080
time="2024-07-31T12:52:33.673Z" level=info msg="reconciling artifact-gc pod" message= namespace=workflows phase=Succeeded pod=hello-world-1722430200-artgc-wfdel-3857365535 workflow=hello-world-1722430200

Logs from your workflow's wait container

kubectl logs -n workflows -c wait -l workflows.argoproj.io/workflow=artifact-gc,workflow.argoproj.io/phase!=Succeeded
time="2024-07-31T12:58:18.288Z" level=info msg="No output artifacts"
time="2024-07-31T12:58:18.288Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: workflows-logs/artifact-gc/2024-07-31T12:58:14Z/main.log"
time="2024-07-31T12:58:18.301Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2024-07-31T12:58:18.365Z" level=info msg="Saving file to s3" bucket=sgs-argo-prod-eu endpoint=s3.amazonaws.com key="workflows-logs/artifact-gc/2024-07-31T12:58:14Z/main.log" path=/tmp/argo/outputs/logs/main.log
time="2024-07-31T12:58:18.455Z" level=info msg="Save artifact" artifactName=main-logs duration=167.080514ms error="<nil>" key="workflows-logs/artifact-gc/2024-07-31T12:58:14Z/main.log"
time="2024-07-31T12:58:18.455Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-07-31T12:58:18.455Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-07-31T12:58:18.471Z" level=info msg="Alloc=10018 TotalAlloc=17754 Sys=24677 NumGC=5 Goroutines=12"
time="2024-07-31T12:58:18.477Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2024-07-31T12:58:18.477Z" level=info msg="Deadline monitor stopped"

agilgur5 commented 3 months ago

This sounds similar if not identical to #13338, cc @juliev0. Also as I wrote there:

Since we don't recommend archive logs in the docs, I'm not sure it makes sense to fix or change this; we may very well remove the archive logs feature entirely.