argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.07k stars 3.2k forks source link

ArtifactGC fails if output artifact with non-existent file is configured with `optional: true` #13583

Closed encigem closed 1 month ago

encigem commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

What happened?:

What did I expect to happen:

Version(s)

v3.4.10, v3.5.10, latest(sha256:4f03ff7ecaef4061dddd2c08f80de4d766b253aa3a57a87e69dd3a797bb42b1e)

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-artgc-
  namespace: argo
spec:
  entrypoint: main
  activeDeadlineSeconds: 20

  volumes:
    - name: emptydir
      emptyDir: {}

  templates:
    - name: main
      dag:
        tasks:
          - name: prepare-data
            template: prepare-data
            arguments:
              parameters: 
              - name: index
                value: "{{item}}"
            withParam: '[1, 2, 3]'

    - name: prepare-data
      inputs:
        parameters:
          - name: index
          - name: image
            value: "alpine:latest"
      outputs:
        artifacts:
          # Comment out the 'late-artifact' item and re-run to get a
          # successful artifact garbage collection pod execution
          - name: late-artifact-{{inputs.parameters.index}}
            path: "/empty-dir/late-artifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true  # removing this will result in no ArtGC pod being run
          - name: abcartifact-{{inputs.parameters.index}}
            path: "/empty-dir/abcartifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true
          - name: 123artifact-{{inputs.parameters.index}}
            path: "/empty-dir/123artifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true
      container:
        image: "{{inputs.parameters.image}}"
        volumeMounts:
          - name: emptydir
            mountPath: /empty-dir
        workingDir: /empty-dir
        command: ["sh", "-c"]
        args:
          - |

            set -x

            echo Hi > abcartifact.txt
            echo Hi > 123artifact.txt

            # cause timeout
            sleep 20

            echo Hi > late-artifact.txt

Logs from the workflow controller

$  kubectl logs -n argo deploy/workflow-controller | grep test-artgc-pv7h2

time="2024-09-09T15:50:26.593Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=209475215 namespace=argo workflow=test-artgc-pv7h2
time="2024-09-09T15:50:26.594Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-artgc-pv7h2
time="2024-09-09T15:50:36.472Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=209475215 namespace=argo workflow=test-artgc-pv7h2
time="2024-09-09T15:50:36.472Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-artgc-pv7h2

Logs from in your workflow's wait container

$  kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=test-artgc-pv7h2,workflow.argoproj.io/phase!=Succeeded

error: container wait is not valid for pod test-artgc-pv7h2-artgc-wfcomp-2166136261
encigem commented 2 months ago

More info:

Logs of succeeded ArtGC pod, when non-existent artifact is commented out:

$ kubectl logs -n argo test-artgc-mqhjs-artgc-wfcomp-2166136261
time="2024-09-10T07:02:51.393Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz"
time="2024-09-10T07:02:51.393Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.393Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz
time="2024-09-10T07:02:51.479Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz"
time="2024-09-10T07:02:51.479Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.479Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz
time="2024-09-10T07:02:51.483Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz"
time="2024-09-10T07:02:51.483Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.484Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz
time="2024-09-10T07:02:51.487Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz"
time="2024-09-10T07:02:51.487Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.487Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz
time="2024-09-10T07:02:51.492Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz"
time="2024-09-10T07:02:51.492Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.492Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz
time="2024-09-10T07:02:51.495Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz"
time="2024-09-10T07:02:51.495Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.495Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz

Logs of failed ArtGC pod when WorkFlow fails due to timeout and non-existent artifact is included in the WF YAML:

$ kubectl logs -n argo test-artgc-pv7h2-artgc-wfcomp-2166136261
Error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/
You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/
agilgur5 commented 2 months ago

Hmm, I thought this would have been fixed by https://github.com/argoproj/argo-workflows/pull/13066, sounds like the optional: true case wasn't covered? I don't think that PR even considers it since if it's not there, there's nothing to delete (as you said as well) 🤔 cc @juliev0

juliev0 commented 2 months ago

I have to admit that I wasn't aware of the optional parameter until now. :)

Now I see what you're saying, and see that my PR has some issues both with:

  1. prematurely returning when there's any single artifact which errors out instead of accounting for the other output artifacts, and
  2. not accounting for the Optional artifact case

I assigned it to myself to address sometime soon. (Or otherwise let me know if you'd like to work on it @encigem )

agilgur5 commented 2 months ago
  1. Yes this is closely related to https://github.com/argoproj/argo-workflows/pull/13066#discussion_r1633634926 -- this general issue would be handled by parallelization (https://github.com/argoproj/argo-workflows/pull/11768 / https://github.com/argoproj/argo-workflows/issues/12442) as they'd no longer be handled serially, so I'd probably just recommend rewriting it in parallel
encigem commented 2 months ago

I'm afraid w.r.t ArgoWF, my skills are for finding the bugs, but not fixing them :) Please go ahead as assignee @juliev0. Thank you! 👍

juliev0 commented 2 months ago

You know, I'm thinking about this comment you wrote after I wrote up that PR, @agilgur5.

If I did this (i.e. just ignoring the error on the Artifact GC side) and basically reverted my previous PR, then the WorkflowTaskResult would go back to just including all artifacts, whether or not they're there, whether or not they're optional. So, we may some of the time create an ArtifactGC Pod which essentially does nothing, and that seems okay I guess.

What do you think?

agilgur5 commented 2 months ago

Ah I thought I had a more specific comment on this exact scenario, thanks for finding it!

I still think parallelized deletion and saving would be more optimal and would force us to properly handle these scenarios instead of a premature return.

Although if you're looking for a quick fix, yes, that sounds like it could handle this scenario

juliev0 commented 1 month ago

Just started to work on this and realized that if I do revert the earlier change I made, it means that all artifacts would be included in the WorkflowTaskResult, which would mean that during Artifact GC, even if the pod does attempt deletion of all artifacts, it will still have an error trying to delete some of them. And without the ForceFinalizerRemoval flag set, then the Workflow's Finalizer will still be on, and deletion will be deemed to have failed.

For both Optional and non-Optional artifacts, it seems that we only want to attempt deletion of whatever exists and we don't want to fail Artifact GC just because we're trying to delete some artifact that doesn't exist. If it was a non-Optional artifact, we will have Failed the Workflow itself, but that doesn't mean we should also fail Artifact GC.

Therefore, I'm thinking of instead maintaining the notion that a WorkflowTaskResult only includes the artifacts that were successfully written, but fixing the logic for it based on the problems identified by @encigem.

@agilgur5 @encigem feel free to differ with anything I'm saying here if I'm mistaken, thanks.

agilgur5 commented 1 month ago

Thanks for going through the logic! I agree, that sounds like the most correct way to handle it.