Closed encigem closed 1 month ago
More info:
optional: true config, none of the output artifacts are cleaned up.
Logs of succeeded ArtGC pod, when non-existent artifact is commented out:
$ kubectl logs -n argo test-artgc-mqhjs-artgc-wfcomp-2166136261
time="2024-09-10T07:02:51.393Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz"
time="2024-09-10T07:02:51.393Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.393Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz
time="2024-09-10T07:02:51.479Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz"
time="2024-09-10T07:02:51.479Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.479Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz
time="2024-09-10T07:02:51.483Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz"
time="2024-09-10T07:02:51.483Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.484Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz
time="2024-09-10T07:02:51.487Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz"
time="2024-09-10T07:02:51.487Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.487Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz
time="2024-09-10T07:02:51.492Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz"
time="2024-09-10T07:02:51.492Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.492Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz
time="2024-09-10T07:02:51.495Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz"
time="2024-09-10T07:02:51.495Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.495Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz
Logs of failed ArtGC pod when the Workflow fails due to timeout and a non-existent artifact is included in the WF YAML:
$ kubectl logs -n argo test-artgc-pv7h2-artgc-wfcomp-2166136261
Error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/
You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/
Hmm, I thought this would have been fixed by https://github.com/argoproj/argo-workflows/pull/13066, sounds like the optional: true case wasn't covered? I don't think that PR even considers it since if it's not there, there's nothing to delete (as you said as well) 🤔 cc @juliev0
I have to admit that I wasn't aware of the optional parameter until now. :)
Now I see what you're saying, and see that my PR has some issues both with:
I assigned it to myself to address sometime soon. (Or otherwise let me know if you'd like to work on it @encigem )
I'm afraid w.r.t ArgoWF, my skills are for finding the bugs, but not fixing them :) Please go ahead as assignee @juliev0. Thank you! 👍
You know, I'm thinking about this comment you wrote after I wrote up that PR, @agilgur5.
If I did this (i.e. just ignoring the error on the Artifact GC side) and basically reverted my previous PR, then the WorkflowTaskResult would go back to just including all artifacts, whether or not they're there, whether or not they're optional. So, we may sometimes create an ArtifactGC Pod which essentially does nothing, and that seems okay I guess.
What do you think?
Ah I thought I had a more specific comment on this exact scenario, thanks for finding it!
I still think parallelized deletion and saving would be more optimal and would force us to properly handle these scenarios instead of a premature return.
Although if you're looking for a quick fix, yes, that sounds like it could handle this scenario
Just started to work on this and realized that if I do revert the earlier change I made, it means that all artifacts would be included in the WorkflowTaskResult, which would mean that during Artifact GC, even if the pod does attempt deletion of all artifacts, it will still have an error trying to delete some of them. And without the ForceFinalizerRemoval flag set, the Workflow's Finalizer will still be on, and deletion will be deemed to have failed.
For both Optional and non-Optional artifacts, it seems that we only want to attempt deletion of whatever exists and we don't want to fail Artifact GC just because we're trying to delete some artifact that doesn't exist. If it was a non-Optional artifact, we will have Failed the Workflow itself, but that doesn't mean we should also fail Artifact GC.
Therefore, I'm thinking of instead maintaining the notion that a WorkflowTaskResult only includes the artifacts that were successfully written, but fixing the logic for it based on the problems identified by @encigem.
@agilgur5 @encigem feel free to differ with anything I'm saying here if I'm mistaken, thanks.
Thanks for going through the logic! I agree, that sounds like the most correct way to handle it.
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
What happened?:
optional: true parameter, then the ArtGC pod which runs after the steps are terminated fails with the message:
Error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argoproj.github.io/argo-workflows/configure-artifact-repository/
optional: true config from the offending artifact and re-run, then no artifact GC pod is created when the workflow terminates from deadline exceeded.
Error: You need to configure artifact storage messages appear and the other artifacts are cleaned as expected when timeout occurs.
What did I expect to happen:
Version(s)
v3.4.10, v3.5.10, latest (sha256:4f03ff7ecaef4061dddd2c08f80de4d766b253aa3a57a87e69dd3a797bb42b1e)
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container