Closed chengjoey closed 2 weeks ago
kubectl get pod artifact-gc-qzcrv -o yaml:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: main
    workflows.argoproj.io/node-id: artifact-gc-qzcrv
    workflows.argoproj.io/node-name: artifact-gc-qzcrv
  creationTimestamp: "2024-08-24T10:08:30Z"
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/workflow: artifact-gc-qzcrv
  name: artifact-gc-qzcrv
  namespace: default
labels[workflows.argoproj.io/completed] is true, so the pod cannot be retrieved by the pod informer:
https://github.com/argoproj/argo-workflows/blob/ddbb3c7ad5b498d50514b3c1158ded56e333d75b/workflow/controller/controller.go#L1232-L1238
This causes anyPodSuccess to always be false
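As a rough illustration of the behaviour described above, here is a minimal, self-contained Go sketch. It does not use the real Kubernetes or Argo APIs: the `pod` type and the `visibleToInformer` and `anyPodSuccess` helpers are hypothetical, and the filter rule (pods labelled completed="true" are hidden from the informer) is an assumption taken from this report rather than from the controller source.

```go
package main

import "fmt"

// completedLabel is the label key shown in the pod YAML above.
const completedLabel = "workflows.argoproj.io/completed"

// pod is a hypothetical, simplified stand-in for a Kubernetes Pod.
type pod struct {
	name      string
	labels    map[string]string
	succeeded bool
}

// visibleToInformer models a label-filtered informer. Assumption: pods
// labelled completed="true" never reach the informer's cache.
func visibleToInformer(pods []pod) []pod {
	var out []pod
	for _, p := range pods {
		if p.labels[completedLabel] == "true" {
			continue
		}
		out = append(out, p)
	}
	return out
}

// anyPodSuccess reports whether at least one of the given pods succeeded.
func anyPodSuccess(pods []pod) bool {
	for _, p := range pods {
		if p.succeeded {
			return true
		}
	}
	return false
}

func main() {
	all := []pod{{
		name:      "artifact-gc-qzcrv",
		labels:    map[string]string{completedLabel: "true"},
		succeeded: true,
	}}
	seen := visibleToInformer(all)
	fmt.Println(len(seen))           // prints 0
	fmt.Println(anyPodSuccess(seen)) // prints false
}
```

Under these assumptions, a pod that succeeded is still invisible to the check, so the result is false regardless of the pod's actual outcome.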
See my review comment. This seems like the exact use-case for the forceFinalizerRemoval field
Looks like this might be caused by failure of TaskResult reconciliation (so ArtifactGC didn't run yet), and so would have a duplicate root cause of #12993 and fixed by #13454. See my new comment on the PR
Try running your Controller image with :latest, which will include #13454
I have tried it. In fact, I have debugged it locally with the latest code and it cannot be deleted successfully unless force is used. However, I don't think it is necessary to use force in this scenario. Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.
To clarify, so in this scenario, no ArtifactGC Pods were launched? Since no artifacts were created
Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.
Ah, I see what you mean, thanks for elaborating!
Yes, if all artifacts were already deleted, then I think this check is just to make sure that there are no failed ArtifactGC Pods laying around? 🤔 cc @juliev0
Hey all. This is really interesting. You've definitely happened upon at least one bug. But you know, while the LabelKeyCompleted label is being set to "false", it seems I never actually set it to "true" or deleted it, so it shouldn't actually be bypassed by that Informer. Do you agree?
(in which case, we can decide if it should even have that label)
By the way, looking at the anyPodSuccess code, I think it was probably just there to eliminate unnecessary work: to avoid doing the woc.allArtifactsDeleted() check if none of the Artifact GC Pods had even finished and succeeded yet.
Now I'm seeing in your PR that you mention that the Workflow failed due to not being able to run the image. Okay, let me respond over there...
So, it seems like the root cause is really that you never had any ArtifactGC Pods to begin with, right? And the anyPodSuccess code is written with the expectation that if you have ArtifactGC configured in the Workflow, you'll have at least one Artifact that gets GC'ed.
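The disagreement above can be sketched in a few lines of Go. This is not the controller's code: `gcPod`, `canRemoveFinalizerCurrent`, and `canRemoveFinalizerProposed` are hypothetical names, and the two functions only model the check as described in this thread (finalizer removal gated on anyPodSuccess) and the suggested alternative (zero ArtifactGC Pods treated the same as a success).

```go
package main

import "fmt"

// gcPod is a hypothetical, simplified view of an ArtifactGC pod.
type gcPod struct {
	succeeded bool
}

// anySucceeded reports whether at least one GC pod succeeded.
func anySucceeded(pods []gcPod) bool {
	for _, p := range pods {
		if p.succeeded {
			return true
		}
	}
	return false
}

// canRemoveFinalizerCurrent models the behaviour described in the thread:
// artifacts are only checked once some GC pod has succeeded.
func canRemoveFinalizerCurrent(pods []gcPod, allArtifactsDeleted bool) bool {
	return anySucceeded(pods) && allArtifactsDeleted
}

// canRemoveFinalizerProposed models the suggestion that having zero GC
// pods should be treated as equivalent to anyPodSuccess == true.
func canRemoveFinalizerProposed(pods []gcPod, allArtifactsDeleted bool) bool {
	return (len(pods) == 0 || anySucceeded(pods)) && allArtifactsDeleted
}

func main() {
	// The reported scenario: no GC pods, nothing left to delete.
	var noPods []gcPod
	fmt.Println(canRemoveFinalizerCurrent(noPods, true))  // prints false
	fmt.Println(canRemoveFinalizerProposed(noPods, true)) // prints true
}
```

In this model a failed-only set of pods still blocks removal in both variants, which matches the concern about failed ArtifactGC Pods raised earlier in the thread.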
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
Run the example examples/artifact-gc-workflow.yaml:
kubectl create -f examples/artifact-gc-workflow.yaml
I found that the finalizers still existed; they should have been removed.
Version(s)
v3.5.10
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Run examples/artifact-gc-workflow.yaml on an ARM Mac, and then delete the workflow.
Logs from the workflow controller
Logs from your workflow's wait container