argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.84k stars 3.17k forks source link

Cannot delete failed Workflow due to ArtifactGC finalizer #13499

Closed chengjoey closed 2 weeks ago

chengjoey commented 2 weeks ago

Pre-requisites

What happened? What did you expect to happen?

run example examples/artifact-gc-workflow.yaml

  1. kubectl create -f examples/artifact-gc-workflow.yaml
  2. because I am using a Linux/arm64 system, the workflow failed.
    kubectl get workflow
    NAME                STATUS   AGE   MESSAGE
    artifact-gc-qzcrv   Failed   10s   Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error
  3. so i decide to delete the workflow, but at this time, deleting the workflow is stuck and cannot be successful get workflow:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
    annotations:
    workflows.argoproj.io/pod-name-format: v2
    creationTimestamp: "2024-08-24T10:08:29Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-08-24T10:13:21Z"
    finalizers:
    - workflows.argoproj.io/artifact-gc
    generateName: artifact-gc-
    generation: 5
    labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Failed
    name: artifact-gc-qzcrv
    namespace: default
    resourceVersion: "8957"
    uid: 4e159dab-2888-4097-a4c3-9ff67758d28d
    spec:
    ...

i found that finalizers still existed, it should be removed

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

run examples/artifact-gc-workflow.yaml on arm mac, and then delete the workflow

Logs from the workflow controller

time="2024-08-24T10:08:40.025Z" level=info msg="node changed" namespace=default new.message="Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error" new.phase=Failed new.progress=0/1 nodeID=artifact-gc-qzcrv old.message= old.phase=Pending old.progress=0/1 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg=reconcileAgentPod namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Updated message  -> Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Marking workflow completed" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Workflow to be dehydrated" Workflow Size=1689
time="2024-08-24T10:08:40.048Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8669 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.069Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/artifact-gc-qzcrv/labelPodCompleted
time="2024-08-24T10:08:45.068Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8669 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:45.069Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:45.069Z" level=info msg="Workflow to be dehydrated" Workflow Size=1715
time="2024-08-24T10:08:45.088Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8680 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:50.093Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8680 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:50.093Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:12:52.475Z" level=info msg="Alloc=5451 TotalAlloc=18959 Sys=24165 NumGC=8 Goroutines=175"
time="2024-08-24T10:13:31.589Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8957 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:31.589Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:31.590Z" level=info msg="Workflow to be dehydrated" Workflow Size=1751
time="2024-08-24T10:13:31.604Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8967 workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:36.610Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8967 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:36.610Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv

Logs from in your workflow's wait container

time="2024-08-24T10:08:31.403Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-08-24T10:08:31.404Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-24T10:08:31.404Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=artifact-gc-qzcrv templateName=main version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:12:26Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/arm64,}"
time="2024-08-24T10:08:31.414Z" level=info msg="Starting deadline monitor"
time="2024-08-24T10:08:32.414Z" level=info msg="Main container completed" error="<nil>"
time="2024-08-24T10:08:32.414Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-24T10:08:32.414Z" level=info msg="No output parameters"
time="2024-08-24T10:08:32.414Z" level=info msg="Saving output artifacts"
time="2024-08-24T10:08:32.415Z" level=info msg="Staging artifact: on-completion"
time="2024-08-24T10:08:32.415Z" level=info msg="Copying /tmp/on-completion.txt from container base image layer to /tmp/argo/outputs/artifacts/on-completion.tgz"
time="2024-08-24T10:08:32.415Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz -> /tmp/argo/outputs/artifacts/on-completion.tgz"
time="2024-08-24T10:08:32.415Z" level=error msg="executor error: open /var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz: no such file or directory"
time="2024-08-24T10:08:32.437Z" level=info msg="Alloc=8902 TotalAlloc=13535 Sys=24165 NumGC=3 Goroutines=8"
time="2024-08-24T10:08:32.456Z" level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz: no such file or directory"
chengjoey commented 2 weeks ago

kubectl get pod artifact-gc-qzcrv -o yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: main
    workflows.argoproj.io/node-id: artifact-gc-qzcrv
    workflows.argoproj.io/node-name: artifact-gc-qzcrv
  creationTimestamp: "2024-08-24T10:08:30Z"
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/workflow: artifact-gc-qzcrv
  name: artifact-gc-qzcrv
  namespace: default

labels[workflows.argoproj.io/completed] is true, so pod can not get by podinformer https://github.com/argoproj/argo-workflows/blob/ddbb3c7ad5b498d50514b3c1158ded56e333d75b/workflow/controller/controller.go#L1232-L1238

This causes anyPodSuccess to always be false

https://github.com/argoproj/argo-workflows/blob/ddbb3c7ad5b498d50514b3c1158ded56e333d75b/workflow/controller/artifact_gc.go#L506-L512

https://github.com/argoproj/argo-workflows/blob/ddbb3c7ad5b498d50514b3c1158ded56e333d75b/workflow/controller/artifact_gc.go#L551

agilgur5 commented 2 weeks ago

See my review comment. This seems like the exact use-case for the forceFinalizerRemoval field

agilgur5 commented 2 weeks ago

Looks like this might be caused by failure of TaskResult reconciliation (so ArtifactGC didn't run yet), and so would have a duplicate root cause of #12993 and fixed by #13454. See my new comment on the PR

agilgur5 commented 2 weeks ago

Try running your Controller image with :latest, which will include #13454

chengjoey commented 2 weeks ago

I have tried it. In fact, I have debugged it locally with the latest code and it cannot be deleted successfully unless force is used. However, I don't think it is necessary to use force in this scenario. Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.

https://github.com/argoproj/argo-workflows/blob/ddbb3c7ad5b498d50514b3c1158ded56e333d75b/workflow/controller/artifact_gc.go#L551

agilgur5 commented 2 weeks ago

To clarify, so in this scenario, no ArtifactGC Pods were launched? Since no artifacts were created

Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.

Ah, I see what you mean, thanks for elaborating!

Yes, if all artifacts were already deleted, then I think this check is just to make sure that there are no failed ArtifactGC Pods laying around? 🤔 cc @juliev0

juliev0 commented 2 weeks ago

Hey all. This is really interesting. You've definitely happened upon at least one bug. But you know, while the LabelKeyCompleted label is being set to "false", it seems I never actually set it to "true" or deleted it, so it shouldn't actually be bypassed by that Informer. Do you agree?

(in which case, we can decide if it should even have that label)

juliev0 commented 2 weeks ago

Looking at the anyPodSuccess code, I think it was probably just for the purpose of eliminating unnecessary work by the way, to avoid doing the woc.allArtifactsDeleted() check if none of the Artifact GC Pods had even finished and succeeded yet.

juliev0 commented 2 weeks ago

Now I'm seeing in your PR that you mention that the Workflow failed due to not being able to run the image. Okay, let me respond over there...

So, it seems like root cause is really that you never had any ArtifactGC Pods to begin with, right? And the anyPodSuccess code is written with the expectation that if you have ArtifactGC configured in the Workflow, you'll have at least one Artifact that gets GC'ed.