argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

cannot delete workflow with artifacts #10192

Closed pockyhe closed 1 year ago

pockyhe commented 1 year ago

Pre-requisites

What happened/what you expected to happen?

I enabled artifacts with Azure Blob Storage as the artifact repository, and I wrote workflows that use artifacts to pass parameters. However, I found that some of these workflows cannot be deleted, whether from the Argo UI or with the argo delete command. Even after I upgraded Argo Workflows, all workflow records were cleaned up except for these workflows. I don't know how to delete them.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

none

Logs from the workflow controller

none

Logs from in your workflow's wait container

none
pockyhe commented 1 year ago

While upgrading Argo Workflows, I found that the crd workflows.argoproj.io keeps a Terminating status in k8s.

sarabala1979 commented 1 year ago

@pockyhe Do you have any finalizer or webhook that will prevent the deletion? Can you provide more information like full workflow manifest and controller log?

pockyhe commented 1 year ago

I don't have any finalizer or webhook. I didn't find many useful logs.

In the workflow-server:

time="2022-12-09T02:31:00.716Z" level=info duration=37.926876ms method=DELETE path=/api/v1/workflows/argo-main/acquire-token-jenny size=2 status=0

In the workflow-controller:

time="2022-12-09T02:45:00.121Z" level=info msg="Processing workflow" namespace=argo-main workflow=acquire-token-jenny
time="2022-12-09T02:45:00.121Z" level=info msg="Checking daemoned children of " namespace=argo-main workflow=acquire-token-jenny

juliev0 commented 1 year ago

Please provide a Workflow so we can reproduce this issue. I am wondering if it has to do with the new ArtifactGC feature. What that feature does is determine if your Workflow is using Artifact GC, and if so it adds a finalizer to the Workflow to prevent it from being deleted until the Artifact GC has occurred. If for some reason the Controller thinks you have Artifact GC configured but then can't delete the artifacts, this could occur.
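
For reference, a quick way to check whether that finalizer is what is blocking deletion (a rough sketch; the workflow name and namespace below are placeholders):

kubectl get wf my-workflow -n argo -o jsonpath='{.metadata.finalizers}{"\n"}'
# if Artifact GC is still pending, this should print something like ["workflows.argoproj.io/artifact-gc"]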

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

leojeb commented 1 year ago

Please provide a Workflow so we can reproduce this issue. I am wondering if it has to do with the new ArtifactGC feature. What that feature does is determine if your Workflow is using Artifact GC, and if so it adds a finalizer to the Workflow to prevent it from being deleted until the Artifact GC has occurred. If for some reason the Controller thinks you have Artifact GC configured but then can't delete the artifacts, this could occur.

That is likely to be the case. I encountered the same situation: I set workflow-level ArtifactGC to OnWorkflowDeletion in some workflows, and those were the wfs that I couldn't delete by any means, while workflow deletion worked fine for wfs that didn't have any ArtifactGC settings specified.

leojeb commented 1 year ago

While upgrading Argo Workflows, I found that the crd workflows.argoproj.io keeps a Terminating status in k8s.

If you specified ArtifactGC settings in those wfs, you could try editing those wf resources. I followed the steps below, after which you should be able to delete the wfs normally, but you will need to do this for every wf that has the issue:

1. kubectl get wf -n argo
2. kubectl edit wf [your_wf_name] -n argo
   Find the keyword artifactGC and delete the entries found, e.g.:

   artifactGC:
     strategy: OnWorkflowDeletion
   artifactGCStatus:
     strategiesProcessed:
       OnWorkflowCompletion: true
       OnWorkflowDeletion: true

   Find the keyword finalizers and delete its entries, e.g.:

   finalizers:
   - workflows.argoproj.io/artifact-gc

3. Save the file.
4. Try deleting the wf again: kubectl delete --force wf [your_wf_name] -n argo

One way of doing these steps a bit faster is to delete all of the faulted wfs through the Argo UI (they will remain visible on the web UI), and then you only need to edit those wfs with kubectl edit as mentioned above. The wfs will be automatically deleted by k8s after you finish editing them.
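
For reference, the same finalizer removal can be done non-interactively with a JSON patch (a sketch; the workflow name and namespace are placeholders, and note that skipping the finalizer also skips artifact GC, so any remaining artifacts have to be cleaned up by hand):

kubectl patch wf [your_wf_name] -n argo --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'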

juliev0 commented 1 year ago

@leojeb Sorry you had to go through that hassle. Another way to delete these is to use the CLI and use the new "--force" option for "argo delete" (described here).
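
For example (a sketch reusing the workflow and namespace from the logs earlier in this thread):

argo delete acquire-token-jenny -n argo-main --force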

hnougher commented 1 year ago

I have the same issue, except with S3 storage. All was fine until I added the cleanup on workflow deletion.

juliev0 commented 1 year ago

@hnougher @pockyhe Sorry for any hassle. A design decision was made that, in the case that Artifact Garbage Collection fails, the Workflow shouldn't be deleted, and the Pods that are used to delete the artifacts should remain so their logs can be viewed to see what went wrong. I may need to clarify this in the documentation.

Were you able to determine why the garbage collection wasn't successful? You should see pods with "wfcomp" and "wfdel" in the name and with Label workflows.argoproj.io/workflow set to your Workflow, and you can view the logs in those. Also, there should be one or more "Conditions" in your Workflow's Status that would contain the error message.

Make sure you follow the guidance on role bindings in this section.
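
For reference, a sketch of how to find those, reusing the namespace and workflow name from the logs earlier in this thread as placeholders:

kubectl get pods -n argo-main -l workflows.argoproj.io/workflow=acquire-token-jenny
kubectl logs -n argo-main <name-of-the-artgc-pod>
kubectl get wf acquire-token-jenny -n argo-main -o jsonpath='{.status.conditions}{"\n"}'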

pockyhe commented 1 year ago

Sorry for the late reply. I deleted them successfully by removing the finalizers while deleting the workflow CRs. @leojeb, as you mentioned, I did set workflow-level ArtifactGC to OnWorkflowDeletion in some workflows and found they could not be deleted. Even after I reverted that setting, the problem persisted for some workflows, similar to @hnougher. @juliev0, thank you very much for analyzing and explaining this problem.

hnougher commented 1 year ago

Hi @juliev0 . Regarding "Pods that are used to delete the artifacts", I cannot see any evidence of this happening for me at all.

I do see that the workflow status is getting updated to show it has processed it, but that is all. The artifacts still exist as well.

    strategiesProcessed:
      OnWorkflowCompletion: true
      OnWorkflowDeletion: true

I have attempted to use the kubelet/containerd node logs to locate if the pods wfcomp/wfdel ran at all, without success.

I have also tried adjusting the service account a little, with no difference to the observations above.

juliev0 commented 1 year ago

Hi @juliev0 . Regarding "Pods that are used to delete the artifacts", I cannot see any evidence of this happening for me at all.

I do see that the workflow status is getting updated to show it has processed it, but that is all. The artifacts still exist as well.

    strategiesProcessed:
      OnWorkflowCompletion: true
      OnWorkflowDeletion: true

I have attempted to use the kubelet/containerd node logs to locate if the pods wfcomp/wfdel ran at all, without success.

I have also tried adjusting the service account a little, with no difference to the observations above.

Can you please attach your Workflow (and any WorkflowTemplate it may reference) plus your Workflow Controller log so I can look into it?

hnougher commented 1 year ago

Let's use the example in the guide without the override; this exhibits the issue. It uses the default repository key of the namespace.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
  namespace: pstools-main
spec:
  entrypoint: main
  serviceAccountName: argo-executor  # with or without this
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "can throw this away" > /tmp/temporary-artifact.txt
      outputs:
        artifacts:
          - name: temporary-artifact
            path: /tmp/temporary-artifact.txt

And the log file around a single run of this workflow with deletion. argo-workflows-controller-58985bbf6-b2g82.log

Also, a new finding: if I edit the workflow YAML that is supposed to be already deleted and remove the finalizer, all associated pods disappear in less than a second (I use K8s Lens to watch it in real time). The S3 artifacts still exist, though.

juliev0 commented 1 year ago

argo-workflows-controller-58985bbf6-b2g82.log

Thanks for sharing that. One question: I see you are using the "default repository key of the namespace", and that appears to be the main difference from the example, right? If you specify a key instead, does it work for you? If that's the case, can you help me reproduce? Are you using the artifact-repositories configmap, and which keys there are you specifying?

I'm looking at your log file and I see that a pod was started to perform ArtifactGC:

time="2023-01-10T06:45:13.963Z" level=info msg="creating pod to delete artifacts: artifact-gc-gknlr-artgc-wfdel-2166136261" namespace=pstools-main strategy=OnWorkflowDeletion workflow=artifact-gc-gknlr
time="2023-01-10T06:45:13.972Z" level=info msg="Create pods 201"

It appears that it failed:

time="2023-01-10T06:45:23.982Z" level=info msg="reconciling artifact-gc pod" message= namespace=pstools-main phase=Failed pod=artifact-gc-gknlr-artgc-wfdel-2166136261 workflow=artifact-gc-gknlr

Do you not see a Pod with that name? Can you do kubectl logs on it to see what it says? (also, your WorkflowStatus should show a Condition with an error message from that log)
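
For reference, using the pod name that appears in those log lines:

kubectl logs -n pstools-main artifact-gc-gknlr-artgc-wfdel-2166136261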

hnougher commented 1 year ago

Tell you what, finding the log for a pod that never really ran is very hard, since wildcards do not work. And it appears kube deletes the log for the pod within a minute or so, making it a race to locate it. But I did end up catching it in /var/log/containers on a node before it disappeared. The wfdel pod tried to start as the "default" service account, which doesn't have permission for listing workflowartifactgctasks.

Workflow conditions show nothing (see the attached screenshot).

Duplicated the "serviceAccountName" into the "artifactGC" section, and it works. I did not realise it was not inheriting the workflow's service account.

Issue 1: the Workflows interface does not show errors encountered during GC.
Issue 2: I think the GC should use the workflow's service account by default.
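
For reference, a sketch of the workaround described above: the earlier example workflow with the ServiceAccount set explicitly on the artifactGC block (names are taken from that example and may need adjusting):

cat <<'EOF' | kubectl create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
  namespace: pstools-main
spec:
  entrypoint: main
  serviceAccountName: argo-executor
  artifactGC:
    strategy: OnWorkflowDeletion
    # explicitly set; without this the GC pod was starting as the "default" ServiceAccount
    serviceAccountName: argo-executor
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "can throw this away" > /tmp/temporary-artifact.txt
      outputs:
        artifacts:
          - name: temporary-artifact
            path: /tmp/temporary-artifact.txt
EOF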

juliev0 commented 1 year ago

Tell you what, finding the log for a pod that never really ran is very hard, since wildcards do not work. And it appears kube deletes the log for the pod within a minute or so, making it a race to locate it. But I did end up catching it in /var/log/containers on a node before it disappeared. The wfdel pod tried to start as the "default" service account, which doesn't have permission for listing workflowartifactgctasks.

Workflow conditions show nothing (see the attached screenshot).

Duplicated the "serviceAccountName" into the "artifactGC" section, and it works. I did not realise it was not inheriting the workflow's service account.

Issue 1: the Workflows interface does not show errors encountered during GC.
Issue 2: I think the GC should use the workflow's service account by default.

Which version of Argo Workflows are you running? v3.4.4?

hnougher commented 1 year ago

Well... it was Bitnami Helm chart 5.1.0. Then just now I noticed it was using the "workflows-server" image 3.4.4 and "workflows-controller" 3.4.3. I updated the chart to 5.1.1, and the controller is now 3.4.4.

I ran the test again. Now I can see the pod being kept open and the condition populated on the workflow. Issue 1 was fixed; it was a silly release mistake...

juliev0 commented 1 year ago

Well... it was Bitnami Helm chart 5.1.0. Then just now I noticed it was using the "workflows-server" image 3.4.4 and "workflows-controller" 3.4.3. I updated the chart to 5.1.1, and the controller is now 3.4.4.

I ran the test again. Now I can see the pod being kept open and the condition populated on the workflow. Issue 1 was fixed; it was a silly release mistake...

Great. I was hoping for that.

As for your issue 2, I see where you're coming from. Currently, the ArtifactGC ServiceAccount is specified on the artifact level and on the Workflow level, where the artifact level can override the Workflow level. If we were to add in the back up of using the Workflow level ServiceAccount then we should probably also have a back up of using the regular template-level ServiceAccount. If we have all 4, then what would be the order of precedence? Maybe:

  1. Artifact-level GC SA
  2. Workflow-level GC SA
  3. Template-level regular SA
  4. Workflow-level regular SA

I suppose there was a decision to require it to be explicitly defined as part of ArtifactGC so that there was no ambiguity in the order of precedence, but perhaps that decision could be re-evaluated since it tripped you up. Feel free to add an Enhancement issue if you like.

hnougher commented 1 year ago

There is still something wrong here. I have set the service account in the global defaults, which works for the example case but not for my complex workflow. The global default was added after I had added GC to every workflow everywhere and it wasn't working. And again it is the GC pod that is not being created.

I cannot share my set of workflows, so I will have to work out what is going on to make a simplified case. The general structure that I suspect the issue is inside:

  1. WorkflowTemplate A Template A DAG calls WorkflowTemplate A Template B one or more times with different arguments.
  2. WorkflowTemplate A Template B DAG calls WorkflowTemplate B Template A which creates the artifact (templateRef).
  3. WorkflowTemplate A Template B DAG calls WorkflowTemplate C Template A which uses the artifact (templateRef).

I hope that makes sense.

juliev0 commented 1 year ago

There is still something wrong here. I have set the service account in the global defaults, which works for the example case but not for my complex workflow. The global default was added after I had added GC to every workflow everywhere and it wasn't working. And again it is the GC pod that is not being created.

I cannot share my set of workflows, so I will have to work out what is going on to make a simplified case. The general structure that I suspect the issue is inside:

  1. WorkflowTemplate A Template A DAG calls WorkflowTemplate A Template B one or more times with different arguments.
  2. WorkflowTemplate A Template B DAG calls WorkflowTemplate B Template A which creates the artifact (templateRef).
  3. WorkflowTemplate A Template B DAG calls WorkflowTemplate C Template A which uses the artifact (templateRef).

I hope that makes sense.

Do you have any Workflow Controller log you can provide? And do you see any "Condition" on the Workflow Status?

If you inspect the Workflow Status, you should see that the nodes each have a status; what we should garbage collect is any node in there with an output artifact whose GC strategy is set.

juliev0 commented 1 year ago

@hnougher Where is the ArtifactGC Strategy set in your example?

Andykmcc commented 1 year ago

Pre-requisites

What happened/what you expected to happen?

I'm using cronworkflows with artifacts stored in min.io. Old workflow runs for one of the cronworkflows do not get deleted when they should (after 24 hours). The workflows are stuck in a "pending deletion" state. I can clean them up by removing the artifactGC finalizer from the manifest, but I shouldn't need to do this. I have another cronworkflow that is very similar in the same namespace that works properly. I've tried deleting the troublesome cronworkflow and recreating it, to no avail. I've also created both workflows in another namespace with the same result. They both use the same serviceAccount.

It did clean up old pods when I told it to, which I have recently disabled so I can easily grab logs.

Version

v3.4.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: photon
spec:
  imagePullSecrets:
  - name: dockerconfigjson-github-com
  entrypoint: generate-db
  templates:
  - name: generate-db
    inputs:
      parameters:
        - name: photon-db-src-path
        - name: dest-bucket
        - name: dest-key
    outputs:
      artifacts:
      - name: photon-db
        path: "{{inputs.parameters.photon-db-src-path}}"
        s3:
          bucket: "{{inputs.parameters.dest-bucket}}"
          key: "{{inputs.parameters.dest-key}}"
          endpoint: minio.techlabor.org:9000
          insecure: true
          accessKeySecret:
            name: argo-workflow-artifact-minio-creds
            key: accessKey
          secretKeySecret:
            name: argo-workflow-artifact-minio-creds
            key: secretKey
        artifactGC:
          strategy: Never
    container:
      image: ghcr.io/bikehopper/photon-db-nominatim-importer:v2.0.0
      command: ["/usr/app/build.sh"]
      resources:
        requests:
          memory: "3Gi"
          cpu: "3000m"
        limits:
          memory: "6Gi"
          cpu: "4000m"
      envFrom:
        - secretRef:
            name: minio-photon
      env:
        - name: MINIO_HOST
          value: http://minio.techlabor.org:9000
        - name: NOMINATIM_PASSWORD
          valueFrom:
            secretKeyRef:
              name: nominatim-db
              key: password
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: build-photon-db
spec:
  schedule: "0 2,14 * * *"
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  workflowSpec:
    imagePullSecrets:
    - name: dockerconfigjson-github-com
    entrypoint: build-photon-db
    artifactGC:
      strategy: OnWorkflowDeletion
    templates:
    - name: build-photon-db
      steps:
      - - name: photon
          templateRef:
            name: photon
            template: generate-db
          arguments:
            parameters:
            - name: photon-db-src-path
              value: /usr/app/photon_data
            - name: dest-bucket
              value: photon-staging
            - name: dest-key
              value: /elasticsearch/photon_data.tgz

Logs from the workflow controller

time="2023-01-28T04:42:10.938Z" level=info msg="Enforcing history limit for 'build-graph-cache'" namespace=staging workflow=build-graph-cache
time="2023-01-28T04:42:10.938Z" level=info msg="Enforcing history limit for 'build-photon-db'" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.952Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.952Z" level=info msg="Deleted Workflow 'build-photon-db-1674828000' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.961Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.961Z" level=info msg="Deleted Workflow 'build-photon-db-1674784800' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.966Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.966Z" level=info msg="Deleted Workflow 'build-photon-db-swb7d' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.970Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.970Z" level=info msg="Deleted Workflow 'build-photon-db-5q96d' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.974Z" level=info msg="Delete workflows 200"
...

Logs from in your workflow's wait container

time="2023-01-27T02:00:02.272Z" level=info msg="Starting Workflow Executor" version=v3.4.4
time="2023-01-27T02:00:02.274Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-27T02:00:02.274Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=staging podName=build-photon-db-1674784800-generate-db-1890044021 template="{\"name\":\"generate-db\",\"inputs\":{\"parameters\":[{\"name\":\"photon-db-src-path\",\"value\":\"/usr/app/photon_data\"},{\"name\":\"dest-bucket\",\"value\":\"photon-staging\"},{\"name\":\"dest-key\",\"value\":\"/elasticsearch/photon_data.tgz\"}]},\"outputs\":{\"artifacts\":[{\"name\":\"photon-db\",\"path\":\"/usr/app/photon_data\",\"s3\":{\"endpoint\":\"minio.techlabor.org:9000\",\"bucket\":\"photon-staging\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"argo-workflow-artifact-minio-creds\",\"key\":\"accessKey\"},\"secretKeySecret\":{\"name\":\"argo-workflow-artifact-minio-creds\",\"key\":\"secretKey\"},\"key\":\"/elasticsearch/photon_data.tgz\"},\"artifactGC\":{\"strategy\":\"Never\"}}]},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"ghcr.io/bikehopper/photon-db-nominatim-importer:v2.0.0\",\"command\":[\"/usr/app/build.sh\"],\"envFrom\":[{\"secretRef\":{\"name\":\"minio-photon\"}}],\"env\":[{\"name\":\"MINIO_HOST\",\"value\":\"http://minio.techlabor.org:9000\"},{\"name\":\"NOMINATIM_PASSWORD\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nominatim-db\",\"key\":\"password\"}}},{\"name\":\"POD_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.namespace\"}}}],\"resources\":{\"limits\":{\"cpu\":\"4\",\"memory\":\"6Gi\"},\"requests\":{\"cpu\":\"3\",\"memory\":\"3Gi\"}}}}" version="&Version{Version:v3.4.4,BuildDate:2022-11-29T16:49:53Z,GitCommit:3b2626ff900aff2424c086a51af5929fb0b2d7e5,GitTag:v3.4.4,GitTreeState:clean,GoVersion:go1.18.8,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-27T02:00:02.274Z" level=info msg="Starting deadline monitor"
time="2023-01-27T02:05:02.274Z" level=info msg="Alloc=6127 TotalAlloc=12422 Sys=24274 NumGC=6 Goroutines=7"
time="2023-01-27T02:10:02.274Z" level=info msg="Alloc=6147 TotalAlloc=12529 Sys=24530 NumGC=8 Goroutines=7"
time="2023-01-27T02:12:53.423Z" level=info msg="Main container completed" error="<nil>"
time="2023-01-27T02:12:53.423Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-01-27T02:12:53.423Z" level=info msg="No output parameters"
time="2023-01-27T02:12:53.423Z" level=info msg="Saving output artifacts"
time="2023-01-27T02:12:53.424Z" level=info msg="Staging artifact: photon-db"
time="2023-01-27T02:12:53.424Z" level=info msg="Copying /usr/app/photon_data from container base image layer to /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:12:53.424Z" level=info msg="/var/run/argo/outputs/artifacts/usr/app/photon_data.tgz -> /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:12:53.637Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/photon-db.tgz, key: /elasticsearch/photon_data.tgz"
time="2023-01-27T02:12:53.637Z" level=info msg="Creating minio client using static credentials" endpoint="minio.techlabor.org:9000"
time="2023-01-27T02:12:53.637Z" level=info msg="Saving file to s3" bucket=photon-staging endpoint="minio.techlabor.org:9000" key=/elasticsearch/photon_data.tgz path=/tmp/argo/outputs/artifacts/photon-db.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="Save artifact" artifactName=photon-db duration=7.494728595s error="<nil>" key=/elasticsearch/photon_data.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/photon-db.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:13:01.149Z" level=info msg="Create workflowtaskresults 201"
time="2023-01-27T02:13:01.150Z" level=info msg="Deadline monitor stopped"
time="2023-01-27T02:13:01.150Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-01-27T02:13:01.150Z" level=info msg="Alloc=14950 TotalAlloc=64944 Sys=33234 NumGC=17 Goroutines=13"
Stream closed EOF for staging/build-photon-db-1674784800-generate-db-1890044021 (wait)

juliev0 commented 1 year ago

@Andykmcc A design decision was made to keep the Workflow around in the case of artifact GC failure. You should see the reason why your deletion failed to occur in various places:

  1. you should see a pod whose name contains your workflow name + "-artgc-wfdel". If you do kubectl logs for that pod it should give you information on why it failed (those pods get left around until the Workflow is deleted)
  2. some of that information should be surfaced up into the Workflow Status as a Condition
  3. it should be in the Workflow Controller log as well

juliev0 commented 1 year ago

Should it be configurable that if GC fails the Workflow still gets deleted anyway?

Andykmcc commented 1 year ago

I found a workaround for now. By adding a nonsense step to the workflow that outputs an artifact file, the workflows started to get cleaned up. My default artifactGC strategy is OnWorkflowDeletion; the workflow in question only had one step, which set its own step-specific artifactGC strategy, Never. I speculated that maybe that was creating the issue, so I added this extra "print-message" nonsense step to see if having a step that actually needed GC would change anything. It did.

you should see a pod whose name contains your workflow name + "-artgc-wfdel". If you do kubectl logs for that pod it should give you information on why it failed (those pods get left around until the Workflow is deleted)

The -artgc-wfdel pod is cleaned up immediately, so it is difficult to get the logs from it (my LMA stack is in progress). The workflow takes 13-ish minutes to complete, so iterating on this is slow. I'll try to reproduce this bug in a faster workflow so I can grab the container logs more easily. Maybe that will shed light on what I explained above.

Should it be configurable that if GC fails the Workflow still gets deleted anyway?

I'm indifferent to this at the moment. Eventually I want it so that if GC fails the workflow does not get cleaned up. Then I'll see it hanging around in the UI/CLI/metrics and will know I need to take action to avoid filling up my artifact storage.

juliev0 commented 1 year ago

I found a workaround for now. By adding a nonsense step to the workflow that outputs an artifact file, the workflows started to get cleaned up. My default artifactGC strategy is OnWorkflowDeletion; the workflow in question only had one step, which set its own step-specific artifactGC strategy, Never. I speculated that maybe that was creating the issue, so I added this extra "print-message" nonsense step to see if having a step that actually needed GC would change anything. It did.

you should see a pod whose name contains your workflow name + "-artgc-wfdel". If you do kubectl logs for that pod it should give you information on why it failed (those pods get left around until the Workflow is deleted)

The -artgc-wfdel pod is cleaned up immediately, so it is difficult to get the logs from it (my LMA stack is in progress). The workflow takes 13-ish minutes to complete, so iterating on this is slow. I'll try to reproduce this bug in a faster workflow so I can grab the container logs more easily. Maybe that will shed light on what I explained above.

Should it be configurable that if GC fails the Workflow still gets deleted anyway?

I'm indifferent to this at the moment. Eventually I want it so that if GC fails the workflow does not get cleaned up. Then I'll see it hanging around in the UI/CLI/metrics and will know I need to take action to avoid filling up my artifact storage.

Interesting finding. Thanks for attaching your Workflow. I'll try reproducing it.

juliev0 commented 1 year ago

@Andykmcc I think I see where in the code it's doing exactly what you say - I will fix it. Probably no need to do any more troubleshooting on your part. Thanks.

juliev0 commented 1 year ago

@Andykmcc this PR fixes the issue you were seeing, where the Workflow is assumed to have ArtifactGC (and gets the finalizer added) if it's defined on the Workflow level but overridden as "Never" on the Artifact level.

hnougher commented 1 year ago

Hi @juliev0 , I had trouble finding time to get back to it.

I have been able to create a test case for my direct issue. The key item is that the first and second templates are in different WorkflowTemplates. If "test" is a Workflow instead of a WorkflowTemplate, it will clean up fine. Hopefully it is the same cause; otherwise this may need splitting out into another issue.

  1. Run "test" from the GUI.
  2. Press delete once finished.
  3. The GC pod never appears or runs.
  4. Workflow marked as having processed GC with no error, but not deleting either artifact or workflow.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test
spec:
  entrypoint: test
  templates:
  - name: test
    dag:
      tasks:
      - name: collect-single
        templateRef:
          name: artifact-passing-test
          template: whalesay
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: artifact-passing-test
spec:
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello_world.txt

with Helm chart workflow template of

  workflowDefaults: |
    spec:
      artifactGC:
        strategy: OnWorkflowDeletion
        serviceAccountName: argo-executor
      serviceAccountName: argo-executor

And s3 details in the namespace defaults.

juliev0 commented 1 year ago

Hi @juliev0 , I had trouble finding time to get back to it.

I have been able to create a test case for my direct issue. The key item is that the first and second templates are in different WorkflowTemplates. If "test" is a Workflow instead of a WorkflowTemplate, it will clean up fine. Hopefully it is the same cause; otherwise this may need splitting out into another issue.

  1. Run "test" from the GUI.
  2. Press delete once finished.
  3. The GC pod never appears or runs.
  4. Workflow marked as having processed GC with no error, but not deleting either artifact or workflow.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test
spec:
  entrypoint: test
  templates:
  - name: test
    dag:
      tasks:
      - name: collect-single
        templateRef:
          name: artifact-passing-test
          template: whalesay
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: artifact-passing-test
spec:
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello_world.txt

with Helm chart workflow template of

  workflowDefaults: |
    spec:
      artifactGC:
        strategy: OnWorkflowDeletion
        serviceAccountName: argo-executor
      serviceAccountName: argo-executor

And s3 details in the namespace defaults.

v3.4.5 just got released. Did you try with that?

hnougher commented 1 year ago

Hi @juliev0 . No sign of 3.4.5 under the Bitnami Helm chart at this time.

hnougher commented 1 year ago

Latest Bitnami Helm chart 19.1.7 contains Controller 3.4.4, Exec 3.4.4 and Server 3.4.5. My test case still fails.

juliev0 commented 1 year ago

Hmm. I’m not sure who produces that. So, mainly it’s the Controller that needs to be 3.4.5.

hnougher commented 1 year ago

I submitted a ticket to Bitnami: https://github.com/bitnami/charts/issues/14805#issue-1577266898. They have rebuilt the Helm chart with all of the latest versions.

And I can now confirm that both my test case and live workflows are all cleaning up perfectly! Such a relief. Thank you for solving it.

drewterry commented 1 year ago

I am experiencing the same issue and cannot delete workflows with artifactGC set in 3.4.5. Any other ideas?

Having re-read this thread, it would appear that I probably need to review the logs in wfdel containers.

However, I currently have ~30 workflows that will not delete, even using the --force option with the Argo CLI. Will I have to manually remove the finalizer from each one?

drewterry commented 1 year ago

I was able to delete the workflows by removing the finalizer entry manually. Note that the --force option did not work.

Below are the logs from my wfdel and wfcomp pods. I see no apparent issues, can anyone point me in the right direction to understand the issue here?

kubectl logs workflow-z5pz2-artgc-wfdel-1494256823
time="2023-03-23T18:36:21.278Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log"
time="2023-03-23T18:36:21.278Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.493Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log
time="2023-03-23T18:36:21.611Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log"
time="2023-03-23T18:36:21.611Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.667Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log

kubectl logs workflow-z5pz2-artgc-wfcomp-1494256823
time="2023-03-23T18:36:21.146Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz"
time="2023-03-23T18:36:21.146Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.429Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz

juliev0 commented 1 year ago

I was able to delete the workflows by removing the finalizer entry manually. Note that the --force option did not work.

Below are the logs from my wfdel and wfcomp pods. I see no apparent issues, can anyone point me in the right direction to understand the issue here?

kubectl logs workflow-z5pz2-artgc-wfdel-1494256823
time="2023-03-23T18:36:21.278Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log"
time="2023-03-23T18:36:21.278Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.493Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log
time="2023-03-23T18:36:21.611Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log"
time="2023-03-23T18:36:21.611Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.667Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log

kubectl logs workflow-z5pz2-artgc-wfcomp-1494256823
time="2023-03-23T18:36:21.146Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz"
time="2023-03-23T18:36:21.146Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.429Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz

Yeah, I don't see an error in those logs. Do you see any "Condition" added to the Workflow Status? If it's repeatable with a given Workflow maybe you can write up a new bug and direct me to it?

everestas commented 1 year ago

I am getting the same issue as everyone else here. However, I am trying to replicate it with a simpler workflow and can't find what is triggering it. What I noticed is that when I delete a healthy workflow, Argo spawns a new pod which takes care of deleting the artifacts, and then the workflow gets deleted.

With an unhealthy workflow I don't see the same: it doesn't spawn any pod, and the workflow doesn't get deleted.

By unhealthy I mean that it shows the error "Artifact garbage collection failed".

So, I tried one thing. I ran this workflow:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:  
  name: start-deployment
spec:
  entrypoint: init-dag
  artifactRepositoryRef:
    configMap: workflow-controller-configmap
    key: artifactRepository
  archiveLogs: true
  podGC:
    strategy: OnWorkflowCompletion
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: init-dag
      dag:
        tasks:
          - name: start-dp
            template: start-deployment
      outputs:
        artifacts:
        - name: index
          from: "{{tasks.start-dp.outputs.artifacts.index}}"
    - name: start-deployment
      outputs:
        artifacts:
        - name: index
          path: /tmp/index.html
      container:
        name: ''
        image: alpine/git
        command:
          - sh
          - '-c'
        args:
          - >-
            git clone https://github.com/octocat/Spoon-Knife.git
            && cd Spoon-Knife
            && cat index.html > /tmp/index.html

Then I manually deleted the artifacts and then tried to delete the workflow, and it didn't get deleted. It means that if Argo doesn't find the corresponding artifact for the workflow, the workflow doesn't get deleted at all. I think this shouldn't be the case; the workflow should be deleted even if the artifact doesn't exist anymore.

Note: I save my artifacts using blobNameFormat: "{{workflow.name}}/{{pod.name}}"

drewterry commented 1 year ago

I was able to delete the workflows by removing the finalizer entry manually. Note that the --force option did not work. Below are the logs from my wfdel and wfcomp pods. I see no apparent issues, can anyone point me in the right direction to understand the issue here?

kubectl logs workflow-z5pz2-artgc-wfdel-1494256823
time="2023-03-23T18:36:21.278Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log"
time="2023-03-23T18:36:21.278Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.493Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log
time="2023-03-23T18:36:21.611Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log"
time="2023-03-23T18:36:21.611Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.667Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log

kubectl logs workflow-z5pz2-artgc-wfcomp-1494256823
time="2023-03-23T18:36:21.146Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz"
time="2023-03-23T18:36:21.146Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.429Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz

Yeah, I don't see an error in those logs. Do you see any "Condition" added to the Workflow Status? If it's repeatable with a given Workflow maybe you can write up a new bug and direct me to it?

Thanks for taking a look @juliev0, I submitted a new bug in #10840

juliev0 commented 1 year ago

I am getting the same issue as everyone else here. However, I am trying to replicate it with a simpler workflow and can't find what is triggering it. What I noticed is that when I delete a healthy workflow, Argo spawns a new pod which takes care of deleting the artifacts, and then the workflow gets deleted.

With an unhealthy workflow I don't see the same: it doesn't spawn any pod, and the workflow doesn't get deleted.

By unhealthy I mean that it shows the error "Artifact garbage collection failed".

So, I tried one thing. I ran this workflow:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:  
  name: start-deployment
spec:
  entrypoint: init-dag
  artifactRepositoryRef:
    configMap: workflow-controller-configmap
    key: artifactRepository
  archiveLogs: true
  podGC:
    strategy: OnWorkflowCompletion
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: init-dag
      dag:
        tasks:
          - name: start-dp
            template: start-deployment
      outputs:
        artifacts:
        - name: index
          from: "{{tasks.start-dp.outputs.artifacts.index}}"
    - name: start-deployment
      outputs:
        artifacts:
        - name: index
          path: /tmp/index.html
      container:
        name: ''
        image: alpine/git
        command:
          - sh
          - '-c'
        args:
          - >-
            git clone https://github.com/octocat/Spoon-Knife.git
            && cd Spoon-Knife
            && cat index.html > /tmp/index.html

Then I manually deleted the artifacts and then tried to delete the workflow, and it didn't get deleted. It means that if Argo doesn't find the corresponding artifact for the workflow, the workflow doesn't get deleted at all. I think this shouldn't be the case; the workflow should be deleted even if the artifact doesn't exist anymore.

Note: I save my artifacts using blobNameFormat: "{{workflow.name}}/{{pod.name}}"

There is a Finalizer on the Workflow which prevents Workflow deletion until all artifacts have been deleted. Did you inspect your Workflow's Status (kubectl get workflow <name> -o yaml) to see if there was more information about why the ArtifactGC failed? Also, you can look at the Workflow Controller log file and the GC Pod logs. There is now this enhancement issue, which you might be interested in, for which I have submitted a PR.

juliev0 commented 1 year ago

Hi all. I have submitted this enhancement, and this PR to address it.