Closed pockyhe closed 1 year ago
While updating Argo Workflows, I found that the CRD workflows.argoproj.io stays stuck in Terminating status in k8s.
@pockyhe Do you have any finalizer or webhook that will prevent the deletion? Can you provide more information like full workflow manifest and controller log?
I don't have any finalizer or webhook. I didn't find much that was useful in the logs.

In the workflow server:
time="2022-12-09T02:31:00.716Z" level=info duration=37.926876ms method=DELETE path=/api/v1/workflows/argo-main/acquire-token-jenny size=2 status=0

In the Workflow Controller:
time="2022-12-09T02:45:00.121Z" level=info msg="Processing workflow" namespace=argo-main workflow=acquire-token-jenny
time="2022-12-09T02:45:00.121Z" level=info msg="Checking daemoned children of " namespace=argo-main workflow=acquire-token-jenny
Please provide a Workflow so we can reproduce this issue. I am wondering if it has to do with the new ArtifactGC feature. What that feature does is determine if your Workflow is using Artifact GC, and if so it adds a finalizer to the Workflow to prevent it from being deleted until the Artifact GC has occurred. If for some reason the Controller thinks you have Artifact GC configured but then can't delete the artifacts this could occur.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
That is likely to be the case. I encountered the same situation where I set workflow-level ArtifactGC to OnWorkflowDeletion in some workflows and those were the wfs that I couldn't delete by any means. While workflow deletion works fine with wfs that didn't have any ArtifactGC settings specified.
If you specified ArtifactGC settings in those wfs, you could try changing those wf resources. I followed the steps below, after which you should be able to delete wfs normally, but you will need to do this for every wf that has the issue:
1. kubectl get wf -n argo
2. kubectl edit wf [your_wf_name] -n argo
   Find the keyword artifactGC and delete the entries found, e.g.
   artifactGC:
     strategy: OnWorkflowDeletion
   artifactGCStatus:
     strategiesProcessed:
       OnWorkflowCompletion: true
       OnWorkflowDeletion: true
   Then find the keyword finalizers and delete its entries, e.g.
   finalizers:
     - workflows.argoproj.io/artifact-gc
3. Save it and try deleting again:
4. kubectl delete --force wf [your_wf_name] -n argo
One way of doing these steps a bit faster is to delete all the stuck wfs through the Argo UI (they will remain visible on the web UI) and then you only need to edit those wfs with kubectl edit as mentioned above. The wfs will be automatically deleted by k8s after you finish editing them.
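For many stuck wfs, the manual edit above can be scripted. A minimal sketch (the helper name is mine; it assumes the Workflow manifest has already been fetched as a dict, e.g. via kubectl get wf -o json), mirroring the same steps:

```python
def strip_artifact_gc(manifest: dict) -> dict:
    """Remove the artifact-gc finalizer and the ArtifactGC entries so the
    Workflow can be deleted (mirrors the manual `kubectl edit` steps)."""
    meta = manifest.setdefault("metadata", {})
    # Drop only the artifact-gc finalizer; keep any other finalizers intact.
    meta["finalizers"] = [
        f for f in meta.get("finalizers", [])
        if f != "workflows.argoproj.io/artifact-gc"
    ]
    # Remove the spec-level ArtifactGC setting and the GC status record.
    manifest.get("spec", {}).pop("artifactGC", None)
    manifest.get("status", {}).pop("artifactGCStatus", None)
    return manifest


wf = {
    "metadata": {
        "name": "acquire-token-jenny",
        "finalizers": ["workflows.argoproj.io/artifact-gc"],
    },
    "spec": {"artifactGC": {"strategy": "OnWorkflowDeletion"}},
    "status": {"artifactGCStatus": {"strategiesProcessed": {"OnWorkflowDeletion": True}}},
}
strip_artifact_gc(wf)
print(wf["metadata"]["finalizers"])  # []
```

After stripping, the edited manifest can be applied back, and the pending deletion completes.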
@leojeb Sorry you had to go through that hassle. Another way to delete these is to use the new "--force" option for "argo delete" in the CLI (described here).
I have the same issue, except with an S3 storage. All was fine until I added the cleanup on workflow deletion.
@hnougher @pockyhe Sorry for any hassle. A design decision was made that in the case that Artifact Garbage Collection fails, the Workflow shouldn't be deleted, and the Pods that are used to delete the artifacts should remain so their logs can be viewed to see what went wrong. I may need to clarify the documentation on this point.
Were you able to determine why the garbage collection wasn't successful? You should see pods with "wfcomp" and "wfdel" in the name, with the label workflows.argoproj.io/workflow set to your Workflow, and you can view the logs in those. Also, there should be one or more "Conditions" in your Workflow's Status that contain the error message.
Make sure you follow the guidance as far as rolebinding in this section.
Sorry to reply so late. I deleted them successfully by changing the finalizers while deleting the workflow CRD. @leojeb, as you mentioned, I did set workflow-level ArtifactGC to OnWorkflowDeletion in some workflows and found they could not be deleted. Once it had been set, even reverting the setting still caused this problem for some workflows, similar to what @hnougher saw. @juliev0, I'm very thankful to you for analyzing and explaining this problem.
Hi @juliev0 . Regarding "Pods that are used to delete the artifacts", I cannot see any evidence of this happening for me at all.
I do see that the workflow status is getting updated to show it has processed it, but that is all. The artifacts still exist as well.
strategiesProcessed:
  OnWorkflowCompletion: true
  OnWorkflowDeletion: true
I have attempted to use the kubelet/containerd node logs to locate if the pods wfcomp/wfdel ran at all, without success.
I have also tried adjusting the service account a little, with no difference to the observations above.
Can you please attach your Workflow (and any WorkflowTemplate it may reference) plus your Workflow Controller log so I can look into it?
Let's use the example in the guide without the override; this reproduces the issue. It uses the default repository key of the namespace.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
  namespace: pstools-main
spec:
  entrypoint: main
  serviceAccountName: argo-executor # with or without this
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "can throw this away" > /tmp/temporary-artifact.txt
      outputs:
        artifacts:
          - name: temporary-artifact
            path: /tmp/temporary-artifact.txt
And the log file around a single run of this workflow with deletion. argo-workflows-controller-58985bbf6-b2g82.log
Also, a new learning: if I edit the workflow yaml that is supposed to be already deleted to remove the "finalizer", all associated pods disappear in less than a second. I use K8s Lens to view it in real time. The S3 artifacts still exist, though.
Thanks for sharing that. One question: I see you are using the "default repository key of the namespace", and that appears to be the main difference from the example, right? If you specify a key instead, does it work for you? If that's the case, can you help me reproduce? Are you using the artifact-repositories configmap, and which keys there are you specifying?
I'm looking at your log file and I see that a pod was started to perform ArtifactGC:
time="2023-01-10T06:45:13.963Z" level=info msg="creating pod to delete artifacts: artifact-gc-gknlr-artgc-wfdel-2166136261" namespace=pstools-main strategy=OnWorkflowDeletion workflow=artifact-gc-gknlr
time="2023-01-10T06:45:13.972Z" level=info msg="Create pods 201"
It appears that it failed:
time="2023-01-10T06:45:23.982Z" level=info msg="reconciling artifact-gc pod" message= namespace=pstools-main phase=Failed pod=artifact-gc-gknlr-artgc-wfdel-2166136261 workflow=artifact-gc-gknlr
Do you not see a Pod with that name? Can you do kubectl logs on it to see what it says? (Also, your Workflow's Status should show a Condition with an error message from that log.)
Tell you what, finding the log for a pod that never really ran is very hard, since wildcards do not work. And it appears kube deletes the pod's log within a minute or so, making it a race to locate it. But I did end up eyeing it in /var/log/containers on a node before it disappeared. The wfdel pod tried to start as the "default" service account, which doesn't have permission to list workflowartifactgctasks.
Workflow conditions show nothing.
Duplicated the "serviceAccountName" into the "artifactGC" section, and it works. I did not realise it was not inheriting the workflow's service account.
Issue 1: Workflows interface does not show errors encountered during GC. Issue 2: I think the GC should use the workflow's service account by default.
Which version of Argo Workflows are you running? v3.4.4?
Well... it was Bitnami Helm chart 5.1.0. Then just now I noticed it was using the "workflows-server" image 3.4.4 and the "workflows-controller" image 3.4.3. I updated the chart to 5.1.1, and the controller is now 3.4.4.
I ran the test again. Now I can see the pod being kept open and the condition populated on the workflow. Issue 1 was fixed by a silly release mistake...
Great. I was hoping for that.
As for your issue 2, I see where you're coming from. Currently, the ArtifactGC ServiceAccount is specified on the artifact level and on the Workflow level, where the artifact level can override the Workflow level. If we were to add in the back up of using the Workflow level ServiceAccount then we should probably also have a back up of using the regular template-level ServiceAccount. If we have all 4, then what would be the order of precedence? Maybe:
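To make the four-level question concrete, here is one hypothetical precedence resolver, falling back from most specific to least specific. This is purely my own sketch of one possible ordering, not a project decision or actual controller code:

```python
def resolve_gc_service_account(artifact_gc_sa=None,
                               workflow_artifact_gc_sa=None,
                               template_sa=None,
                               workflow_sa=None):
    """Hypothetical precedence: artifact-level ArtifactGC SA, then
    workflow-level ArtifactGC SA, then template-level SA, then
    workflow-level SA."""
    for sa in (artifact_gc_sa, workflow_artifact_gc_sa, template_sa, workflow_sa):
        if sa:
            return sa
    return "default"  # Kubernetes falls back to the namespace default SA


print(resolve_gc_service_account(workflow_sa="argo-executor"))  # argo-executor
```

With such a fallback, users who set only the Workflow-level serviceAccountName would get the behavior @hnougher expected, while explicit ArtifactGC settings would still win.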
There is still something wrong here. I have set the service account in the global defaults, which works for the example case but not for my complex workflow. (The global default was added after adding GC to every workflow individually didn't work.) And once again the GC pod is not being created.
I cannot share my set of workflows, so I will have to work out what is going on and make a simplified case. The general structure that I suspect the issue is inside:
- WorkflowTemplate A, Template A: a DAG that calls WorkflowTemplate A, Template B one or more times with different arguments.
- WorkflowTemplate A, Template B: a DAG that calls WorkflowTemplate B, Template A, which creates the artifact (templateRef).
- WorkflowTemplate A, Template B: the DAG also calls WorkflowTemplate C, Template A, which uses the artifact (templateRef).
I hope that makes sense.
Do you have any Workflow Controller log you can provide? and do you see any "Condition" on the Workflow Status?
If you inspect the Workflow Status, you should see that the nodes each have a status - what we should garbage collect is any node in there with an output artifact whose GC strategy is set.
@hnougher Where is the ArtifactGC Strategy set in your example?
I'm using cronworkflows with artifacts stored in min.io. Old workflow runs for one of the cronworkflows do not get deleted when they should (after 24 hours). The workflows are stuck in a "pending deletion" state. I can clean them up by removing the artifactGC finalizer from the manifest, but I shouldn't need to do this. I have another cron workflow that is very similar in the same namespace that works properly. I've tried deleting the troublesome cronworkflow and recreating it, to no avail. I've also created both workflows in another namespace, with the same result. They both use the same serviceAccount.
It did clean up old pods when I told it to, which I recently disabled so I can easily grab logs.
v3.4.4
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't submit a workflow that uses private images.
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: photon
spec:
  imagePullSecrets:
    - name: dockerconfigjson-github-com
  entrypoint: generate-db
  templates:
    - name: generate-db
      inputs:
        parameters:
          - name: photon-db-src-path
          - name: dest-bucket
          - name: dest-key
      outputs:
        artifacts:
          - name: photon-db
            path: "{{inputs.parameters.photon-db-src-path}}"
            s3:
              bucket: "{{inputs.parameters.dest-bucket}}"
              key: "{{inputs.parameters.dest-key}}"
              endpoint: minio.techlabor.org:9000
              insecure: true
              accessKeySecret:
                name: argo-workflow-artifact-minio-creds
                key: accessKey
              secretKeySecret:
                name: argo-workflow-artifact-minio-creds
                key: secretKey
            artifactGC:
              strategy: Never
      container:
        image: ghcr.io/bikehopper/photon-db-nominatim-importer:v2.0.0
        command: ["/usr/app/build.sh"]
        resources:
          requests:
            memory: "3Gi"
            cpu: "3000m"
          limits:
            memory: "6Gi"
            cpu: "4000m"
        envFrom:
          - secretRef:
              name: minio-photon
        env:
          - name: MINIO_HOST
            value: http://minio.techlabor.org:9000
          - name: NOMINATIM_PASSWORD
            valueFrom:
              secretKeyRef:
                name: nominatim-db
                key: password
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: build-photon-db
spec:
  schedule: "0 2,14 * * *"
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  workflowSpec:
    imagePullSecrets:
      - name: dockerconfigjson-github-com
    entrypoint: build-photon-db
    artifactGC:
      strategy: OnWorkflowDeletion
    templates:
      - name: build-photon-db
        steps:
          - - name: photon
              templateRef:
                name: photon
                template: generate-db
              arguments:
                parameters:
                  - name: photon-db-src-path
                    value: /usr/app/photon_data
                  - name: dest-bucket
                    value: photon-staging
                  - name: dest-key
                    value: /elasticsearch/photon_data.tgz
Logs from the workflow controller
time="2023-01-28T04:42:10.938Z" level=info msg="Enforcing history limit for 'build-graph-cache'" namespace=staging workflow=build-graph-cache
time="2023-01-28T04:42:10.938Z" level=info msg="Enforcing history limit for 'build-photon-db'" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.952Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.952Z" level=info msg="Deleted Workflow 'build-photon-db-1674828000' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.961Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.961Z" level=info msg="Deleted Workflow 'build-photon-db-1674784800' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.966Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.966Z" level=info msg="Deleted Workflow 'build-photon-db-swb7d' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.970Z" level=info msg="Delete workflows 200"
time="2023-01-28T04:42:10.970Z" level=info msg="Deleted Workflow 'build-photon-db-5q96d' due to CronWorkflow 'build-photon-db' history limit" namespace=staging workflow=build-photon-db
time="2023-01-28T04:42:10.974Z" level=info msg="Delete workflows 200"
...
Logs from in your workflow's wait container
time="2023-01-27T02:00:02.272Z" level=info msg="Starting Workflow Executor" version=v3.4.4
time="2023-01-27T02:00:02.274Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-01-27T02:00:02.274Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=staging podName=build-photon-db-1674784800-generate-db-1890044021 template="{\"name\":\"generate-db\",\"inputs\":{\"parameters\":[{\"name\":\"photon-db-src-path\",\"value\":\"/usr/app/photon_data\"},{\"name\":\"dest-bucket\",\"value\":\"photon-staging\"},{\"name\":\"dest-key\",\"value\":\"/elasticsearch/photon_data.tgz\"}]},\"outputs\":{\"artifacts\":[{\"name\":\"photon-db\",\"path\":\"/usr/app/photon_data\",\"s3\":{\"endpoint\":\"minio.techlabor.org:9000\",\"bucket\":\"photon-staging\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"argo-workflow-artifact-minio-creds\",\"key\":\"accessKey\"},\"secretKeySecret\":{\"name\":\"argo-workflow-artifact-minio-creds\",\"key\":\"secretKey\"},\"key\":\"/elasticsearch/photon_data.tgz\"},\"artifactGC\":{\"strategy\":\"Never\"}}]},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"ghcr.io/bikehopper/photon-db-nominatim-importer:v2.0.0\",\"command\":[\"/usr/app/build.sh\"],\"envFrom\":[{\"secretRef\":{\"name\":\"minio-photon\"}}],\"env\":[{\"name\":\"MINIO_HOST\",\"value\":\"http://minio.techlabor.org:9000\"},{\"name\":\"NOMINATIM_PASSWORD\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nominatim-db\",\"key\":\"password\"}}},{\"name\":\"POD_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.namespace\"}}}],\"resources\":{\"limits\":{\"cpu\":\"4\",\"memory\":\"6Gi\"},\"requests\":{\"cpu\":\"3\",\"memory\":\"3Gi\"}}}}" version="&Version{Version:v3.4.4,BuildDate:2022-11-29T16:49:53Z,GitCommit:3b2626ff900aff2424c086a51af5929fb0b2d7e5,GitTag:v3.4.4,GitTreeState:clean,GoVersion:go1.18.8,Compiler:gc,Platform:linux/amd64,}"
time="2023-01-27T02:00:02.274Z" level=info msg="Starting deadline monitor"
time="2023-01-27T02:05:02.274Z" level=info msg="Alloc=6127 TotalAlloc=12422 Sys=24274 NumGC=6 Goroutines=7"
time="2023-01-27T02:10:02.274Z" level=info msg="Alloc=6147 TotalAlloc=12529 Sys=24530 NumGC=8 Goroutines=7"
time="2023-01-27T02:12:53.423Z" level=info msg="Main container completed" error="<nil>"
time="2023-01-27T02:12:53.423Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-01-27T02:12:53.423Z" level=info msg="No output parameters"
time="2023-01-27T02:12:53.423Z" level=info msg="Saving output artifacts"
time="2023-01-27T02:12:53.424Z" level=info msg="Staging artifact: photon-db"
time="2023-01-27T02:12:53.424Z" level=info msg="Copying /usr/app/photon_data from container base image layer to /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:12:53.424Z" level=info msg="/var/run/argo/outputs/artifacts/usr/app/photon_data.tgz -> /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:12:53.637Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/photon-db.tgz, key: /elasticsearch/photon_data.tgz"
time="2023-01-27T02:12:53.637Z" level=info msg="Creating minio client using static credentials" endpoint="minio.techlabor.org:9000"
time="2023-01-27T02:12:53.637Z" level=info msg="Saving file to s3" bucket=photon-staging endpoint="minio.techlabor.org:9000" key=/elasticsearch/photon_data.tgz path=/tmp/argo/outputs/artifacts/photon-db.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="Save artifact" artifactName=photon-db duration=7.494728595s error="<nil>" key=/elasticsearch/photon_data.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/photon-db.tgz
time="2023-01-27T02:13:01.132Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/photon-db.tgz"
time="2023-01-27T02:13:01.149Z" level=info msg="Create workflowtaskresults 201"
time="2023-01-27T02:13:01.150Z" level=info msg="Deadline monitor stopped"
time="2023-01-27T02:13:01.150Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-01-27T02:13:01.150Z" level=info msg="Alloc=14950 TotalAlloc=64944 Sys=33234 NumGC=17 Goroutines=13"
Stream closed EOF for staging/build-photon-db-1674784800-generate-db-1890044021 (wait)
@Andykmcc A design decision was made to keep the Workflow around in the case of artifact GC failure. You should see the reason why your deletion failed in various places: there should be a pod whose name contains your workflow name + "-artgc-wfdel", and if you do kubectl logs for that pod it should give you information on why it failed (those pods get left around until the Workflow is deleted).
Should it be configurable that if GC fails the Workflow still gets deleted anyway?
I found a workaround for now. By adding a nonsense step to the workflow that outputs an artifact file, the workflows started to get cleaned up. My default artifactGC strategy is OnWorkflowDeletion. The workflow in question only had one step, which set its own step-specific artifactGC strategy, Never. I speculated maybe that was creating the issue, so I added an extra "print-message" nonsense step to see if having a step that actually needed GC would change anything. It did.
you should see a pod whose name contains your workflow name + "-artgc-wfdel". If you do kubectl logs for that pod it should give you information on why it failed (those pods get left around until the Workflow is deleted)
The -artgc-wfdel pod is cleaned up immediately, so it is difficult to get the logs from it (my LMA stack is in progress). The workflow takes 13-ish minutes to complete, so iterating on this is slow. I'll try to reproduce this bug in a faster workflow so I can grab the container logs more easily; maybe that will shed light on what I explained above.
Should it be configurable that if GC fails the Workflow still gets deleted anyway?
I'm indifferent to this at the moment. Eventually I want it so that if GC fails the workflow does not get cleaned up; then I'll see it hanging around in the UI/CLI/metrics and will know I need to take action to avoid filling up my artifact storage.
Interesting finding. Thanks for attaching your Workflow. I'll try reproducing it.
@Andykmcc I think I see where in the code it's doing exactly what you say - I will fix it. Probably no need to do any more troubleshooting on your part. Thanks.
@Andykmcc this PR fixes the issue you were seeing, where the Workflow was assumed to have ArtifactGC (and got the finalizer added) if it's defined on the Workflow level but overridden as "Never" on the Artifact level.
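The shape of that fix can be sketched as follows. This is my own paraphrase of the described behavior, not the actual controller code: the finalizer should only be added when at least one artifact's effective strategy, after per-artifact overrides, is something other than Never:

```python
def effective_strategy(artifact: dict, workflow_strategy: str) -> str:
    """A per-artifact artifactGC setting overrides the workflow-level strategy."""
    override = artifact.get("artifactGC", {}).get("strategy")
    return override or workflow_strategy


def needs_gc_finalizer(artifacts: list, workflow_strategy: str) -> bool:
    # Fixed behavior: add the finalizer only if some artifact's effective
    # strategy is not "Never". (The pre-fix code effectively added it
    # whenever the workflow-level strategy was set at all.)
    return any(
        effective_strategy(a, workflow_strategy) not in ("", "Never")
        for a in artifacts
    )


# A workflow-level OnWorkflowDeletion strategy overridden by Never on the
# only output artifact, as in Andykmcc's cron workflow:
arts = [{"name": "photon-db", "artifactGC": {"strategy": "Never"}}]
print(needs_gc_finalizer(arts, "OnWorkflowDeletion"))  # False
```

Under this logic, the workaround of adding a dummy step with a GC-eligible artifact makes sense: it gives the finalizer something real to do, so the GC pod runs and the finalizer is removed.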
Hi @juliev0 , I had trouble finding time to get back to it.
I have been able to create a test case for my direct issue. The key item is that the first and second templates are in different WorkflowTemplates. If "test" is a Workflow instead of a WorkflowTemplate, it cleans up fine. Hopefully it is the same cause, else this may need splitting into another issue.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test
spec:
  entrypoint: test
  templates:
    - name: test
      dag:
        tasks:
          - name: collect-single
            templateRef:
              name: artifact-passing-test
              template: whalesay
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: artifact-passing-test
spec:
  templates:
    - name: whalesay
      container:
        image: docker/whalesay:latest
        command: [sh, -c]
        args: ["cowsay hello world | tee /tmp/hello_world.txt"]
      outputs:
        artifacts:
          - name: hello-art
            path: /tmp/hello_world.txt
with a Helm chart workflowDefaults of
workflowDefaults: |
  spec:
    artifactGC:
      strategy: OnWorkflowDeletion
      serviceAccountName: argo-executor
    serviceAccountName: argo-executor
And s3 details in the namespace defaults.
- Run "test" from the GUI.
- Press delete once finished.
- The GC pod never appears or runs.
- Workflow marked as having processed GC with no error, but neither the artifact nor the workflow is deleted.
v3.4.5 just got released. Did you try with that?
Hi @juliev0 . No sign of 3.4.5 under the Bitnami Helm chart at this time.
Latest Bitnami Helm chart 19.1.7 contains Controller 3.4.4, Exec 3.4.4 and Server 3.4.5. My test case still fails.
Hmm. I’m not sure who produces that. So, mainly it’s the Controller that needs to be 3.4.5.
I submitted a ticket to Bitnami: https://github.com/bitnami/charts/issues/14805#issue-1577266898 They have remade the Helm chart with all the latest versions.
And I can now confirm that both my test case and live workflows are all cleaning up perfectly! Such a relief. Thank you for solving it.
I am experiencing the same issue and cannot delete workflows with artifactGC set in 3.4.5. Any other ideas?
Having re-read this thread, it would appear that I probably need to review the logs in the wfdel containers.
However, I currently have ~30 workflows that will not delete, even using the --force option with the Argo CLI. Will I have to manually remove the finalizer from each one?
I was able to delete the workflows by removing the finalizer entry manually. Note that the --force option did not work.
Below are the logs from my wfdel and wfcomp pods. I see no apparent issues; can anyone point me in the right direction to understand the issue here?
kubectl logs workflow-z5pz2-artgc-wfdel-1494256823
time="2023-03-23T18:36:21.278Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log"
time="2023-03-23T18:36:21.278Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.493Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log
time="2023-03-23T18:36:21.611Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log"
time="2023-03-23T18:36:21.611Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.667Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log
kubectl logs workflow-z5pz2-artgc-wfcomp-1494256823
time="2023-03-23T18:36:21.146Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz"
time="2023-03-23T18:36:21.146Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.429Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz
Yeah, I don't see an error in those logs. Do you see any "Condition" added to the Workflow Status? If it's repeatable with a given Workflow maybe you can write up a new bug and direct me to it?
I am getting the same issue as everyone else here. However, I am trying to replicate it with a simpler workflow and can't find what triggers it. What I noticed is that when I delete a healthy workflow, Argo spawns a new pod which takes care of deleting the artifacts, and then the workflow gets deleted.
With an unhealthy workflow I don't see the same: it doesn't spawn any pod, and the workflow doesn't get deleted.
By unhealthy I mean that it shows the error "Artifact garbage collection failed".
So, I tried one thing. I ran this workflow:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: start-deployment
spec:
entrypoint: init-dag
artifactRepositoryRef:
configMap: workflow-controller-configmap
key: artifactRepository
archiveLogs: true
podGC:
strategy: OnWorkflowCompletion
artifactGC:
strategy: OnWorkflowDeletion
templates:
- name: init-dag
dag:
tasks:
- name: start-dp
template: start-deployment
outputs:
artifacts:
- name: index
from: "{{tasks.start-dp.outputs.artifacts.index}}"
- name: start-deployment
outputs:
artifacts:
- name: index
path: /tmp/index.html
container:
name: ''
image: alpine/git
command:
- sh
- '-c'
args:
- >-
git clone https://github.com/octocat/Spoon-Knife.git
&& cd Spoon-Knife
&& cat index.html > /tmp/index.html
Then I manually deleted artifacts and then I tried to delete the workflow and it didn't get deleted. It means that if Argo doesn't find corresponding artifact for the workflow, the workflow doesn't get deleted whatsoever. I think this shouldn't be the case and workflow should be deleted even if artifact doesn't exist anymore.
Note: I save my artifacts using blobNameFormat: "{{workflow.name}}/{{pod.name}}"
I was able to delete the workflows by removing the finalizer entry manually. Note that the --force option did not work. Below are the logs from my wfdel and wfcomp pods. I see no apparent issues; can anyone point me in the right direction to understand the issue here?
kubectl logs workflow-z5pz2-artgc-wfdel-1494256823
time="2023-03-23T18:36:21.278Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log"
time="2023-03-23T18:36:21.278Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.493Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-save-to-s3-1433603872/main.log
time="2023-03-23T18:36:21.611Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log"
time="2023-03-23T18:36:21.611Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.667Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/main.log
kubectl logs workflow-z5pz2-artgc-wfcomp-1494256823
time="2023-03-23T18:36:21.146Z" level=info msg="S3 Delete artifact: key: workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz"
time="2023-03-23T18:36:21.146Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-03-23T18:36:21.429Z" level=info msg="Deleting object from s3" bucket=argo-workflows-artifacts endpoint=s3.***.amazonaws.com key=workflow-z5pz2/workflow-z5pz2-create-dataset-1776875532/dataset.tgz
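For anyone hitting the same wall, the manual finalizer removal mentioned above can be done with kubectl patch. This is a workaround sketch, not an official fix: the workflow name and namespace below are placeholders, and clearing the finalizer list means any pending ArtifactGC will simply never run for that Workflow.

```shell
# Workaround sketch: clear the Workflow's finalizer list so Kubernetes can
# finish the deletion. "workflow-z5pz2" and "argo" are placeholder names.
PATCH='{"metadata":{"finalizers":null}}'
if command -v kubectl >/dev/null 2>&1; then
  # "|| true" keeps this sketch harmless when no cluster is reachable.
  kubectl patch workflow workflow-z5pz2 -n argo --type merge -p "$PATCH" || true
fi
```

After the patch, `kubectl delete workflow` (or a previously issued delete) should complete immediately, since the finalizer was the only thing blocking it.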
Yeah, I don't see an error in those logs. Do you see any "Condition" added to the Workflow Status? If it's repeatable with a given Workflow, maybe you can write up a new bug and direct me to it?
Thanks for taking a look @juliev0, I submitted a new bug in #10840
I am getting the same issue as everyone else here. However, I am trying to replicate it with a simpler workflow and can't find what triggers it. What I noticed is that when I delete a healthy workflow, Argo spawns a new pod that takes care of deleting the artifacts, and then the workflow gets deleted.
With an unhealthy workflow I don't see the same: it doesn't spawn any pod, and the workflow doesn't get deleted.
By unhealthy I mean that it shows the error "Artifact garbage collection failed".
So, I tried one thing. I ran this workflow:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: start-deployment
spec:
  entrypoint: init-dag
  artifactRepositoryRef:
    configMap: workflow-controller-configmap
    key: artifactRepository
  archiveLogs: true
  podGC:
    strategy: OnWorkflowCompletion
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: init-dag
      dag:
        tasks:
          - name: start-dp
            template: start-deployment
      outputs:
        artifacts:
          - name: index
            from: "{{tasks.start-dp.outputs.artifacts.index}}"
    - name: start-deployment
      outputs:
        artifacts:
          - name: index
            path: /tmp/index.html
      container:
        name: ''
        image: alpine/git
        command:
          - sh
          - '-c'
        args:
          - >-
            git clone https://github.com/octocat/Spoon-Knife.git
            && cd Spoon-Knife
            && cat index.html > /tmp/index.html
Then I manually deleted the artifacts and tried to delete the workflow, and it didn't get deleted. It seems that if Argo can't find the corresponding artifact for a workflow, the workflow never gets deleted. I think this shouldn't be the case: the workflow should be deleted even if the artifact no longer exists.
Note: I save my artifacts using blobNameFormat: "{{workflow.name}}/{{pod.name}}"
There is a Finalizer on the Workflow which prevents Workflow deletion until all artifacts have been deleted. Did you inspect your Workflow's Status (kubectl get workflow <name> -o yaml) to see if there was more information about why the ArtifactGC failed? You can also look at the Workflow Controller log and the GC Pod logs. There is now an enhancement issue, which you might be interested in, for which I have submitted a PR.
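As a quicker way to check the Status than reading the whole YAML, the conditions can be pulled out with a jsonpath expression. A sketch, with placeholder workflow name and namespace; my understanding is that an ArtifactGC failure shows up as a condition of type ArtifactGCError:

```shell
# Print each status condition's type, status, and message for a Workflow.
# "workflow-z5pz2" and "argo" are placeholder names.
JSONPATH='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
if command -v kubectl >/dev/null 2>&1; then
  # "|| true" keeps this sketch harmless when no cluster is reachable.
  kubectl get workflow workflow-z5pz2 -n argo -o jsonpath="$JSONPATH" || true
fi
```

The message field of such a condition is usually the fastest pointer to which artifact deletion failed and why.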
Pre-requisites
What happened/what you expected to happen?
I enabled artifacts with Azure Blob Storage as the artifact repository, and I wrote workflows that use artifacts for passing parameters. However, I found that some of these workflows cannot be deleted, whether from the Argo UI or with the argo delete command. Even after I updated Argo Workflows, all workflow records were cleaned up except these. I don't know how to delete them.
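For context, an output artifact stored in Azure Blob Storage combined with an ArtifactGC strategy might be configured roughly as in the following sketch; the endpoint, container, and secret names here are placeholders, not taken from the reporter's setup:

```yaml
# Sketch only -- placeholder endpoint/container/secret names.
spec:
  artifactGC:
    strategy: OnWorkflowDeletion   # adds the finalizer discussed above
  templates:
    - name: produce
      container:
        image: alpine:3.18
        command: [sh, -c, "echo hello > /tmp/result.txt"]
      outputs:
        artifacts:
          - name: result
            path: /tmp/result.txt
            azure:
              endpoint: https://myaccount.blob.core.windows.net  # placeholder
              container: my-container                            # placeholder
              blob: "{{workflow.name}}/{{pod.name}}/result.txt"
              accountKeySecret:
                name: my-azure-credentials                       # placeholder
                key: account-access-key
```

With OnWorkflowDeletion set, the controller adds a finalizer and only removes it once the GC pod has deleted the blobs, which is why a failed GC leaves the Workflow stuck in Terminating.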
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container