argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
ArtifactGC not working on OpenShift #12316

Open remicres opened 9 months ago

remicres commented 9 months ago

What happened/what did you expect to happen?

Hi,

The artifact repository works fine, except that ArtifactGC does not work. In addition, workflows reach the "Succeeded" status, but they can only be deleted by manually setting metadata.finalizers to []; otherwise the deletion hangs.
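
For reference, the manual workaround described above amounts to a JSON merge patch that empties metadata.finalizers. Here is a minimal Python sketch, assuming kubectl is on the PATH and reusing the workflow name and namespace from the controller logs in this report. The helper name finalizer_patch_command is hypothetical, and the command is only built here, not executed:

```python
import json
import shlex


def finalizer_patch_command(workflow: str, namespace: str) -> list[str]:
    """Build (but do not run) a kubectl command that clears metadata.finalizers."""
    # JSON merge patch: replaces the finalizers list with an empty list.
    patch = json.dumps({"metadata": {"finalizers": []}})
    return [
        "kubectl", "patch", "workflow", workflow,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]


if __name__ == "__main__":
    cmd = finalizer_patch_command("artifact-gc-8s92m", "argo")
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Clearing the finalizer this way lets the deletion proceed but bypasses artifact GC, so the artifacts presumably stay in the bucket.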

The logs mention a cluster permissions issue (see the logs from the workflow controller below).

I am new to Argo Workflows and could be wrong here, but it would not be the first time I have run into this kind of permission-related issue on OpenShift.

Version

v3.4.13

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

https://github.com/argoproj/argo-workflows/blob/main/examples/output-artifact-gcs.yaml

Logs from the workflow controller

time="2023-12-03T16:24:00.301Z" level=info msg="Processing workflow" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.307Z" level=info msg="adding artifact GC finalizer" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.307Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.307Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.308Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.308Z" level=info msg="Pod node artifact-gc-8s92m initialized Pending" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.348Z" level=info msg="Created pod: artifact-gc-8s92m (artifact-gc-8s92m)" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.348Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.348Z" level=info msg=reconcileAgentPod namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:00.360Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=303305273 workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.349Z" level=info msg="Processing workflow" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="task-result changed" namespace=argo nodeID=artifact-gc-8s92m workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=artifact-gc-8s92m old.message= old.phase=Pending old.progress=0/1 workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg=reconcileAgentPod namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.350Z" level=info msg="Marking workflow completed" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.356Z" level=info msg="cleaning up pod" action=deletePod key=argo/artifact-gc-8s92m-1340600742-agent/deletePod
time="2023-12-03T16:24:10.361Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=303305385 workflow=artifact-gc-8s92m
time="2023-12-03T16:24:10.377Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/artifact-gc-8s92m/labelPodCompleted
time="2023-12-03T16:24:20.362Z" level=info msg="Processing workflow" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:20.362Z" level=info msg="Creating Artifact GC Task artifact-gc-8s92m-artgc-wfcomp-2632535418-0" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:20.371Z" level=info msg="creating pod to delete artifacts: artifact-gc-8s92m-artgc-wfcomp-2632535418" namespace=argo strategy=OnWorkflowCompletion workflow=artifact-gc-8s92m
time="2023-12-03T16:24:20.375Z" level=error msg="failed to GC artifacts" error="failed to create pod: pods \"artifact-gc-8s92m-artgc-wfcomp-2632535418\" is forbidden: unable to validate against any security context constraint: [provider \"anyuid\": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .containers[0].runAsUser: Invalid value: 8737: must be in the ranges: [1000710000, 1000719999], provider \"restricted\": Forbidden: not usable by user or serviceaccount, provider \"nonroot-v2\": Forbidden: not usable by user or serviceaccount, provider \"nonroot\": Forbidden: not usable by user or serviceaccount, provider \"hostmount-anyuid\": Forbidden: not usable by user or serviceaccount, provider \"machine-api-termination-handler\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork-v2\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork\": Forbidden: not usable by user or serviceaccount, provider \"hostaccess\": Forbidden: not usable by user or serviceaccount, provider \"node-exporter\": Forbidden: not usable by user or serviceaccount, provider \"privileged\": Forbidden: not usable by user or serviceaccount]" namespace=argo workflow=artifact-gc-8s92m
time="2023-12-03T16:24:20.384Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=303305476 workflow=artifact-gc-8s92m
time="2023-12-03T16:44:20.362Z" level=info msg="Processing workflow" namespace=argo workflow=artifact-gc-8s92m

Logs from your workflow's wait container

time="2023-12-03T16:24:04.866Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/on-deletion.txt.tgz -> /tmp/argo/outputs/artifacts/on-deletion.tgz"
time="2023-12-03T16:24:04.867Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/on-deletion.tgz, key: on-deletion.txt"
time="2023-12-03T16:24:04.867Z" level=info msg="Creating minio client using static credentials" endpoint=s3-data.meso.umontpellier.fr
time="2023-12-03T16:24:04.873Z" level=info msg="Saving file to s3" bucket=process-artifacts endpoint=s3-data.meso.umontpellier.fr key=on-deletion.txt path=/tmp/argo/outputs/artifacts/on-deletion.tgz
time="2023-12-03T16:24:05.198Z" level=info msg="Save artifact" artifactName=on-deletion duration=331.707731ms error="<nil>" key=on-deletion.txt
time="2023-12-03T16:24:05.198Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/on-deletion.tgz
time="2023-12-03T16:24:05.198Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/on-deletion.tgz"
time="2023-12-03T16:24:05.227Z" level=info msg="Alloc=7242 TotalAlloc=18010 Sys=32893 NumGC=5 Goroutines=12"
time="2023-12-03T16:24:05.227Z" level=info msg="Deadline monitor stopped"
time="2023-12-03T16:24:05.228Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

remicres commented 9 months ago

Looking at the artifact GC code, I saw this line:

...
RunAsUser:                pointer.Int64Ptr(8737),
...

And if we look at the error message reported in the issue, it seems that OpenShift requires the user ID to be within a certain range:

"failed to GC artifacts" 
error="failed to create pod: 
  pods "artifact-gc-nn4dj-artgc-wfcomp-3200369636" is forbidden: 
    unable to validate against any security context constraint: [
      provider "anyuid": Forbidden: not usable by user or serviceaccount, 
      provider restricted-v2: .containers[0].runAsUser: Invalid value: 8737: must be in the ranges: [1000710000, 1000719999], 
...]" namespace=argo workflow=artifact-gc-nn4dj

Could the GC pod UID be the cause?
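
To make the comparison concrete, here is a small Python sketch that pulls the requested UID and the allowed range out of the restricted-v2 message quoted above and checks one against the other. The function names are hypothetical, and the regular expression only targets the wording seen in this log, which may differ between OpenShift versions:

```python
import re

# Fragment of the SCC rejection quoted above.
SCC_ERROR = (
    "provider restricted-v2: .containers[0].runAsUser: Invalid value: 8737: "
    "must be in the ranges: [1000710000, 1000719999]"
)

_PATTERN = re.compile(
    r"runAsUser: Invalid value: (\d+): must be in the ranges: \[(\d+), (\d+)\]"
)


def parse_uid_rejection(message: str):
    """Return (requested_uid, low, high) from an SCC runAsUser rejection, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    uid, low, high = (int(group) for group in match.groups())
    return uid, low, high


def uid_allowed(uid: int, low: int, high: int) -> bool:
    """True if uid falls inside the inclusive range reported by the SCC."""
    return low <= uid <= high


if __name__ == "__main__":
    uid, low, high = parse_uid_rejection(SCC_ERROR)
    print(uid, uid_allowed(uid, low, high))  # the hard-coded GC pod UID
    print(low, uid_allowed(low, low, high))  # first UID of the namespace range
```

With the values from this log, the hard-coded 8737 falls outside the range, which is consistent with the restricted-v2 rejection.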

Joibel commented 9 months ago

You should be able to use the podSpecPatch as part of https://argoproj.github.io/argo-workflows/fields/#workflowlevelartifactgc to modify this and prove your idea.

remicres commented 8 months ago

Thanks, I'll try this ASAP. First I need to upgrade Argo: I am on version 3.4.13, and unfortunately for me podSpecPatch and forceFinalizerRemoval only arrive with 3.5.0. I will keep you updated.

remicres commented 8 months ago

After upgrading Argo Workflows to 3.5.2 and applying podSpecPatch like this:

...
  artifactGC:
    strategy: OnWorkflowDeletion
    forceFinalizerRemoval: true
    podSpecPatch: '{"containers":[{"name":"main", "securityContext":{"runAsUser":1000710000}}]}'
...
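
As a side note, the podSpecPatch value is a JSON string embedded in YAML, which is easy to misquote by hand. Here is a minimal Python sketch that generates it; the helper name gc_pod_spec_patch is hypothetical, and the UID is the lower bound of the range reported for this namespace (on OpenShift that range is, as far as I know, taken from the namespace's openshift.io/sa.scc.uid-range annotation, so it will differ elsewhere):

```python
import json


def gc_pod_spec_patch(run_as_user: int, container: str = "main") -> str:
    """Serialize a podSpecPatch that overrides runAsUser for one container."""
    patch = {
        "containers": [
            {"name": container, "securityContext": {"runAsUser": run_as_user}}
        ]
    }
    # Compact separators keep the string short enough for a one-line YAML value.
    return json.dumps(patch, separators=(",", ":"))


if __name__ == "__main__":
    # Paste the printed string, wrapped in single quotes, as the podSpecPatch value.
    print(gc_pod_spec_patch(1000710000))
```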

I still can't get GC working, and I still have to remove the finalizer manually (forceFinalizerRemoval does not seem to work).

The error is different though:

(controller logs)

...
time="2024-01-02T19:23:23.578Z" level=info msg="Creating Artifact GC Task myarticho-artgc-wfcomp-2166136261-0" namespace=argo workflow=myarticho
time="2024-01-02T19:23:23.610Z" level=info msg="creating pod to delete artifacts: myarticho-artgc-wfcomp-2166136261" namespace=argo strategy=OnWorkflowCompletion workflow=myarticho
time="2024-01-02T19:23:23.614Z" level=error msg="failed to GC artifacts" error="failed to create pod: pods \"myarticho-artgc-wfcomp-2166136261\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>" namespace=argo workflow=myarticho
...

I have checked the ClusterRole, and I have the following (which looks fine?):

rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/exec
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  - persistentvolumeclaims/finalizers
  verbs:
  - create
  - update
  - delete
  - get
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  - workflows/finalizers
  - workflowtasksets
  - workflowtasksets/finalizers
  - workflowartifactgctasks
  verbs:
  - get
  - list
  - watch
  - update
  - patch
  - delete
  - create
- apiGroups:
  - argoproj.io
  resources:
  - workflowtemplates
  - workflowtemplates/finalizers
  - clusterworkflowtemplates
  - clusterworkflowtemplates/finalizers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - argoproj.io
  resources:
  - workflowtaskresults
  verbs:
  - list
  - watch
  - deletecollection
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - get
  - list
- apiGroups:
  - argoproj.io
  resources:
  - cronworkflows
  - cronworkflows/finalizers
  verbs:
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - create
  - get
  - delete

Now I really don't see why the pod creation is still rejected.
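
One thing that can be checked mechanically: the "cannot set blockOwnerDeletion" message comes from a Kubernetes admission check that requires update permission on the finalizers subresource of the owner named in the pod's ownerReferences. Which resource owns the GC pod is not stated in this thread, so the Python sketch below only tests the argoproj.io rule quoted above against two candidate owners. It is a simplified matcher with a hypothetical name, not the real RBAC evaluator, and not a confirmed diagnosis:

```python
# Subset of the ClusterRole rules quoted above (argoproj.io workflow rule only).
RULES = [
    {
        "apiGroups": ["argoproj.io"],
        "resources": [
            "workflows", "workflows/finalizers",
            "workflowtasksets", "workflowtasksets/finalizers",
            "workflowartifactgctasks",
        ],
        "verbs": ["get", "list", "watch", "update", "patch", "delete", "create"],
    },
]


def allows(rules, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants `verb` on `resource` in `api_group`.

    Simplified matching: exact names plus the "*" wildcard only;
    resourceNames and "*/subresource" patterns are not handled.
    """
    def matches(values, wanted):
        return "*" in values or wanted in values

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in rules
    )


if __name__ == "__main__":
    for resource in ("workflows/finalizers", "workflowartifactgctasks/finalizers"):
        print(resource, allows(RULES, "argoproj.io", resource, "update"))
```

In the rules as quoted, workflowartifactgctasks is listed without a matching workflowartifactgctasks/finalizers entry, so if the GC pod is owned by a WorkflowArtifactGCTask (an assumption), that missing subresource might be worth testing.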