argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

`ownerReference` is not validated, causing unhandled error across namespaces #13391

Open Aransh opened 4 months ago

Aransh commented 4 months ago

Pre-requisites

What happened? What did you expect to happen?

When deploying a resource as part of a workflow with setOwnerReference enabled, Argo Workflows does not validate the generated ownerReference. As I had to learn the hard way, in Kubernetes, "Cross-namespace owner references are disallowed by design" (see https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/). So if a Workflow creates a resource in another namespace and has setOwnerReference enabled, it will create an invalid ownerReference, causing Kubernetes' garbage collector to immediately remove the resource and the workflow to fail with no explanation.

Ideally, I would expect Argo Workflows to detect this case and print logs about it as part of the step.
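
Kubernetes accepts the manifest at admission time and only the garbage collector later treats the cross-namespace ownerReference as invalid, which is why the deletion looks unexplained. A minimal sketch of the pre-flight check the executor could run before injecting the ownerReference (validateOwnerReference and ownerRef are hypothetical names, not current Argo code):

```go
package main

import "fmt"

// ownerRef holds only the fields relevant to the namespace check.
// In Argo this information would come from the Workflow and the
// manifest's metadata.namespace.
type ownerRef struct {
	ownerNamespace    string // namespace of the owning Workflow
	resourceNamespace string // namespace the manifest targets ("" = same as owner)
}

// validateOwnerReference mirrors the Kubernetes garbage-collector rule:
// a namespaced dependent may only reference an owner in its own namespace.
func validateOwnerReference(r ownerRef) error {
	if r.resourceNamespace != "" && r.resourceNamespace != r.ownerNamespace {
		return fmt.Errorf(
			"cannot set ownerReference: owner Workflow is in namespace %q but resource is in %q; "+
				"cross-namespace owner references are disallowed and the resource would be garbage-collected",
			r.ownerNamespace, r.resourceNamespace)
	}
	return nil
}

func main() {
	// The situation from this issue: Workflow in argo-workflows, Job in default.
	err := validateOwnerReference(ownerRef{
		ownerNamespace:    "argo-workflows",
		resourceNamespace: "default",
	})
	fmt.Println(err)
}
```

Logging such an error at step startup would turn the silent garbage-collection into an actionable message.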

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo-workflows
  name: test-argo-workflow
spec:
  entrypoint: main
  serviceAccountName: argo-workflow
  templates:
  - name: main
    steps:
      - - name: test
          template: test
  - name: test
    resource:
      action: create
      setOwnerReference: true
      successCondition: status.succeeded > 0
      failureCondition: status.failed > 3
      manifest: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: pi-job-
          namespace: default
        spec:
          template:
            metadata:
              name: pi
            spec:
              containers:
              - name: pi
                image: perl
                command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4
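
Since cross-namespace owner references are disallowed, one workaround is to create the Job in the Workflow's own namespace (argo-workflows in this example); alternatively, drop setOwnerReference for resources in other namespaces. A sketch of the first option, adapted from the template above:

```yaml
  - name: test
    resource:
      action: create
      setOwnerReference: true
      successCondition: status.succeeded > 0
      failureCondition: status.failed > 3
      manifest: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: pi-job-
          # Same namespace as the owning Workflow, so the generated
          # ownerReference is valid and the Job is not garbage-collected.
          namespace: argo-workflows
        spec:
          template:
            spec:
              containers:
              - name: pi
                image: perl
                command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4
```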

Logs from the workflow controller

time="2024-07-24T08:38:03 UTC" level=info msg="Starting Workflow Executor" version=v3.5.8
time="2024-07-24T08:38:03 UTC" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-07-24T08:38:03 UTC" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=test-argo-workflow-test-2378761289 templateName=test version="&Version{Version:v3.5.8,BuildDate:2024-06-18T03:43:17Z,GitCommit:3bb637c0261f8c08d4346175bb8b1024719a1f11,GitTag:v3.5.8,GitTreeState:clean,GoVersion:go1.21.10,Compiler:gc,Platform:linux/amd64,}"
time="2024-07-24T08:38:03 UTC" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2024-07-24T08:38:03 UTC" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2024-07-24T08:38:04 UTC" level=info msg="Resource: default/job.batch/pi-job-2kdlw. SelfLink: apis/batch/v1/namespaces/default/jobs/pi-job-2kdlw"
time="2024-07-24T08:38:04 UTC" level=info msg="Waiting for conditions: status.succeeded>0"
time="2024-07-24T08:38:04 UTC" level=info msg="Failing for conditions: status.failed>3"
time="2024-07-24T08:38:04 UTC" level=info msg="failure condition '{status.failed gt [3]}' evaluated false"
time="2024-07-24T08:38:04 UTC" level=info msg="success condition '{status.succeeded gt [0]}' evaluated false"
time="2024-07-24T08:38:04 UTC" level=info msg="0/1 success conditions matched"
time="2024-07-24T08:38:04 UTC" level=info msg="Waiting for resource job.batch/pi-job-2kdlw in namespace default resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."
time="2024-07-24T08:38:09 UTC" level=warning msg="Non-transient error: The resource has been deleted while its status was still being checked. Will not be retried: jobs.batch \"pi-job-2kdlw\" not found"
time="2024-07-24T08:38:09 UTC" level=warning msg="Waiting for resource job.batch/pi-job-2kdlw in namespace default resulted in non-retryable error: The resource has been deleted while its status was still being checked. Will not be retried: jobs.batch \"pi-job-2kdlw\" not found"
time="2024-07-24T08:38:09 UTC" level=warning msg="Waiting for resource job.batch/pi-job-2kdlw resulted in error The resource has been deleted while its status was still being checked. Will not be retried: jobs.batch \"pi-job-2kdlw\" not found"
time="2024-07-24T08:38:09 UTC" level=error msg="executor error: The resource has been deleted while its status was still being checked. Will not be retried: jobs.batch \"pi-job-2kdlw\" not found"
time="2024-07-24T08:38:09 UTC" level=fatal msg="The resource has been deleted while its status was still being checked. Will not be retried: jobs.batch \"pi-job-2kdlw\" not found"
time="2024-07-24T08:38:09 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1

Logs from your workflow's wait container

Empty
agilgur5 commented 4 months ago

> and has setOwnerReference enabled, it will create an invalid ownerReference, causing Kubernetes' garbage collector to immediately remove the resource, causing the workflow to fail with no explanation.

I'm surprised that k8s itself doesn't fail validation on this resource. The Controller does not currently validate the resource at all (it doesn't necessarily have schemas of all possible resources to do so) and leaves that to k8s. While this specific case of namespacing could be handled within Argo, the general case of k8s not validating sounds like an upstream issue.

tooptoop4 commented 2 weeks ago

maybe just need a docs entry about: don't use setOwnerReference else you may see error "The resource has been deleted while its status was still being checked."

Aransh commented 2 weeks ago

> maybe just need a docs entry about: don't use setOwnerReference else you may see error "The resource has been deleted while its status was still being checked."

That could also work; it would've saved me a lot of time.