argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.08k stars · 3.2k forks

`retentionPolicy.completed` actually means `succeeded` #12135

Open devstewart opened 1 year ago

devstewart commented 1 year ago

What happened/what you expected to happen?

Submitted workflow which is expected to fail and it does, but the workflow gets deleted immediately.

[screenshot]

Within seconds, the workflow is gone.

I expected failed workflows to remain, as they did in earlier versions; I have not configured the workflow-controller-configmap to delete them. Relevant configuration:

retentionPolicy: |
  completed: 400
workflowDefaults: |
  spec:
    activeDeadlineSeconds: 86400
    podGC:
      strategy: OnPodSuccess

Version

v3.5.0

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
spec:
  entrypoint: helloworld
  templates:
  - name: helloworld
    container:
      image: ghcr.io/openzipkin/alpine:3.18.2
      command: ["/bin/sh"]
      args: ["-c", "exit 1"]

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2023-11-03T20:38:15.027Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.032Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.033Z" level=info msg="Pod node hello-world-59xpp initialized Pending" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="Created pod: hello-world-59xpp (hello-world-59xpp)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.057Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:15.070Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1703572159 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.061Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.062Z" level=info msg="Pod failed: Error (exit code 1)" displayName=hello-world-59xpp namespace=argo pod=hello-world-59xpp templateName=helloworld workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=hello-world-59xpp old.message= old.phase=Pending old.progress=0/1 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Updated message  -> Error (exit code 1)" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.063Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-world-59xpp
time="2023-11-03T20:38:25.068Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-world-59xpp-1340600742-agent/deletePod
time="2023-11-03T20:38:25.071Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=1703572339 workflow=hello-world-59xpp
time="2023-11-03T20:38:25.073Z" level=info msg="Queueing Failed workflow argo/hello-world-59xpp for delete due to max rention(0 workflows)"
time="2023-11-03T20:38:25.073Z" level=info msg="Deleting garbage collected workflow 'argo/hello-world-59xpp'"
time="2023-11-03T20:38:25.081Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello-world-59xpp/labelPodCompleted
time="2023-11-03T20:38:25.081Z" level=info msg="Successfully request 'argo/hello-world-59xpp' to be deleted"

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

time="2023-11-03T20:38:19.525Z" level=info msg="Creating minio client using static credentials" endpoint=deap-api.decloud.xxx.com
time="2023-11-03T20:38:19.525Z" level=info msg="Saving file to s3" bucket=deap-argo-prod endpoint=deap-api.decloud.xxx.com key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log" path=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Save artifact" artifactName=main-logs duration=94.016454ms error="<nil>" key="dev\\ /2023\\ /11\\ /03\\ /hello-world-59xpp\\ /hello-world-59xpp\"/main.log"
time="2023-11-03T20:38:19.594Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-03T20:38:19.594Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-03T20:38:19.604Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:default\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-11-03T20:38:19.605Z" level=warning msg="Non-transient error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=error msg="executor error: pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
time="2023-11-03T20:38:19.606Z" level=info msg="Alloc=12580 TotalAlloc=18686 Sys=30309 NumGC=4 Goroutines=10"
time="2023-11-03T20:38:19.606Z" level=fatal msg="pods \"hello-world-59xpp\" is forbidden: User \"system:serviceaccount:argo:default\" cannot patch resource \"pods\" in API group \"\" in the namespace \"argo\""
tooptoop4 commented 1 year ago

@devstewart do you have a retentionPolicy block like https://github.com/argoproj/argo-workflows/blob/f88ae8885745d718dd1cebd9ca7ac63349133d1c/manifests/quick-start-minimal.yaml#L1448 ?

agilgur5 commented 1 year ago

Looks like this is a follow-up to #12082

@tooptoop4 indeed, per that discussion, they do have a retentionPolicy

devstewart commented 1 year ago

@devstewart do you have a retentionPolicy block like https://github.com/argoproj/argo-workflows/blob/f88ae8885745d718dd1cebd9ca7ac63349133d1c/manifests/quick-start-minimal.yaml#L1448 ?

Yes, our retention policy is set to 400, but we only get about 100-200 new workflows a day, and the failed workflows are deleted immediately.

devstewart commented 1 year ago

Here's the yaml for our workflow-controller-configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |-
    # archiveLogs will archive the main container logs as an artifact
    archiveLogs: true

    s3:
      endpoint: deap-api.xxx.com
      bucket: deap-argo-prod
      insecure: false
      keyFormat: stage\
        /{{workflow.creationTimestamp.Y}}\
        /{{workflow.creationTimestamp.m}}\
        /{{workflow.creationTimestamp.d}}\
        /{{workflow.name}}\
        /{{pod.name}}"
      accessKeySecret:
        name: argo-minio-credentials
        key: accessKey
      secretKeySecret:
        name: argo-minio-credentials
        key: secretKey
  executor: |
    imagePullPolicy: IfNotPresent
    image: xxx.com/deap/argoexec:3.5.0.282
  retentionPolicy: |
    completed: 400
  workflowDefaults: |
    spec:
      activeDeadlineSeconds: 86400
      podGC:
        strategy: OnPodSuccess

Does completed actually mean completed, or does it actually mean succeeded?

I added:

failed: 50
errored: 50

and now it sort of works as expected.
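
For reference, with those keys added, the retentionPolicy block in the configmap would read like this (values taken from the snippets above):

retentionPolicy: |
  completed: 400
  failed: 50
  errored: 50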

@tooptoop4

devstewart commented 1 year ago

I believe I found the offending code: https://github.com/argoproj/argo-workflows/blob/80cd2ec6cffa5e33f8cce67435ba96e68a2a32b4/workflow/gccontroller/gc_controller.go#L140-L160

I believe completed and succeeded should not be interchangeable; the code should have a retentionPolicy.Succeeded in addition to retentionPolicy.Completed. retentionPolicy.Completed would then cap retention regardless of success or failure.

So if the definition were:

retentionPolicy: |
    completed: 400
    succeeded: 400
    failed: 50
    errored: 50

Then a max of 400 successful workflows would be kept, or 50 failed, or 50 errored; and once the overall cap is reached (say 300 successful, 50 failed, and 50 errored workflows), the oldest workflow across all 3 queues would be deleted first. Maybe that's overly complicated, and completed should simply be replaced with succeeded, but that would be a breaking change.

agilgur5 commented 1 year ago

Nice find @devstewart. Your interpretation sounds correct to me, and the spec refers to "Completed" as "Succeeded, Failed, or Errored" in many places.

This misnomer seems to have been missed in #6854. The original issue, #5369, does not mention "Succeeded" either, but does seem to correctly allude to "Completed" as a superset.

I'm open to either renaming or making "Completed" actually work as seemingly originally intended (and thereby adding a new "Succeeded" as well).

If we rename, we can make it backward-compatible for the time being by accepting both names and warning when it is the incorrect "Completed" name.

Making "Completed" work I think can be done without any new data or indexes. Just take the total of all 3 queues, and if needed, compare the oldest 3 to each other and GC the oldest of the 3. That may need some careful locking though if there are concurrent reads/writes to the queues.

@devstewart would you like to fix this/implement that?

devstewart commented 1 year ago

@agilgur5

@devstewart would you like to fix this/implement that?

I wouldn't mind attempting either. I haven't contributed to Argo before, so I'll get set up and see how far I can get. It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution, and it would be ideal if completed functioned as a superset here.

agilgur5 commented 1 year ago

It's a fairly small and seemingly confined piece of the code base, so it seems like an okay first contribution

Yup, my thoughts exactly! And you already found the root cause, which is often (but not always) the more difficult half of solving the problem. With the root cause down, coding the solution can be pretty straightforward, which I think it is in this case.

agilgur5 commented 10 months ago

Hi @devstewart, have you made any progress on this? I'd like to fix this bug, so if you don't have plans to work on it soon, I'll take it on myself.