devstewart opened this issue 1 year ago
@devstewart do you have a retentionPolicy block like https://github.com/argoproj/argo-workflows/blob/f88ae8885745d718dd1cebd9ca7ac63349133d1c/manifests/quick-start-minimal.yaml#L1448 ?
Looks like this is a follow-up to #12082
@tooptoop4 indeed, per that discussion, they do have a retentionPolicy
> @devstewart do you have a retentionPolicy block like https://github.com/argoproj/argo-workflows/blob/f88ae8885745d718dd1cebd9ca7ac63349133d1c/manifests/quick-start-minimal.yaml#L1448 ?
Yes, our retention policy is set to `completed: 400`, but we only get about 100-200 new workflows a day, and the failed workflows are deleted immediately.

Here's the YAML for our `workflow-controller-configmap`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |-
    # archiveLogs will archive the main container logs as an artifact
    archiveLogs: true
    s3:
      endpoint: deap-api.xxx.com
      bucket: deap-argo-prod
      insecure: false
      keyFormat: "stage\
        /{{workflow.creationTimestamp.Y}}\
        /{{workflow.creationTimestamp.m}}\
        /{{workflow.creationTimestamp.d}}\
        /{{workflow.name}}\
        /{{pod.name}}"
      accessKeySecret:
        name: argo-minio-credentials
        key: accessKey
      secretKeySecret:
        name: argo-minio-credentials
        key: secretKey
  executor: |
    imagePullPolicy: IfNotPresent
    image: xxx.com/deap/argoexec:3.5.0.282
  retentionPolicy: |
    completed: 400
  workflowDefaults: |
    spec:
      activeDeadlineSeconds: 86400
      podGC:
        strategy: OnPodSuccess
```
Does `completed` actually mean completed, or is it actually success?
I added:

```yaml
failed: 50
errored: 50
```

and now it sort of works as expected.
@tooptoop4
I believe I found the offending code: https://github.com/argoproj/argo-workflows/blob/80cd2ec6cffa5e33f8cce67435ba96e68a2a32b4/workflow/gccontroller/gc_controller.go#L140-L160

I believe the meanings of "completed" and "successful" should not be interchangeable, and the code should have a `retentionPolicy.Succeeded` in addition to `retentionPolicy.Completed`. `retentionPolicy.Completed` would then allow a max retention policy regardless of success or failure.
So if the definition was:

```yaml
retentionPolicy: |
  completed: 400
  succeeded: 400
  failed: 50
  errored: 50
```
Then a max of 400 successful workflows would be kept, along with up to 50 failed and 50 errored; and once the combined total reached the `completed` limit of 400 (say, 300 successful, 50 failed, and 50 errored workflows), the oldest workflow across all 3 queues would be deleted next. Maybe that's overly complicated, and `completed` should simply be replaced with `succeeded`, but that would be a breaking change.
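To make the proposal concrete, here's a rough sketch of how the retention policy config struct could look with the new field. The `Succeeded` field and the comments are my additions, so treat the names as hypothetical rather than the actual definition in the config package:

```go
// Hypothetical package placement; the real struct lives in Argo's config package.
package config

// RetentionPolicy sketch with the proposed Succeeded field added alongside
// the existing keys. Completed becomes a cap across all finished workflows
// rather than a synonym for success.
type RetentionPolicy struct {
	Completed int `json:"completed,omitempty"` // max finished workflows, regardless of phase
	Succeeded int `json:"succeeded,omitempty"` // max successful workflows (proposed addition)
	Failed    int `json:"failed,omitempty"`    // max failed workflows
	Errored   int `json:"errored,omitempty"`   // max errored workflows
}
```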
Nice find @devstewart. Your interpretation sounds correct to me, and the spec refers to "Completed" as "Succeeded, Failed, or Errored" in many places.
This misnomer seems to have been missed in #6854. The original issue, #5369, does not mention "Succeeded" either, but does seem to correctly allude to "Completed" as a superset.
I'm open to either renaming or making "Completed" actually work as seemingly originally intended (and thereby adding a new "Succeeded" as well).
If we rename, we can make it backward-compatible for the time being by accepting both names and warning when it is the incorrect "Completed" name.
Making "Completed" work can, I think, be done without any new data or indexes: just take the total of all 3 queues and, if needed, compare the oldest entry of each queue and GC the oldest of the 3. That may need some careful locking, though, if there are concurrent reads/writes to the queues.
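Roughly, a minimal sketch of that GC pass, assuming each phase keeps its own queue ordered oldest-first. The types and function here are hypothetical stand-ins, not the controller's actual ones, and the locking mentioned above is omitted:

```go
package gc

import "time"

// wfRef is a hypothetical stand-in for a finished workflow in a GC queue.
type wfRef struct {
	name       string
	finishedAt time.Time
}

// enforceCompletedLimit treats `completed` as a cap across all finished
// workflows: while the combined size of the per-phase queues exceeds the
// limit, compare the oldest entry of each queue and GC the oldest overall.
// Each queue is assumed to be ordered oldest-first.
func enforceCompletedLimit(limit int, queues map[string][]wfRef, gc func(wfRef)) {
	for {
		total := 0
		for _, q := range queues {
			total += len(q)
		}
		if total <= limit {
			return
		}
		// Find the phase whose head (oldest entry) is oldest overall.
		oldest := ""
		for phase, q := range queues {
			if len(q) == 0 {
				continue
			}
			if oldest == "" || q[0].finishedAt.Before(queues[oldest][0].finishedAt) {
				oldest = phase
			}
		}
		// GC the oldest workflow and pop it from its queue.
		gc(queues[oldest][0])
		queues[oldest] = queues[oldest][1:]
	}
}
```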
@devstewart would you like to fix this/implement that?
@agilgur5

> @devstewart would you like to fix this/implement that?
I wouldn't mind attempting either. I haven't contributed to Argo before, so I'll get set up and see how far I can get. It's a fairly small and seemingly confined part of the code base, so it seems like an okay first contribution, and it would be ideal if "completed" functioned as a superset here.
> It's a fairly small and seemingly confined part of the code base, so it seems like an okay first contribution
yup, my thoughts exactly! and you already found the root cause, which is often (but not always) the more difficult half of solving the problem. if you've got the root cause down, coding the solution can sometimes be pretty straightforward, which I think it is in this case.
Hi @devstewart have you made any progress on this? I'd like to fix this bug, so if you don't have plans to work on it soon, I'll take it on myself
Pre-requisites
What happened/what you expected to happen?
I submitted a workflow that is expected to fail, and it does, but the workflow gets deleted immediately; within seconds the workflow is gone.
I expected the failed workflows to remain as they did in earlier versions. I do not have the `workflow-controller-configmap` configured to do this.

Version

v3.5.0
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container