argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

once-off error under load, `{{retries}}` not replaced #13799

Open tooptoop4 opened 1 month ago

tooptoop4 commented 1 month ago

Pre-requisites

What happened? What did you expect to happen?

My workflow is a DAG and contains a snippet like this:

  templates:
    - name: redact-wf
      dag:
        tasks:
        - name: redact
          depends: redact.Succeeded
          templateRef:
            name: redact
            template: main
          arguments:
            parameters:
            - name: redact_name
              value: "redact-{{workflow.name}}-{{retries}}"

There have been hundreds of successful runs of this workflow, but one run failed. My own code inside the pod tries to parse the section after the last hyphen as an int, and it failed because the parameter that went into the pod did not have the {{retries}} value substituted. Strangely, the UI shows it as replaced with 0 for retries, but the pod logs below show that it wasn't.
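As a stopgap on the consumer side, the parsing described above could fail with an explicit message when a placeholder reaches the pod unsubstituted. This is a minimal sketch, not code from the reporter's image: `parse_retry_suffix` and `UnresolvedTemplateError` are hypothetical names, and it assumes the parameter has the shape `redact-<workflow name>-<retries>`.

```python
import re

# Matches any leftover template tag such as {{retries}}.
UNRESOLVED_TAG = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


class UnresolvedTemplateError(ValueError):
    """Raised when a parameter still contains a {{...}} placeholder."""


def parse_retry_suffix(name: str) -> int:
    """Return the integer after the last hyphen of `name`.

    Hypothetical helper: it rejects values that still contain an
    unsubstituted template tag before attempting the int conversion.
    """
    match = UNRESOLVED_TAG.search(name)
    if match:
        raise UnresolvedTemplateError(
            f"parameter {name!r} contains unresolved tag {match.group(1)!r}"
        )
    # rpartition splits on the LAST hyphen, so hyphens in the
    # workflow name do not interfere.
    _, sep, suffix = name.rpartition("-")
    if not sep or not suffix.isdecimal():
        raise ValueError(f"parameter {name!r} has no numeric retry suffix")
    return int(suffix)
```

With this in place, a value like `redactwf-redact-{{retries}}` would surface as an `UnresolvedTemplateError` naming the tag, which is easier to triage than a bare int conversion failure. It does not address the missing substitution itself.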

controller logs:

time="2024-10-21T20:01:58.682Z" level=info msg="Transient error: Operation cannot be fulfilled on resourcequotas \"myresourcequota\": the object has been modified; please apply your changes to the latest version and try again"
time="2024-10-21T20:01:58.682Z" level=info msg="Transient error: Operation cannot be fulfilled on resourcequotas \"myresourcequota\": the object has been modified; please apply your changes to the latest version and try again"
time="2024-10-21T20:01:58.682Z" level=info msg="Mark node redact(0).redact(0)[1].redact(0) as Pending, due to: Operation cannot be fulfilled on resourcequotas \"myresourcequota\": the object has been modified; please apply your changes to the latest version and try again" namespace=redact workflow=redact
time="2024-10-21T20:01:58.683Z" level=info msg="node redact-redactid message: Operation cannot be fulfilled on resourcequotas \"myresourcequota\": the object has been modified; please apply your changes to the latest version and try again" namespace=redact workflow=redact
time="2024-10-21T20:02:17.877Z" level=info msg="node changed" namespace=redact new.message= new.phase=Running new.progress=0/1 nodeID=redact-redactid old.message="Operation cannot be fulfilled on resourcequotas \"myresourcequota\": the object has been modified; please apply your changes to the latest version and try again" old.phase=Pending old.progress=0/1 workflow=redact

My pod logs show that {{retries}} was not replaced with 0:

time="2024-10-21T20:02:08.957Z" level=info msg="Executor initialized" deadline="2024-10-redact 09:00:45 +0000 UTC" includeScriptOutput=false namespace=redact podName=redact template="{\"name\":\"redact\",\"inputs\":{\"parameters\":[{\"name\":\"job_name\",\"value\":\"redactwf-redact-{{retries}}\"}

My ResourceQuota limits were not hit, but the controller was busy with hundreds of "cleaning up pod" operations from a different workflow.

This may be related to https://github.com/argoproj/argo-workflows/blob/v3.4.11/util/template/expression_template.go#L35-L40. It seems that allowUnresolved is passed in as true at https://github.com/argoproj/argo-workflows/blame/v3.4.11/workflow/common/util.go#L286. See also https://github.com/argoproj/argo-workflows/issues/13123, although I don't use podSpecPatch.

It was able to replace {{workflow.name}} but not {{retries}}.
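To illustrate how a permissive substitution pass could produce exactly this partial result, here is a toy model. It is not Argo's implementation: `substitute`, `TAG`, and the `scope` dictionary are hypothetical, the `allow_unresolved` flag is only named after the `allowUnresolved` parameter mentioned above, and the premise that `retries` was missing from the scope during one pass is an unconfirmed guess.

```python
import re

# Simplified tag syntax: {{dotted.name}}; real templates support more.
TAG = re.compile(r"\{\{\s*([\w.\-]+)\s*\}\}")


def substitute(text: str, scope: dict, allow_unresolved: bool) -> str:
    """Replace {{key}} tags in `text` with values from `scope`.

    Toy model only. With allow_unresolved=True, unknown tags are left
    in place instead of raising, so a missing key goes unnoticed.
    """
    def replace(match):
        key = match.group(1)
        if key in scope:
            return scope[key]
        if allow_unresolved:
            # Keep the original tag text untouched.
            return match.group(0)
        raise KeyError(f"failed to resolve tag {key!r}")

    return TAG.sub(replace, text)


template = "redact-{{workflow.name}}-{{retries}}"
# Assumption for illustration: `retries` is absent from the scope.
scope = {"workflow.name": "my-wf"}
print(substitute(template, scope, allow_unresolved=True))
# prints: redact-my-wf-{{retries}}
```

In this model, the same call with `allow_unresolved=False` raises instead of silently passing the tag through, which is the difference that would turn a silent pass-through into a visible error.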

Version(s)

3.4.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
n/a

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
n/a

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

tooptoop4 commented 1 week ago

Maybe similar to https://github.com/argoproj/argo-workflows/issues/13780