argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

When creating an agent pod, transient error is not retried #13654

Closed fyp711 closed 1 month ago

fyp711 commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

What happened?

When I use an HTTP template, the node stays pending indefinitely. I then saw that the workflow status was Error, caused by a transient error. The error is: failed to create Agent pod. Reason: Operation cannot be fulfilled on resourcequotas "xxxx": the object has been modified; please apply your changes to the latest version and try again

What did you expect to happen?

I expect transient errors like this one to be retried when the agent pod is created, instead of failing the workflow.

Version(s)

v3.4.17

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

None

Logs from the workflow controller

None

Logs from your workflow's wait container

None

fyp711 commented 1 month ago

Is anyone paying attention to this issue?

shuangkun commented 1 month ago

Is it possible to have a stable and reproducible workflow?

fyp711 commented 1 month ago

Is it possible to have a stable and reproducible workflow?

ResourceQuota conflicts are hard to reproduce. Instead, I simulated a different ResourceQuota failure (insufficient quota) by changing the resource requests of the agent pod, and used that to reproduce and test.

  1. Change this value to an impossible value. https://github.com/argoproj/argo-workflows/blob/0dfecd6e3a18c7bb884000e1e98d8305440d8d49/workflow/controller/agent.go#L180
  2. Create and run a workflow with an HTTP template; creation of the agent pod will fail.
  3. The workflow fails.

If you have another way to reproduce the transient error, that works too.

fyp711 commented 1 month ago

@shuangkun cc

shuangkun commented 1 month ago

okay