argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Set cap on retryStrategy backoff #13772

Open eiriklid opened 4 weeks ago

eiriklid commented 4 weeks ago

Summary

I want to be able to set a cap on the backoff time in exponential backoff, avoiding potentially very long backoff times.

Use Cases

When would you use this? I want quick retries in the beginning, but without the exponential backoff growing into very long delays later on. The Pod Backoff Failure Policy from Kubernetes would be ideal:

... Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.
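To make the requested behaviour concrete, here is a minimal sketch of a capped exponential backoff that reproduces the sequence from the Kubernetes quote above. It is only an illustration: `capped_backoff` and its `base`, `factor`, and `cap` parameters are hypothetical names, not existing Argo or Kubernetes APIs, and jitter is left out.

```python
from itertools import islice
from typing import Iterator


def capped_backoff(base: float, factor: float, cap: float) -> Iterator[float]:
    """Yield retry delays in seconds: base, base*factor, ... never above cap.

    Hypothetical helper for illustration only; not part of Argo or Kubernetes.
    """
    delay = base
    while True:
        yield min(delay, cap)
        # Clamp the stored delay too, so it stops growing once the cap is hit.
        delay = min(delay * factor, cap)


# 10s base, doubling each retry, capped at six minutes (360s).
delays = list(islice(capped_backoff(base=10, factor=2, cap=360), 8))
print(delays)  # [10, 20, 40, 80, 160, 320, 360, 360]
```

Without the cap, the same settings would already be waiting 640s on the seventh retry and 1280s on the eighth, which is the kind of very long backoff this request is trying to avoid.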

Original discussion with @agilgur5 and @Joibel in #13584.


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

agilgur5 commented 4 weeks ago

Original discussion with @agilgur5 and @Joibel in #13584.

Copying my response there for ease of access and as a potential solution:

Not according to the documentation.

Yea, my bad, you're correct. I keep mixing up what maxDuration means and that specific doc was recently updated actually: https://github.com/argoproj/argo-workflows/pull/13068. Now that I'm remembering, I called out this specific field as super unclear in feedback when I beta tested the CNCF Argo Certification back in January 😅 (there was a question specifically on it and I advocated to remove that question)

Argo's backoff also inherits from k8s apimachinery's wait package

In another layer of confusion, the Argo Backoff spec actually differs from apimachinery's Backoff apparently. In apimachinery, Backoff.Cap sounds like what you're looking for, literally. Looking at the blames, it's possible that Cap didn't exist at the time Argo's Backoff was created: https://github.com/kubernetes/apimachinery/commit/e52d7e07dd281a5895a88fcd785cd52899ce72b3 / https://github.com/kubernetes/kubernetes/pull/71088 was a year before https://github.com/argoproj/argo-workflows/pull/1782, but using an older k8s version might've prevented usage.

Would you like to file a feature request for this? You can reference this discussion and the above.