Open tczhao opened 5 months ago
code that determines how retryPolicy is used https://github.com/argoproj/argo-workflows/blob/v3.5.7/workflow/controller/operator.go#L1060-L1077
thoughts:
to make this backward compatible, the retryPolicy field has to be a string.
having OnErrorTransientError
OnFailureTransientError
seems to be an easy fix, looking for opinions
I wonder if we should make the retryPolicy
variants available as keywords in the expression
instead? That would support any combination then. Could also potentially make retryPolicy
itself no longer needed.
Also cc @Joibel who's worked on the two before (#11005) and may have an opinion
OnError
is a strict superset of OnTransientError
. So if you specify OnError
you will get all the retries you'd get with OnTransientError
and more, so there is never a need to use both strategies at the same time.
If you needed OnFailure
+ OnTransientError
then perhaps an isTransient() helper to put in the expression would be useful.
Fixing #13058 seems like it would help the most though.
I think what we want is
OnError
+ some patterns (could come from node Error
or Failure
)
The current expression
is not enough since its applied as a filter before evaluating retryPolicy.
fixing https://github.com/argoproj/argo-workflows/issues/13058 is not enough for this usecase e.g.
I think we could have keywords/parameters,
in the expression, this should be sufficient for all usecase
I think we could have keywords/parameters,
Yea that's more or less what I was suggesting earlier
I think what we want is
OnError
+ some patterns (could come from nodeError
orFailure
)The current
expression
is not enough since its applied as a filter before evaluating retryPolicy.fixing #13058 is not enough for this usecase e.g.
- retryAlways + expression (need to list out all possible onError related failure)
- retryOnError + expression (can not support node Failure situation)
- retryOnFailure + expression (can not support node Error situation)
- retryOnTransientError + expression (need to list out all possible onError related failure)
Some corrections fixing https://github.com/argoproj/argo-workflows/issues/13058 is enough for this use case e.g. retryAlways + expression(lastRetry.status=="Errored" or custom-pattern)
If you needed OnFailure + OnTransientError then perhaps an isTransient() helper to put in the expression would be useful.
This is correct. This is the only edge case where the current implementation is not able to support
Hi @Joibel since
OnError is a strict superset of OnTransientError
Could we include TRANSIENT_ERROR_PATTERN
as part of OnError
as well?
Summary
What change needs making?
We want to use both
OnError
andOnTransientError
https://argo-workflows.readthedocs.io/en/stable/retries/#retry-policies But only one type is supported at the moment.We have workarounds
workaround 1
OnError
+ expressionbut we are facing issue https://github.com/argoproj/argo-workflows/issues/13058
workaround 2
OnTransientError
+ patterns fromOnError
issue: takes time to identify all argo internal error, and every error is a step back in SLA
Use Cases
When would you use this? To retry on both argo internal error
OnError
and transient errorOnTransientError
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.