argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.88k stars 3.17k forks source link

support `retryPolicy` combination #13119

Open tczhao opened 3 months ago

tczhao commented 3 months ago

Summary

What change needs making?

We want to use both OnError and OnTransientError https://argo-workflows.readthedocs.io/en/stable/retries/#retry-policies But only one type is supported at the moment.

We have workarounds

workaround 1 OnError + expression

  retryStrategy:
    retryPolicy: OnError
    expression: >-
      lastRetry.message contains 'Pod was terminated in response to imminent node shutdown' 

but we are facing issue https://github.com/argoproj/argo-workflows/issues/13058

workaround 2 OnTransientError + patterns from OnError

  retryStrategy:
    retryPolicy: OnTransientError
# ...
    env:
      - name: TRANSIENT_ERROR_PATTERN
        value: "(pod_deleted|post workflowtaskresults)"

issue: takes time to identify all argo internal error, and every error is a step back in SLA

Use Cases

When would you use this? To retry on both argo internal error OnError and transient error OnTransientError


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

tczhao commented 3 months ago

code that determines how retryPolicy is used https://github.com/argoproj/argo-workflows/blob/v3.5.7/workflow/controller/operator.go#L1060-L1077

thoughts: to make this backward compatible, the retryPolicy field has to be a string. having OnErrorTransientError OnFailureTransientError seems to be an easy fix, looking for opinions

agilgur5 commented 3 months ago

I wonder if we should make the retryPolicy variants available as keywords in the expression instead? That would support any combination then. Could also potentially make retryPolicy itself no longer needed.

Also cc @Joibel who's worked on the two before (#11005) and may have an opinion

Joibel commented 3 months ago

OnError is a strict superset of OnTransientError. So if you specify OnError you will get all the retries you'd get with OnTransientError and more, so there is never a need to use both strategies at the same time.

If you needed OnFailure + OnTransientError then perhaps an isTransient() helper to put in the expression would be useful.

Fixing #13058 seems like it would help the most though.

tczhao commented 3 months ago

I think what we want is OnError + some patterns (could come from node Error or Failure)


The current expression is not enough since its applied as a filter before evaluating retryPolicy.

fixing https://github.com/argoproj/argo-workflows/issues/13058 is not enough for this usecase e.g.

tczhao commented 3 months ago

I think we could have keywords/parameters,

in the expression, this should be sufficient for all usecase

agilgur5 commented 3 months ago

I think we could have keywords/parameters,

Yea that's more or less what I was suggesting earlier

tczhao commented 3 months ago

I think what we want is OnError + some patterns (could come from node Error or Failure)

The current expression is not enough since its applied as a filter before evaluating retryPolicy.

fixing #13058 is not enough for this usecase e.g.

  • retryAlways + expression (need to list out all possible onError related failure)
  • retryOnError + expression (can not support node Failure situation)
  • retryOnFailure + expression (can not support node Error situation)
  • retryOnTransientError + expression (need to list out all possible onError related failure)

Some corrections fixing https://github.com/argoproj/argo-workflows/issues/13058 is enough for this use case e.g. retryAlways + expression(lastRetry.status=="Errored" or custom-pattern)

If you needed OnFailure + OnTransientError then perhaps an isTransient() helper to put in the expression would be useful.

This is correct. This is the only edge case where the current implementation is not able to support

tczhao commented 3 months ago

Hi @Joibel since

OnError is a strict superset of OnTransientError

Could we include TRANSIENT_ERROR_PATTERN as part of OnError as well?