Open dan-katz opened 3 years ago
If the Argo team thinks this is a good idea/feasible but that it's not high priority I'm happy to take a crack at it myself
I had a use case where we only want to retry transient errors and the support has been added in https://github.com/argoproj/argo-workflows/pull/4999.
I think this is also a similar and a valid use case. Others please confirm.
In the meantime, it would be great to think about whether a flag would be a good option and how this relates to the existing OnTransientError
policy.
@terrytangyuan As per code, this PR #4999 changes will not hit in real scenario. All TransientError will make node status to pending
and it should not take retry limit. Am I missing anything ?
if errorsutil.IsTransientErr(err) {
// Error was most likely caused by a lack of resources.
// In this case, Workflow will be in pending state and requeue.
woc.markWorkflowPhase(ctx, wfv1.WorkflowPending, fmt.Sprintf("Waiting for a PVC to be created. %v", err))
woc.requeue()
return
}
@terrytangyuan for your use case do you want to retry transient errors a set number of times or would it be ok to always retry them? I can also see how changing the transient error behavior to always retry might break some users' existing usage of onTransientError
@sarabala1979 Looks like the PR might have missed some local commits I had in our internal fork and the current logic does not seem correct. Unfortunately I lost access to that fork so I will have to take a fresh look when I get a chance. Thanks for pointing it out!
@dan-katz Yes, we do want to retry transient errors a fixed number of times, especially if the cluster is under high load we will highly likely fail again so it's better to stop retrying.
@dan-katz @terrytangyuan Let me separate the discussion.
@dan-katz In your scenario, For Transient Error should count in retry. It should work with the current code. as long as if it is fit in below transient errors. Please make sure your env is set correctly.
https://github.com/argoproj/argo-workflows/blob/4af01131889a48989db0c251b8d9711e19ca3325/util/errors/errors.go#L16
@terrytangyuan Can you create a separate issue for onTransientError for Retry.
Summary
Add a flag to
RetryStrategy
to not count errors defined as transient (those defined in the codebase and those that match theTRANSIENT_ERROR_PATTERN
environment variable) against the retry limit. I could also see never counting transient errors against the retry limit being the way to go insteadUse Cases
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.