argoproj / argo-workflows

Workflow Engine for Kubernetes

https://argo-workflows.readthedocs.io/

Apache License 2.0

15.08k stars 3.2k forks source link

Allow transient errors to not be counted against the retry limit #6498

Open dan-katz opened 3 years ago

dan-katz commented 3 years ago

Summary

Add a flag to RetryStrategy to not count errors defined as transient (those defined in the codebase and those that match the TRANSIENT_ERROR_PATTERN environment variable) against the retry limit. I could also see never counting transient errors against the retry limit being the way to go instead

Use Cases

If you use preemptible nodes (or the equivalent), retries from pod deletion errors are counted against the limit but you may want to always retry those kinds of errors
If you get an API timeout error you may not want those retries to count against the limit

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

dan-katz commented 3 years ago

If the Argo team thinks this is a good idea/feasible but that it's not high priority I'm happy to take a crack at it myself

terrytangyuan commented 3 years ago

I had a use case where we only want to retry transient errors and the support has been added in https://github.com/argoproj/argo-workflows/pull/4999.

I think this is also a similar and a valid use case. Others please confirm.

In the meantime, it would be great to think about whether a flag would be a good option and how this relates to the existing OnTransientError policy.

sarabala1979 commented 3 years ago

@terrytangyuan As per code, this PR #4999 changes will not hit in real scenario. All TransientError will make node status to pending and it should not take retry limit. Am I missing anything ?

        if errorsutil.IsTransientErr(err) {
            // Error was most likely caused by a lack of resources.
            // In this case, Workflow will be in pending state and requeue.
            woc.markWorkflowPhase(ctx, wfv1.WorkflowPending, fmt.Sprintf("Waiting for a PVC to be created. %v", err))
            woc.requeue()
            return
        }

dan-katz commented 3 years ago

@terrytangyuan for your use case do you want to retry transient errors a set number of times or would it be ok to always retry them? I can also see how changing the transient error behavior to always retry might break some users' existing usage of onTransientError

terrytangyuan commented 3 years ago

@sarabala1979 Looks like the PR might have missed some local commits I had in our internal fork and the current logic does not seem correct. Unfortunately I lost access to that fork so I will have to take a fresh look when I get a chance. Thanks for pointing it out!

@dan-katz Yes, we do want to retry transient errors a fixed number of times, especially if the cluster is under high load we will highly likely fail again so it's better to stop retrying.

sarabala1979 commented 3 years ago

@dan-katz @terrytangyuan Let me separate the discussion.
@dan-katz In your scenario, For Transient Error should count in retry. It should work with the current code. as long as if it is fit in below transient errors. Please make sure your env is set correctly. https://github.com/argoproj/argo-workflows/blob/4af01131889a48989db0c251b8d9711e19ca3325/util/errors/errors.go#L16

@terrytangyuan Can you create a separate issue for onTransientError for Retry.