argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.87k stars 3.17k forks source link

retryStrategy expression doesn't work for status.phase "Error" #10980

Closed tico24 closed 1 year ago

tico24 commented 1 year ago

Pre-requisites

What happened/what you expected to happen?

When a workflow node errors and it has a retryStrategy expression looking for lastRetry.status, the retryStrategy is not envoked.

This workflow runs as expected. It is retried 3 times becuse the workflow node fails:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: designed-to-fail-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
spec:
  entrypoint: test
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Failed"
    container:
      image: ubuntu
      command:
        - sh
        - -c
        - |
          exit 64

image

However, if you run the following workflow, wait for the node to be running in the UI and then kill the pod using kubectl. We see that the node is marked as Errored, but the retryStrategy did not kick in, the node was not retried:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: designed-to-error-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
spec:
  entrypoint: test
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Error"
    container:
      image: ubuntu
      command:
        - sh
        - -c
        - |
          sleep 5000000

image

Version

v3.4.5 > latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

see above

Logs from the workflow controller

time="2023-04-25T09:45:23.589Z" level=info msg="Processing workflow" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg="Task-result reconciliation" namespace=ci numObjs=0 workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg="node changed" namespace=ci new.message= new.phase=Running new.progress=0/1 nodeID=designed-to-error-plflr-1299940749 old.message= old.phase=Pending old.progress=0/1 workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg="node designed-to-error-plflr message: retryStrategy.expression evaluated to false" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg="TaskSet Reconciliation" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg=reconcileAgentPod namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.590Z" level=info msg="Workflow to be dehydrated" Workflow Size=1862
time="2023-04-25T09:45:23.600Z" level=info msg="Update workflows 200"
time="2023-04-25T09:45:23.600Z" level=info msg="Workflow update successful" namespace=ci phase=Running resourceVersion=64560163 workflow=designed-to-error-plflr
time="2023-04-25T09:45:23.606Z" level=info msg="Create events 201"
time="2023-04-25T09:45:25.036Z" level=info msg="Get leases 200"
time="2023-04-25T09:45:25.042Z" level=info msg="Update leases 200"
time="2023-04-25T09:45:30.050Z" level=info msg="Get leases 200"
time="2023-04-25T09:45:30.054Z" level=info msg="Update leases 200"
time="2023-04-25T09:45:30.470Z" level=info msg="Enforcing history limit for 'git-backup'" namespace=ci workflow=git-backup
time="2023-04-25T09:45:30.470Z" level=info msg="Enforcing history limit for 'renovate-application-definitions'" namespace=ci workflow=renovate-application-definitions
time="2023-04-25T09:45:30.470Z" level=info msg="Enforcing history limit for 'renovate-cloud-infra'" namespace=ci workflow=renovate-cloud-infra
time="2023-04-25T09:45:30.470Z" level=info msg="Enforcing history limit for 'tf-nightly'" namespace=ci workflow=tf-nightly
time="2023-04-25T09:45:33.600Z" level=info msg="Processing workflow" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Task-result reconciliation" namespace=ci numObjs=0 workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Workflow pod is missing" namespace=ci nodeName="designed-to-error-plflr(0)" nodePhase=Running recentlyStarted=false workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr-1299940749 phase Running -> Error" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr-1299940749 message: pod deleted" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr-1299940749 finished: 2023-04-25 09:45:33.601482457 +0000 UTC" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Node not set to be retried after status: Error" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr phase Running -> Error" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr message: pod deleted" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="node designed-to-error-plflr finished: 2023-04-25 09:45:33.601633504 +0000 UTC" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="TaskSet Reconciliation" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg=reconcileAgentPod namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Updated phase Running -> Error" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Updated message  -> pod deleted" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Marking workflow completed" namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Checking daemoned children of " namespace=ci workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.601Z" level=info msg="Workflow to be dehydrated" Workflow Size=1931
time="2023-04-25T09:45:33.607Z" level=info msg="cleaning up pod" action=deletePod key=ci/designed-to-error-plflr-1340600742-agent/deletePod
time="2023-04-25T09:45:33.610Z" level=info msg="Update workflows 200"
time="2023-04-25T09:45:33.610Z" level=info msg="Workflow update successful" namespace=ci phase=Error resourceVersion=64560223 workflow=designed-to-error-plflr
time="2023-04-25T09:45:33.610Z" level=info msg="Create events 201"
time="2023-04-25T09:45:33.611Z" level=info msg="Queueing Error workflow ci/designed-to-error-plflr for delete in 1h0m0s due to TTL"
time="2023-04-25T09:45:33.612Z" level=info msg="Delete pods 404"
time="2023-04-25T09:45:33.614Z" level=info msg="Create events 201"
time="2023-04-25T09:45:33.626Z" level=info msg="Create events 201"
time="2023-04-25T09:45:33.631Z" level=info msg="DeleteCollection workflowtaskresults 200"
time="2023-04-25T09:45:35.062Z" level=info msg="Get leases 200"
time="2023-04-25T09:45:35.067Z" level=info msg="Update leases 200"

Logs from in your workflow's wait container

n/a - it has to be deleted to trigger the error.
Joibel commented 1 year ago

The default retryPolicy is OnFailure. Setting this to Always does what you'd wish.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: designed-to-error-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
spec:
  entrypoint: test
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Error"
      retryPolicy: Always
    container:
      image: ubuntu
      command:
        - sh
        - -c
        - |
          sleep 5000000

I think the presence of an expression should override retryPolicy so that the result of the expression evaluation is all that matters for retries. Unless someone disagrees soon, I'll write a PR for that.

maxsxu commented 1 year ago

@Joibel I think the eventual retry behavior should be the AND result of expression and retryPolicy. which means we should have the following rules:

Let's take a more obvious example: what's the expected behavior for the below Workflow?

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: designed-to-fail-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
spec:
  entrypoint: test 
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Failed"
      retryPolicy: OnError
    container:
      image: ubuntu
      command:
        - sh
        - -c
        - |
          exit 64

The current actual behavior is it won't retry and I think it's expected.


So back to the original issue, If we want to retry when lastRetry.status == "Error", so we can just set retryPlicy to OnError or Always.

tico24 commented 1 year ago

@maxsxu I disagree. I would expect retryPolicy to be ignored in the presence of an expression because the expression is more 'expressive'. This is the natural understanding of at least one other company (who I raised the issue on behalf of).

Joibel commented 1 year ago

The workflow tico24 posted isn't obviously never going to retry. By reading it you'd assume it will behave well. Perhaps I'd accept that in the presence of both expression and and explicit retryPolicy we may never retry, but if I read

spec:
  entrypoint: test
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Error"
    container:
      image: ubuntu
      command:
        - sh
        - -c
        - |
          sleep 5000000

I really do NOT expect that to be absolutely identical to having told the system never to retry. We need the language to be clear, and right now it isn't.

If I was able to change everything I would forbid having both expression and retryPolicy in the same retryStrategy. I was proposing making retryPolicy do nothing, but perhaps a lesser change is to make it not do anything unless explicitly set.

Ideally @maxsxu's example

spec:
  entrypoint: test 
  templates:
  - name: test
    retryStrategy:
      limit: "3"
      expression: lastRetry.status == "Failed"
      retryPolicy: OnError

would be invalid by that, but as second best and so as not to break existing workflows it would explain to you that you've just written something that would never retry, but in the general case that's not possible.

So how about making the rule that, in the presence of an expression and NO explicit retryPolicy then the retryPolicy is effectively Always (e.g. the expression makes the decision). That fixes the original complaint and is even less of a change.

maxsxu commented 1 year ago

@Joibel Thanks for your thoughtful options.

So how about making the rule that, in the presence of an expression and NO explicit retryPolicy then the retryPolicy is effectively Always (e.g. the expression makes the decision). That fixes the original complaint and is even less of a change.

I'm OK with that. i.e, changing the default retryPolicy to Always.

@simster7 Would also like to know your thoughts as you're the original author.

Joibel commented 1 year ago

For clarity, I'm not proposing changing the default policy to be Always in the absence of an expression. In pseudocode my proposal is:

if expression == "" {
  defaultPolicy = OnFailure
} else {
  defaultPolicy = Always
}

In order to maintain compatibility with existing behaviour.

maxsxu commented 1 year ago

@Joibel I see, more accurately, I suppose you're proposing following. Did I catch you right?

if retryPolicy == "" {
  if expression == "" {
    defaultPolicy = OnFailure
  } else {
    defaultPolicy = Always
  }
}
Joibel commented 1 year ago

Yes, that's what I'm proposing.

maxsxu commented 1 year ago

Yes, that's what I'm proposing.

Thanks for clarifying, It makes sense.

I've no objection to this. Let's see what other people think about it.

mostaphaRoudsari commented 1 year ago

As someone who tried to use this feature and only based on the documentation without checking the source code, my expectation was that if an expression is provided it should always be considered regardless of the retryPolicy. In other words, it should overwrite the retryPolicy.

My reasoning is that the expression is the more advanced/customizable option that provides all the options that are provided by the retryPolicy. Having both at the same time can create conflicting cases. Why would one still need the retryPolicy if they set up an expression?

That said, I understand that you need to keep this as is for backward compatibility.