Closed trstruth closed 4 years ago
In order to jump-start some discussion:
I've spent a couple of hours this morning reading through the orquesta source and inspecting the modules in a repl. It seems like the first step would be to modify the TaskSpec
to take a retry key, then modify the composer to either generate an "unrolled" series of nodes that represent a potential retry, or make the conductor aware of the state of how many retries have been executed thus far. Do these approaches sound reasonable?
In reference to the README:
The criteria for task transition is an attribute of the edge.
Should retry be a task transition to itself?
The workflow conductor traverses the graph, directs the flow of the workflow execution, and tracks runtime state of the execution.
Should a "retry counter" be an attribute of the runtime state?
Our current thinking is to put the retry in the task transition model. For example, when a task fails (or some other condition is met), then retry the task execution.
next:
  - when: <% failed() %>
    retry: ...
In orquesta, the condition for task transition allows for flexibility (e.g. core.http succeeded but returned a 401 status code). So if we put the retry in the task model, the orquesta conductor will not know under what condition or state it should honor the retries.
next:
  - when: <% succeeded() and result().status_code = 401 %>
    retry: ...
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
On implementation (only thinking through this briefly), the retry should be evaluated by the conductor in the update_task_flow
method at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L562. If a retry is decided, the task should be returned by get_next_tasks
at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L479. In the st2workflowengine, we want the retry to result in a new request (and thus a new DB record) for the task execution.
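To make the proposal concrete, here is a minimal Python sketch of the bookkeeping a conductor could use to decide whether a failed task should be re-queued. All names here are hypothetical; the real conducting module has different classes and signatures.

```python
# Hypothetical sketch of conductor-side retry bookkeeping; not the
# actual orquesta API.

class RetryTracker:
    def __init__(self, retry_spec):
        # retry_spec example: {"count": 3, "delay": 5}
        self.max_count = retry_spec.get("count", 0)
        self.delay = retry_spec.get("delay", 0)
        self.attempts = 0

    def should_retry(self, task_status):
        # Only retry failed tasks that still have attempts left.
        return task_status == "failed" and self.attempts < self.max_count

    def record_attempt(self):
        self.attempts += 1


def update_task_state(tracker, task_status, next_tasks):
    """Toy stand-in for the update_task_flow/get_next_tasks idea:
    if a retry is decided, the same task is appended to next_tasks
    so the engine issues a new request (and a new DB record) for it."""
    if tracker.should_retry(task_status):
        tracker.record_attempt()
        next_tasks.append({"task": "same_task", "delay": tracker.delay})
        return True
    return False
```

With `{"count": 2, "delay": 1}` the tracker re-queues the task twice on failure and then lets the failure propagate to normal transition handling.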
+1
@trstruth How are you doing with this? Are you still planning to contribute this feature to orquesta?
@m4dcoder have not gotten any time to write code for this thus far due to work, but I'm still interested in contributing it when I do. When are you guys planning the release for it?
@trstruth We are near the deadline for st2 v3.0, so this feature would be a good target for the next release (3.1) or the one after (3 months window). If you'd like to contribute, we want to help you be successful, so let us know how we can help you.
From @m4dcoder
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
I would suggest the retry function also allows for some sort of backoff or back pressure. Maybe this overlaps with policies.
Consider the following use case:
I have a list of computers to patch. I then fire off say, a Tanium action to get all my computers to a baseline. Trouble is, there are 750 computers. While I can fire off the action to patch 750 computers in one action run, I need to continue checking the status of the patch process over time. A few things make the retry slightly more complex.
So the following might be useful:
retry:
  delay: number of seconds before retry
  count: number of times to retry
  max_time_retry: number of seconds or minutes after which to stop retrying if we are still failing (e.g. two weeks)
  wait_until_exponential: how many 'normal' retries before we go exponential
  wait_exponential_multiplier: number of ms to increase wait time exponentially
  wait_exponential_max: 10000
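As a rough illustration of how these knobs might interact (the field names simply mirror the proposal above; none of this exists in orquesta), the wait before a given retry attempt could be computed like this:

```python
def compute_retry_delay(attempt, delay, wait_until_exponential,
                        wait_exponential_multiplier, wait_exponential_max):
    """Return the wait in milliseconds before retry attempt `attempt` (1-based).

    Early attempts use the fixed delay (given in seconds); after
    wait_until_exponential 'normal' retries, the wait doubles each
    attempt starting from wait_exponential_multiplier, capped at
    wait_exponential_max. Field names follow the proposal above and
    are illustrative only.
    """
    if attempt <= wait_until_exponential:
        return delay * 1000  # fixed delay, converted to ms
    exponent = attempt - wait_until_exponential
    return min(wait_exponential_multiplier * (2 ** exponent),
               wait_exponential_max)
```

With `delay=2`, `wait_until_exponential=3`, `wait_exponential_multiplier=100`, and `wait_exponential_max=10000`, the first three retries wait 2000 ms each, the fourth waits 200 ms, and later attempts grow toward the 10000 ms cap.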
I've used this library in the past for other shorter lived actions with pxGrid. There are some other interesting options, like random retry delays.
I really like the idea of being able to detect if we get an http 401, but are looking for an http 200 and retry; or other functional variations.
After further discussion with @m4dcoder I'm proposing we move the retry out of the TaskTransitionSpec and into the TaskSpec. When I think of a retry, it is an attribute of the task itself. After the retries are exhausted, then we concern ourselves with which task we want to transition to. We can do this without losing the conditional functionality that was originally proposed. Here's an example task written in this style.
my_task:
  action: core.foo
  retry:
    count: 3
  next:
    - when: <% failed() %>
      do: fail
The retry spec can support something like:
retry:
  when: condition for retry
  delay: number of seconds before retry
  count: number of times to retry
This means we can say something along the lines of:
my_task:
  action: core.foo
  retry:
    when: <% failed() and result().status_code = 401 %>
    count: 3
  next:
    - when: <% failed() %>
      do: fail
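One way to read these semantics (a toy sketch, not orquesta internals): the retry `when` condition is evaluated against each task result first, and only once it stops matching or the retries are exhausted does the final result fall through to the normal `next` transitions.

```python
def run_task_with_retry(run_once, retry_when, retry_count):
    """Toy semantics for the proposed task-level retry.

    run_once() executes the action and returns a result dict;
    retry_when(result) is the task-level retry condition; after
    retry_count retries are exhausted, the last result is returned
    for normal 'next' transition handling. Names are illustrative.
    """
    attempts = 0
    result = run_once()
    while retry_when(result) and attempts < retry_count:
        attempts += 1
        result = run_once()
    return result, attempts
```

For example, simulating a core.foo-style action that returns a 401 twice and then a 200, a retry condition of "failed with status 401" and a count of 3 would retry twice and then hand the successful result to the transitions.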
@punkrokk I like the backoff idea, but I'm probably going to start with a smaller scoped implementation. After I finish that maybe we can revisit your idea.
@trstruth Thank you for identifying the use cases, issues with the original proposal, and proposing the alternative solution. I'm looking forward to the PR.
For background information, in our discussion, @trstruth highlighted issues about multiple retries defined in more than one task transition, how to handle them in the conductor, and how to fail the workflow (or do other logging/cleanup/remediation steps) after retries are exhausted.
Thanks for the feedback. I could open a separate issue if you like.
Cheers,
JP
We also look forward to this feature becoming available.
@trstruth Is this under development?
Working on adding the retry syntax to the TaskSpec.
I've taken what @trstruth and @m4dcoder described in their last messages and think it's a great start.
My only question is, should the count property have a special value of 0 or -1 that means "retry forever"?
I'm ok with having -1 mean retry forever and 0 mean no retry, in case someone wants to pass a Jinja/YAQL expression that decides not to retry. When we integrate this back into st2, we need to make sure there's a test case for cancelling a task that is retrying forever.
Just putting this up for consideration - I think retry forever is a bad pattern. The concept of "infinite retries" probably shouldn't actually be captured by a retry, but rather baked into the workflow logic itself. Couldn't one simply create a task transition that recursively references the same task to create the equivalent of an infinite retry?
I'm not a fan of using count: 0 or count: -1 to indicate "no retry" or "retry forever", because that forces users to go look up documentation on what those special values mean. This makes workflow definitions more opaque to users as they read through them.
Instead of using special values, I would simply use the string none or a null value to indicate disabling retry, and the string forever or the string infinity to indicate infinite retry.
And while I would tend to agree with @trstruth that retry forever is a bad workflow pattern, I would expect some of our users will require that.
And retries can (and for some of our internal workflows, already are) be implemented in the workflow logic itself:
vars:
  - do_thing_retry_delay: 10 # seconds
  - do_thing_count: 5
  - do_thing_iteration: 0

tasks:
  do_thing:
    action: ...
    next:
      # try again
      - when: <% failed() and ctx().do_thing_iteration < ctx().do_thing_count %>
        publish:
          - do_thing_iteration: <% ctx().do_thing_iteration + 1 %>
        do: sleep_and_try_again
      - when: <% failed() and ctx().do_thing_iteration >= ctx().do_thing_count %>
        do: fail_entire_workflow
      - when: <% succeeded() %>
        do: next_thing

  sleep_and_try_again:
    action: core.local
    input:
      command: sleep <% ctx().do_thing_retry_delay %>
    next:
      - do: do_thing
That retry logic is difficult to parse and not straightforward.
Another problem is that the execution history is expanded every time the retry task is executed:
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # succeeded
* next_thing
At least in Mistral*, when retry is used explicitly the failed tasks don't appear in the execution history:
* do_thing # failed, failed, succeeded
* next_thing
* I'm not saying we have to implement things exactly like Mistral.
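For comparison, the growth of the hand-rolled approach's history can be simulated. This is a toy model of the example above, not actual orquesta behavior:

```python
def history_for_manual_retry(num_failures):
    """Simulate the execution history produced by the hand-rolled retry
    loop above: each failure adds a do_thing record plus a
    sleep_and_try_again record, followed by the final success and the
    next_thing task, so the history holds 2*num_failures + 2 records."""
    history = []
    for _ in range(num_failures):
        history.append("do_thing # failed")
        history.append("sleep_and_try_again # retry")
    history.append("do_thing # succeeded")
    history.append("next_thing")
    return history
```

A built-in retry that collapses attempts into a single task record would keep the history at two entries regardless of how many failures occurred.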
Would it make sense to have a global setting that times out a "forever" retry after x minutes?
Cheers,
JP
SUMMARY
When orquesta workflow tasks fail, there should be a built-in syntax for retrying. Today our production workflows use a combination of decrementing and then publishing a retry counter var before routing back to the task. Retry is listed as an upcoming feature in the documentation and I'm interested in helping with its implementation.
ISSUE TYPE