Closed trstruth closed 4 years ago
In order to jump-start some discussion:
I've spent a couple of hours this morning reading through the orquesta source and inspecting the modules in a repl. It seems like the first step would be to modify the TaskSpec
to take a retry key, then modify the composer to either generate an "unrolled" series of nodes that represent a potential retry, or make the conductor aware of the state of how many retries have been executed thus far. Do these approaches sound reasonable?
In reference to the README:
The criteria for task transition is an attribute of the edge.
Should retry be a task transition to itself?
The workflow conductor traverses the graph, directs the flow of the workflow execution, and tracks runtime state of the execution.
Should a "retry counter" be an attribute of the runtime state?
Our current thinking is to put the retry in the task transition model. For example, when a task fails (or some other condition is met), then retry the task execution.
next:
  - when: <% failed() %>
    retry: ...
In orquesta, the condition for task transition allows for flexibility (e.g. core.http succeeded but returned a 401 status code). So if we put the retry in the task model, the orquesta conductor will not know under what condition or state it should honor the retries.
next:
  - when: <% succeeded() and result().status_code = 401 %>
    retry: ...
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
On implementation (only thinking through this briefly), the retry should be evaluated by the conductor in the update_task_flow
method at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L562. If a retry is decided, the task should be returned by get_next_tasks
at https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L479. In the st2workflowengine, we want the retry to result in a new request (and thus a new DB record) for the task execution.
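To make the proposal concrete, here is a minimal Python sketch of the bookkeeping a conductor could use to decide whether a failed task should be re-queued. All names here are hypothetical; the real conducting module has different classes and signatures.

```python
# Hypothetical sketch of conductor-side retry bookkeeping; not the
# actual orquesta API.

class RetryTracker:
    def __init__(self, retry_spec):
        # retry_spec example: {"count": 3, "delay": 5}
        self.max_count = retry_spec.get("count", 0)
        self.delay = retry_spec.get("delay", 0)
        self.attempts = 0

    def should_retry(self, task_status):
        # Only retry failed tasks that still have attempts left.
        return task_status == "failed" and self.attempts < self.max_count

    def record_attempt(self):
        self.attempts += 1


def update_task_state(tracker, task_status, next_tasks):
    """Toy stand-in for the update_task_flow/get_next_tasks idea:
    if a retry is decided, the same task is appended to next_tasks
    so the engine issues a new request (and a new DB record) for it."""
    if tracker.should_retry(task_status):
        tracker.record_attempt()
        next_tasks.append({"task": "same_task", "delay": tracker.delay})
        return True
    return False
```

With `{"count": 2, "delay": 1}` the tracker re-queues the task twice on failure and then lets the failure propagate to normal transition handling.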
+1
@trstruth How are you doing with this? Are you still planning to contribute this feature to orquesta?
@m4dcoder have not gotten any time to write code for this thus far due to work, but I'm still interested in contributing it when I do. When are you guys planning the release for it?
@trstruth We are near the deadline for st2 v3.0, so this feature would be a good target for the next release (3.1) or the one after (3 months window). If you'd like to contribute, we want to help you be successful, so let us know how we can help you.
From @m4dcoder
We can start the retry model with the following.
retry:
  delay: number of seconds before retry
  count: number of times to retry
I would suggest the retry function also allows for some sort of backoff or back pressure. Maybe this overlaps with policies.
Consider the following use case:
I have a list of computers to patch. I then fire off say, a Tanium action to get all my computers to a baseline. Trouble is, there are 750 computers. While I can fire off the action to patch 750 computers in one action run, I need to continue checking the status of the patch process over time. A few things make the retry slightly more complex.
So the following might be useful:
retry:
  delay: number of seconds before retry
  count: number of times to retry
  max_time_retry: number of seconds or minutes after which to stop retrying if we are still failing (e.g. two weeks)
  wait_until_exponential: how many 'normal' retries before we go exponential
  wait_exponential_multiplier: number of ms to increase wait time exponentially
  wait_exponential_max: 10000
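As a rough illustration of how these knobs might interact (the field names simply mirror the proposal above; none of this exists in orquesta), the wait before a given retry attempt could be computed like this:

```python
def compute_retry_delay(attempt, delay, wait_until_exponential,
                        wait_exponential_multiplier, wait_exponential_max):
    """Return the wait in milliseconds before retry attempt `attempt` (1-based).

    Early attempts use the fixed delay (given in seconds); after
    wait_until_exponential 'normal' retries, the wait doubles each
    attempt starting from wait_exponential_multiplier, capped at
    wait_exponential_max. Field names follow the proposal above and
    are illustrative only.
    """
    if attempt <= wait_until_exponential:
        return delay * 1000  # fixed delay, converted to ms
    exponent = attempt - wait_until_exponential
    return min(wait_exponential_multiplier * (2 ** exponent),
               wait_exponential_max)
```

With `delay=2`, `wait_until_exponential=3`, `wait_exponential_multiplier=100`, and `wait_exponential_max=10000`, the first three retries wait 2000 ms each, the fourth waits 200 ms, and later attempts grow toward the 10000 ms cap.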
I've used this library in the past for other shorter lived actions with pxGrid. There are some other interesting options, like random retry delays.
I really like the idea of being able to detect if we get an http 401, but are looking for an http 200 and retry; or other functional variations.
After further discussion with @m4dcoder I'm proposing we move the retry out of the TaskTransitionSpec and into the TaskSpec. When I think of a retry, it is an attribute of the task itself. After the retries are exhausted, then we concern ourselves with which task we want to transition to. We can do this without losing the conditional functionality that was originally proposed. Here's an example task written in this style.
my_task:
  action: core.foo
  retry:
    count: 3
  next:
    - when: <% failed() %>
      do: fail
The retry spec can support something like:
retry:
  when: condition for retry
  delay: number of seconds before retry
  count: number of times to retry
This means we can say something along the lines of:
my_task:
  action: core.foo
  retry:
    when: <% failed() and result().status_code = 401 %>
    count: 3
  next:
    - when: <% failed() %>
      do: fail
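One way to read these semantics (a toy sketch, not orquesta internals): the retry `when` condition is evaluated against each task result first, and only once it stops matching or the retries are exhausted does the final result fall through to the normal `next` transitions.

```python
def run_task_with_retry(run_once, retry_when, retry_count):
    """Toy semantics for the proposed task-level retry.

    run_once() executes the action and returns a result dict;
    retry_when(result) is the task-level retry condition; after
    retry_count retries are exhausted, the last result is returned
    for normal 'next' transition handling. Names are illustrative.
    """
    attempts = 0
    result = run_once()
    while retry_when(result) and attempts < retry_count:
        attempts += 1
        result = run_once()
    return result, attempts
```

For example, simulating a core.foo-style action that returns a 401 twice and then a 200, a retry condition of "failed with status 401" and a count of 3 would retry twice and then hand the successful result to the transitions.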
@punkrokk I like the backoff idea, but I'm probably going to start with a smaller scoped implementation. After I finish that maybe we can revisit your idea.
@trstruth Thank you for identifying the use cases, issues with the original proposal, and proposing the alternative solution. I'm looking forward to the PR.
For background information, in our discussion, @trstruth highlighted issues about multiple retries defined in more than one task transition, how to handle them in the conductor, and how to fail the workflow (or do other logging/cleanup/remediation steps) after retries are exhausted.
Thanks for the feedback. I could open a separate issue if you like.
Cheers,
JP
We also look forward to this feature becoming available.
@trstruth Is this under development?
Working on adding the retry syntax to the TaskSpec.
I've taken what @trstruth and @m4dcoder described in their last messages and think it's a great start.
My only question is, should the count property have a special value of 0 or -1 that means "retry forever"?
I'm ok with having -1 mean retry forever and 0 mean no retry, in case someone wants to pass a Jinja/YAQL expression that decides not to retry. When we integrate this back into st2, we need to make sure there's a test case for cancelling a task that is retrying forever.
Just putting this up for consideration - I think retry forever is a bad pattern. The concept of "infinite retries" probably shouldn't actually be captured by a retry, but rather baked into the workflow logic itself. Couldn't one simply create a task transition that recursively references the same task to create the equivalent of an infinite retry?
I'm not a fan of using count: 0 or count: -1 to indicate "no retry" or "retry forever", because that forces users to go look up documentation on what those special values mean. This makes workflow definitions more opaque to users as they read through them.
Instead of using special values, I would simply use the string none or a null value to indicate disabling retry, and the string forever or the string infinity to indicate infinite retry.
And while I would tend to agree with @trstruth that retry forever is a bad workflow pattern, I would expect some of our users will require that.
And retries can (and for some of our internal workflows, already are) be implemented in the workflow logic itself:
vars:
  - do_thing_retry_delay: 10 # seconds
  - do_thing_count: 5
  - do_thing_iteration: 0

tasks:
  do_thing:
    action: ...
    next:
      # try again
      - when: <% failed() and ctx().do_thing_iteration < ctx().do_thing_count %>
        publish:
          - do_thing_iteration: <% ctx().do_thing_iteration + 1 %>
        do: sleep_and_try_again
      - when: <% failed() and ctx().do_thing_iteration >= ctx().do_thing_count %>
        do: fail_entire_workflow
      - when: <% succeeded() %>
        do: next_thing

  sleep_and_try_again:
    action: core.local
    input:
      command: sleep <% ctx().do_thing_retry_delay %>
    next:
      - do: do_thing
That retry logic is difficult to parse and not straightforward.
Another problem is that the execution history is expanded every time the retry task is executed:
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # failed
* sleep_and_try_again # retry
* do_thing # succeeded
* next_thing
At least in Mistral*, when retry is used explicitly the failed tasks don't appear in the execution history:
* do_thing # failed, failed, succeeded
* next_thing
* I'm not saying we have to implement things exactly like Mistral.
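For comparison, the growth of the hand-rolled approach's history can be simulated. This is a toy model of the example above, not actual orquesta behavior:

```python
def history_for_manual_retry(num_failures):
    """Simulate the execution history produced by the hand-rolled retry
    loop above: each failure adds a do_thing record plus a
    sleep_and_try_again record, followed by the final success and the
    next_thing task, so the history holds 2*num_failures + 2 records."""
    history = []
    for _ in range(num_failures):
        history.append("do_thing # failed")
        history.append("sleep_and_try_again # retry")
    history.append("do_thing # succeeded")
    history.append("next_thing")
    return history
```

A built-in retry that collapses attempts into a single task record would keep the history at two entries regardless of how many failures occurred.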
Would it make sense to have a global setting that times out a "forever" retry after x minutes?
Cheers,
JP
SUMMARY
When orquesta workflow tasks fail, there should be a built-in syntax for retrying. Today our production workflows use a combination of decrementing and then publishing a retry counter var before routing back to the task. Retry is listed as an upcoming feature in the documentation and I'm interested in helping with its implementation.
ISSUE TYPE