Open trstruth opened 4 years ago
By following the steps outlined above and following the workflowengine logs, I've extracted snippets which I think illustrate the race condition I'm describing above. I've removed the timestamp information for readability but links to the entire log of both can be found at the bottom. Comments denoted by #
are my own.
The workflow behaving as expected (no instant inquiry response script):
INFO workflows [-] [5e4b1b0da72c7d00306e5709] No tasks identified to execute next.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Handling pending of action execution "5e4b1b0ea72c7d00306e570c" for task "wait", route "0".
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Handling pause of action execution "5e4b1b0da72c7d00306e5709" for task "start", route "0".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Updating workflow execution from status "running" to "paused". # start pauses before inquiry gets a response
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "resuming" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Processing request for workflow execution.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Identifying next set (0) of tasks for workflow execution in status "resuming".
The bugged behavior (inquiry response landing before parent workflow can unpause):
INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling pending of action execution "5e4b1bf0a72c7d00306e5712" for task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling completion of action execution "5e4b1bf0a72c7d00306e5712" in status "succeeded" for task "wait", route "0". # wait succeeds before task start goes into paused state
INFO workflows [-] [5e4b1beea72c7d014d981807] Action execution "5e4b1befa72c7d00306e570f" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1beea72c7d014d981807] Handling pause of action execution "5e4b1befa72c7d00306e570f" for task "start", route "0".
INFO workflows [-] [5e4b1beea72c7d014d981807] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1beea72c7d014d981807] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "succeeded" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Identifying next set (0) of tasks after completion of task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.
@m4dcoder these logs seem consistent with my theory of the existence of a race. Could a potential solution be to "lock" a workflow and its children until the pausing/unpausing states propagate completely?
complete logs: correct: https://gist.github.com/trstruth/25647306f41ad5bb7893c6a851cf8325 incorrect: https://gist.github.com/trstruth/aacce99d7314658b12fa2e7c5ae5a8c1
Hi, @trstruth
I have a problem similar to yours,
Can you describe your simple solution?
@i-cooltea I don't know of an existing solution to this problem, maybe @m4dcoder or @armab could weigh in?
@trstruth
I mean what you said that 'We have built hacky workarounds to deal with this behavior '
We found a solution.
We resolved the problem by checking the parent workflow of the inquiry, waiting until it will reach a stable state, and then we initiate the inquiry.
We are getting the parent ID of the inquiry with the inquiry ID and https://api.stackstorm.com/api/v1/executions/#/action_executions_controller.get_one API call, taking the “parent” attribute. Then we are checking if the parent is stable with the parent ID and https://api.stackstorm.com/api/v1/executions/#/action_executions_controller.get_one API call, taking the “status” attribute. waiting until it will reach a stable state and then proceeding to initiate the inquiry.
@armab @m4dcoder
This should be fixed in the infrastructure by adding this mechanism before the inquiries initiation in this endpoint https://api.stackstorm.com/api/v1/inquiries/#/inquiries_controller.put.
SUMMARY
There appears to be a race condition in the way orquesta handles unpausing of a parent workflow when a child workflow has an inquiry which receives a response. After responding to the inquiry in the child workflow, the parent workflow can occasionally get "stuck" in a paused or a running state.
STACKSTORM VERSION
st2 3.1.0, on Python 2.7.6
OS, environment, install method
clean st2-docker installation
Steps to reproduce the problem
Here is a detailed set of steps you can follow to reproduce the issue:
Response accepted for inquiry 5e4493bb0cdfc0003a644e10. root@e2c8427e5731:~# st2 execution get 5e4493ba0cdfc002b6689669 id: 5e4493ba0cdfc002b6689669 action.ref: bug.pause-parent parameters: None status: succeeded (14s elapsed) start_timestamp: Thu, 13 Feb 2020 00:09:30 UTC end_timestamp: Thu, 13 Feb 2020 00:09:44 UTC result: output: null +-----------------------------+-------------------------+-------+-----------------+-------------------------------+ | id | status | task | action | start_timestamp | +-----------------------------+-------------------------+-------+-----------------+-------------------------------+ | + 5e4493ba0cdfc0003a644e0d | succeeded (14s elapsed) | start | bug.pause-child | Thu, 13 Feb 2020 00:09:30 UTC | | 5e4493bb0cdfc0003a644e10 | succeeded (13s elapsed) | wait | core.ask | Thu, 13 Feb 2020 00:09:31 UTC | +-----------------------------+-------------------------+-------+-----------------+-------------------------------+
root@e2c8427e5731:/opt/stackstorm/packs.dev/bug/bin# ./run.sh ./run.sh: line 3: ./autoresponder.py: No such file or directory /opt/stackstorm/st2/local/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible. utils.PersistentlyDeprecated2018,
root@e2c8427e5731:/# st2 run bug.pause-parent .. id: 5e4496b00cdfc002b668966c action.ref: bug.pause-parent parameters: None status: paused start_timestamp: Thu, 13 Feb 2020 00:22:08 UTC end_timestamp: result: output: null +-----------------------------+------------------------+-------+-----------------+-------------------------------+ | id | status | task | action | start_timestamp | +-----------------------------+------------------------+-------+-----------------+-------------------------------+ | + 5e4496b00cdfc0003a644e13 | paused | start | bug.pause-child | Thu, 13 Feb 2020 00:22:08 UTC | | 5e4496b10cdfc0003a644e16 | succeeded (0s elapsed) | wait | core.ask | Thu, 13 Feb 2020 00:22:09 UTC | +-----------------------------+------------------------+-------+-----------------+-------------------------------+