StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Orquesta: Potential Race Condition in Nested Inquiries #4865

Open trstruth opened 4 years ago

trstruth commented 4 years ago

SUMMARY

There appears to be a race condition in the way orquesta handles unpausing of a parent workflow when a child workflow has an inquiry which receives a response. After responding to the inquiry in the child workflow, the parent workflow can occasionally get "stuck" in a paused or a running state.

STACKSTORM VERSION

st2 3.1.0, on Python 2.7.6

OS, environment, install method

clean st2-docker installation

Steps to reproduce the problem

For a minimal reproduction we require 2 workflows, as there needs to be a parent-child relationship. The parent workflow has only 1 task, whose action is the child workflow. The child workflow also has only 1 task, which is an inquiry. The task transition logic in the child workflow is such that if a false response is sent to the inquiry, the child workflow will fail, causing the parent workflow to succeed. This is the expected behavior. The table below shows the states of the parent workflow, the child workflow, and the child inquiry when the expected behavior is exhibited:

| Parent | Child | Child Inquiry |
| --- | --- | --- |
| requested | | |
| scheduled | | |
| running | | |
| | requested | |
| | scheduled | |
| | running | |
| | | requested |
| | | scheduled |
| | | running |
| | | pending |
| | pausing | |
| | paused | |
| pausing | | |
| paused | | |
| INQUIRY RESPONSE HERE | | |
| | | succeeded |
| | resuming | |
| | running | |
| | succeeded | |
| resuming | | |
| running | | |
| succeeded | | |
However, if the inquiry response arrives before the parent workflow can enter a paused state, unexpected behavior can result. Consider the following:

| Parent | Child | Child Inquiry |
| --- | --- | --- |
| requested | | |
| scheduled | | |
| running | | |
| | requested | |
| | scheduled | |
| | running | |
| | | requested |
| | | scheduled |
| | | running |
| | | pending |
| INQUIRY RESPONSE HERE | | |
| | | succeeded |
| | pausing | |
| | paused | |
| pausing | | |
| paused | | |

Here is a detailed set of steps you can follow to reproduce the issue:

  1. Follow the instructions in the st2-docker repository to create a clean deployment of StackStorm.
  2. Install my repro pack (the `bug` pack) into the deployment (probably easiest to clone the repo into the packs.dev directory).
  3. Register the newly installed pack.
  4. First, we will run the workflow to see the expected behavior (the behavior shown in the first table above). On the StackStorm command line:
    
    root@e2c8427e5731:~# st2 run bug.pause-parent
    .
    id: 5e4493ba0cdfc002b6689669
    action.ref: bug.pause-parent
    parameters: None
    status: paused
    start_timestamp: Thu, 13 Feb 2020 00:09:30 UTC
    end_timestamp:
    result:
    output: null
    +-----------------------------+---------+-------+-----------------+-------------------------------+
    | id                          | status  | task  | action          | start_timestamp               |
    +-----------------------------+---------+-------+-----------------+-------------------------------+
    | + 5e4493ba0cdfc0003a644e0d  | paused  | start | bug.pause-child | Thu, 13 Feb 2020 00:09:30 UTC |
    |    5e4493bb0cdfc0003a644e10 | pending | wait  | core.ask        | Thu, 13 Feb 2020 00:09:31 UTC |
    +-----------------------------+---------+-------+-----------------+-------------------------------+
    root@e2c8427e5731:~# st2 inquiry respond 5e4493bb0cdfc0003a644e10
    custom_inner_pause (boolean): False

    Response accepted for inquiry 5e4493bb0cdfc0003a644e10.
    root@e2c8427e5731:~# st2 execution get 5e4493ba0cdfc002b6689669
    id: 5e4493ba0cdfc002b6689669
    action.ref: bug.pause-parent
    parameters: None
    status: succeeded (14s elapsed)
    start_timestamp: Thu, 13 Feb 2020 00:09:30 UTC
    end_timestamp: Thu, 13 Feb 2020 00:09:44 UTC
    result:
    output: null
    +-----------------------------+-------------------------+-------+-----------------+-------------------------------+
    | id                          | status                  | task  | action          | start_timestamp               |
    +-----------------------------+-------------------------+-------+-----------------+-------------------------------+
    | + 5e4493ba0cdfc0003a644e0d  | succeeded (14s elapsed) | start | bug.pause-child | Thu, 13 Feb 2020 00:09:30 UTC |
    |    5e4493bb0cdfc0003a644e10 | succeeded (13s elapsed) | wait  | core.ask        | Thu, 13 Feb 2020 00:09:31 UTC |
    +-----------------------------+-------------------------+-------+-----------------+-------------------------------+

As you can see, the workflow succeeded.  *This is the desired behavior*
5. Now we will run a script that spam-replies to inquiries, in order to trigger the race condition described above (the behavior shown in the second table above). The script is in the `bin` directory of the bug pack. Run it in the background (note that it will attempt to respond to every inquiry, so do not run it in a StackStorm environment where you have inquiries you want left unanswered). A rough sketch of what this responder does is included after these steps.

    root@e2c8427e5731:/opt/stackstorm/packs.dev/bug/bin# ./run.sh
    ./run.sh: line 3: ./autoresponder.py: No such file or directory
    /opt/stackstorm/st2/local/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
      utils.PersistentlyDeprecated2018,

This will hang, waiting for inquiries it can respond to.
6.  In a new window, run the repro workflow again.  The script should instantly respond to the new inquiry, and the repro workflow will be in a paused state, as opposed to succeeded:

    root@e2c8427e5731:/# st2 run bug.pause-parent
    ..
    id: 5e4496b00cdfc002b668966c
    action.ref: bug.pause-parent
    parameters: None
    status: paused
    start_timestamp: Thu, 13 Feb 2020 00:22:08 UTC
    end_timestamp:
    result:
    output: null
    +-----------------------------+------------------------+-------+-----------------+-------------------------------+
    | id                          | status                 | task  | action          | start_timestamp               |
    +-----------------------------+------------------------+-------+-----------------+-------------------------------+
    | + 5e4496b00cdfc0003a644e13  | paused                 | start | bug.pause-child | Thu, 13 Feb 2020 00:22:08 UTC |
    |    5e4496b10cdfc0003a644e16 | succeeded (0s elapsed) | wait  | core.ask        | Thu, 13 Feb 2020 00:22:09 UTC |
    +-----------------------------+------------------------+-------+-----------------+-------------------------------+
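
For reference, the auto-responder from step 5 amounts to something like the sketch below. This is not the actual `autoresponder.py` from the pack; it is a minimal illustration that assumes the st2 API is reachable on localhost:9101 and that a valid token is available in an `ST2_AUTH_TOKEN` environment variable. It answers every open inquiry with `custom_inner_pause: False`, so it should only be run against a disposable environment.

```python
#!/usr/bin/env python
# Minimal sketch of an inquiry auto-responder (illustrative only, not the
# pack's autoresponder.py). Assumes st2api on localhost:9101 and a token in
# the ST2_AUTH_TOKEN environment variable.
import os
import time

import requests

API = "http://localhost:9101/api/v1"
HEADERS = {"X-Auth-Token": os.environ["ST2_AUTH_TOKEN"]}

while True:
    # List all open inquiries and immediately answer each one with False,
    # which drives the child workflow down its failure transition.
    for inquiry in requests.get(API + "/inquiries", headers=HEADERS).json():
        requests.put(
            API + "/inquiries/" + inquiry["id"],
            json={"id": inquiry["id"], "response": {"custom_inner_pause": False}},
            headers=HEADERS,
        )
    time.sleep(0.1)
```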


## Expected Results
The workflow should always succeed, no matter the order in which the inquiry receives a response.

## Actual Results
The workflow gets "stuck" in a paused state. I recognize that this appears to be a contrived example, but as the complexity of the workflows/nesting grows, this issue becomes more and more prevalent. We see behavior like this nearly daily in some of our production workflows. We have built hacky workarounds to deal with this behavior, but I believe it needs to be root-caused. This is a minimal repro, but I hope it's illustrative.

Please let me know if there are any issues with the workflows/scripts, or if any further clarification is needed.
trstruth commented 4 years ago

By following the steps outlined above and watching the workflowengine logs, I've extracted snippets which I think illustrate the race condition described above. I've removed the timestamps for readability; links to the complete logs of both runs are at the bottom. Comments denoted by # are my own.

The workflow behaving as expected (no instant inquiry response script):

INFO workflows [-] [5e4b1b0da72c7d00306e5709] No tasks identified to execute next.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Handling pending of action execution "5e4b1b0ea72c7d00306e570c" for task "wait", route "0".
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Handling pause of action execution "5e4b1b0da72c7d00306e5709" for task "start", route "0".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Updating workflow execution from status "running" to "paused".    # start pauses before inquiry gets a response
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "resuming" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Processing request for workflow execution.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Identifying next set (0) of tasks for workflow execution in status "resuming".

The bugged behavior (inquiry response landing before parent workflow can unpause):

INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling pending of action execution "5e4b1bf0a72c7d00306e5712" for task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling completion of action execution "5e4b1bf0a72c7d00306e5712" in status "succeeded" for task "wait", route "0".    # wait succeeds before task start goes into paused state
INFO workflows [-] [5e4b1beea72c7d014d981807] Action execution "5e4b1befa72c7d00306e570f" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1beea72c7d014d981807] Handling pause of action execution "5e4b1befa72c7d00306e570f" for task "start", route "0".
INFO workflows [-] [5e4b1beea72c7d014d981807] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1beea72c7d014d981807] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "succeeded" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Identifying next set (0) of tasks after completion of task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.

@m4dcoder these logs seem consistent with my theory that a race exists. Could a potential solution be to "lock" a workflow and its children until the pausing/unpausing states propagate completely?
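
To make that idea concrete, the kind of hierarchy-level lock being suggested could look roughly like the sketch below. This is purely illustrative and not a proposed patch: it assumes a tooz coordination backend (st2's coordination service is built on tooz), and `propagate_states` is a hypothetical callable standing in for the actual pause/resume propagation logic.

```python
# Illustrative sketch only: serialize pause/resume propagation per workflow
# hierarchy with a distributed lock. The helper names are hypothetical.
from tooz import coordination

# In st2 this would come from the configured coordination backend.
coordinator = coordination.get_coordinator("redis://localhost:6379", b"workflow-engine-1")
coordinator.start()


def handle_status_change(root_execution_id, propagate_states):
    # Take a lock keyed on the root workflow execution so that an inquiry
    # response cannot interleave with a pause that is still propagating
    # through the parent/child chain.
    lock_name = ("wf-hierarchy-%s" % root_execution_id).encode("utf-8")
    with coordinator.get_lock(lock_name):
        propagate_states()
```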

Complete logs:
- correct: https://gist.github.com/trstruth/25647306f41ad5bb7893c6a851cf8325
- incorrect: https://gist.github.com/trstruth/aacce99d7314658b12fa2e7c5ae5a8c1

i-cooltea commented 4 years ago

Hi, @trstruth
I have a problem similar to yours. Can you describe your solution?

trstruth commented 4 years ago

@i-cooltea I don't know of an existing solution to this problem; maybe @m4dcoder or @armab could weigh in?

i-cooltea commented 4 years ago

@trstruth

I mean what you said earlier: "We have built hacky workarounds to deal with this behavior."

DorAbu commented 2 years ago

We found a solution.

We resolved the problem by checking the parent workflow of the inquiry, waiting until it reaches a stable state, and only then submitting the inquiry response.

Using the inquiry ID, we fetch the inquiry's execution via the https://api.stackstorm.com/api/v1/executions/#/action_executions_controller.get_one API call and read its "parent" attribute to get the parent execution ID. We then fetch the parent with the same get_one call and read its "status" attribute, waiting until the parent reaches a stable state before proceeding to submit the inquiry response.
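
A minimal sketch of that workaround is below. It assumes the st2 API is reachable on localhost:9101 with a token in `ST2_AUTH_TOKEN`, and it treats paused/succeeded/failed/canceled/timeout as "stable" parent states (the exact set used by DorAbu is not specified). The helper name is our own, not an st2 API.

```python
# Rough sketch of the client-side workaround: do not answer an inquiry until
# its parent workflow execution has settled into a stable state.
import os
import time

import requests

API = "http://localhost:9101/api/v1"
HEADERS = {"X-Auth-Token": os.environ["ST2_AUTH_TOKEN"]}
STABLE = {"paused", "succeeded", "failed", "canceled", "timeout"}  # assumed set


def respond_when_parent_stable(inquiry_id, response, poll_interval=1.0):
    # The inquiry id is also the id of its core.ask action execution, so fetch
    # that execution to find the parent workflow execution.
    execution = requests.get(API + "/executions/" + inquiry_id, headers=HEADERS).json()
    parent_id = execution.get("parent")

    # Wait until the parent has finished propagating its pause before answering.
    while parent_id:
        parent = requests.get(API + "/executions/" + parent_id, headers=HEADERS).json()
        if parent["status"] in STABLE:
            break
        time.sleep(poll_interval)

    requests.put(
        API + "/inquiries/" + inquiry_id,
        json={"id": inquiry_id, "response": response},
        headers=HEADERS,
    )


# Example: answer the repro inquiry only after its parent is stable.
# respond_when_parent_stable("5e4493bb0cdfc0003a644e10", {"custom_inner_pause": False})
```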

@armab @m4dcoder

This should be fixed in the platform itself by adding this check before the inquiry response is processed in this endpoint: https://api.stackstorm.com/api/v1/inquiries/#/inquiries_controller.put.