StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.05k stars 745 forks source link

Inquiries should not be responded when root workflow is still in pausing state #4135

Open shusugmt opened 6 years ago

shusugmt commented 6 years ago
ISSUE TYPE
STACKSTORM VERSION
st2 2.7.2, on Python 2.7
OS / ENVIRONMENT / INSTALL METHOD
SUMMARY

I'm facing the issue that in the mistral workflow there is core.ask action, and when I respond that with st2 inquiry respond too fast, the parent workflow state is sometimes still in pausing state (that means it is still transitioning from running to pause), and it fails to resume the parent root workflow saying ERROR: 500 Server Error: Internal Server Error but the inquiry itself is already in success state and you can't respond it again to resume the parent entire workflow.

The workaround is, wait 10 sec before you respond. I figured it out that this seems tp be just random. Sometimes the state of root workflow is set to paused properly, sometimes not and stuck in pausing. see #4138

If you face the above error, you can just st2 execution resume to resume the parent workflow, but this apparently is not the desired situation.

Relevant code: https://github.com/StackStorm/st2/blob/master/st2api/st2api/controllers/exp/inquiries.py#L174 https://github.com/StackStorm/st2/blob/master/st2common/st2common/services/action.py#L320

Here's the log: you can see that LiveAction id 5b02dcfc069c4301d1f654ca was in pausing state when the inquiry respond was trying to resume it.

2018-05-21 14:53:31,110 139685573276848 AUDIT auth [-] Token with id "5b02cee5069c4301cdba1881" is validated.
2018-05-21 14:53:31,189 139685573276848 INFO loader [-] Reusing runner module "/opt/stackstorm/runners/inquirer_runner/inquirer_runner/inquirer_runner.py" from cache.
2018-05-21 14:53:31,198 139685573276848 INFO loader [-] Reusing callback module "/opt/stackstorm/runners/mistral_v2/callback/mistral_v2.py" from cache.
2018-05-21 14:53:31,198 139685573276848 INFO mistral_v2 [-] Sending callback to http://0.0.0.0:8989/v2/action_executions/ac37214c-2c2a-489b-82f4-4de74e5665d6 with data {'output': '{"users": [], "roles": [], "route": "", "ttl": 1440, "response": {"continue": true}, "schema": {"type": "object", "properties": {"continue": {"required": true, "type": "boolean", "description": "Would you like to continue the workflow?"}}, "title": "response_data"}}', 'state': 'SUCCESS'}.
2018-05-21 14:53:31,309 139685573276848 ERROR router [-] Failed to call controller function "put" for operation "st2api.controllers.exp.inquiries:inquiries_controller.put": Unable to resume liveaction "5b02dcfc069c4301d1f654ca" because it is not in a paused state.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/router.py", line 470, in __call__
    resp = func(**kw)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2api/controllers/exp/inquiries.py", line 176, in put
    requester_user
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/services/action.py", line 323, in request_resume
    % liveaction.id
UnexpectedActionExecutionStatusError: Unable to resume liveaction "5b02dcfc069c4301d1f654ca" because it is not in a paused state.
2018-05-21 14:53:31,309 139685573276848 ERROR error_handling [-] API call failed: Unable to resume liveaction "5b02dcfc069c4301d1f654ca" because it is not in a paused state.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/middleware/error_handling.py", line 47, in __call__
    return self.app(environ, start_response)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/middleware/streaming.py", line 48, in __call__
    return self.app(environ, start_response)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/router.py", line 519, in as_wsgi
    resp = self(req)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/router.py", line 474, in __call__
    raise e
UnexpectedActionExecutionStatusError: Unable to resume liveaction "5b02dcfc069c4301d1f654ca" because it is not in a paused state. (_exception_data={},_exception_class='UnexpectedActionExecutionStatusError',_exception_message='Unable to resume liveaction "5b02dcfc069c4301d1f654ca" because it is not in a paused state.')
2018-05-21 14:53:31,310 139685573276848 INFO logging [-] 128c4d50-cf1d-433a-bd07-3c26cdfa5b2a - 500 46 202.513ms (content_length=46,request_id='128c4d50-cf1d-433a-bd07-3c26cdfa5b2a',runtime=202.513,remote_addr='127.0.0.1',status=500,method='PUT',path='/exp/inquiries/5b02dd2e069c4301d1f654ea')
2018-05-21 14:53:31,407 139685573276848 INFO logging [-] 52742bd7-8e82-4519-9f69-8e9756d9e519 - PUT /v1/executions/5b02dcfc069c4301d1f654ce with query={} (remote_addr='127.0.0.1',method='PUT',request_id='52742bd7-8e82-4519-9f69-8e9756d9e519',query={},path='/v1/executions/5b02dcfc069c4301d1f654ce')
2018-05-21 14:53:31,410 139685573276848 AUDIT auth [-] Token with id "5b02dcfd069c43014e79ecf4" is validated.
2018-05-21 14:53:31,457 139685573276848 INFO logging [-] 52742bd7-8e82-4519-9f69-8e9756d9e519 - 200 8779 49.281ms (content_length=8779,request_id='52742bd7-8e82-4519-9f69-8e9756d9e519',runtime=49.281,remote_addr='127.0.0.1',status=200,method='PUT',path='/v1/executions/5b02dcfc069c4301d1f654ce')
2018-05-21 14:53:31,489 139685573276848 INFO logging [-] 59ea2c74-7b6d-43a1-aad0-98ed27190391 - PUT /v1/executions/5b02dcfc069c4301d1f654cb with query={} (remote_addr='127.0.0.1',method='PUT',request_id='59ea2c74-7b6d-43a1-aad0-98ed27190391',query={},path='/v1/executions/5b02dcfc069c4301d1f654cb')
2018-05-21 14:53:31,491 139685573276848 AUDIT auth [-] Token with id "5b02dcfc069c430140a1b0e4" is validated.
2018-05-21 14:53:31,577 139685573276848 AUDIT action [-] The status of action execution is changed from pausing to paused. <LiveAction.id=5b02dcfc069c4301d1f654ca, ActionExecution.id=5b02dcfc069c4301d1f654cb> ...
kaishuu0123 commented 6 years ago

fix pseudo code

parent = liveaction_db.context.get("parent")
if parent:
  parent_execution = ActionExecution.get(id=parent['execution_id'])
  parent_liveaction = LiveAction.get(id=parent_execution.liveaction['id'])
  # Check parent liveaction status
  if parent_liveaction.status != action_constants.LIVEACTION_STATUS_PAUSED:
    # RFC: better error message
    abort(http_client.BAD_REQUEST, 'Unable to resume liveaction "%s" because it is not in a paused state.' % parent_liveaction.id)
shusugmt commented 6 years ago

@kaishuu0123 This won't work since liveaction_db is not accessible where the line this check logic should be brought into. I'm looking at here before https://github.com/StackStorm/st2/blob/master/st2api/st2api/controllers/exp/inquiries.py#L135

dbillor commented 6 years ago

Seeing same issue are there any good workarounds currently? Would be of great help as we're seeing this happen across our wfs.

Noticed that wf will still be running and this is consistently reproduce able on workflows that were left pending/paused at the inquiry state for longer than 4 days.

Update turns out reason they arent resuming is due to the inquiry default timeout of 24 hours i will confirm it is working after timeout is changed

stale[bot] commented 5 years ago

Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking is as stale. If this issue is not relevant or applicable anymore (problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment to this issue.