Closed nzlosh closed 6 years ago
Yeah, we typically have some workflows "stuck" in the RUNNING state even according to mistral logs it succeeded. Feels like either callback to st2 doesn't get to it or st2 gives up polling state of the task. Not sure how execution is being controlled by st2.
@nzlosh I'll be trying to repro it with 2.3dev this week. Meanwhile, could you give us the following info?
st2 2.1.1
mistral task-get 0986cc88-4696-4890-8b74-0cc9c0b552b9
+---------------+------------------------------------------------------+
| Field | Value |
+---------------+------------------------------------------------------+
| ID | 0986cc88-4696-4890-8b74-0cc9c0b552b9 |
| Name | check_host_health |
| Workflow name | st2dm_upgrade_pkg.post_upgrade_pkg.check_environment |
| Execution ID | 1417305b-288d-40f7-b7fc-324ba897fd3a |
| State | RUNNING |
| State info | None |
+---------------+------------------------------------------------------+
mistral execution-get 1417305b-288d-40f7-b7fc-324ba897fd3a
+-------------------+------------------------------------------------------+
| Field | Value |
+-------------------+------------------------------------------------------+
| ID | 1417305b-288d-40f7-b7fc-324ba897fd3a |
| Workflow ID | 9e4b67cd-b869-46b2-91c6-3d98552fc7af |
| Workflow name | st2dm_upgrade_pkg.post_upgrade_pkg.check_environment |
| Description | |
| Task Execution ID | <none> |
| State | ERROR |
| State info | list index out of range |
| Created at | 2017-02-24 16:43:05.524806 |
| Updated at | 2017-02-24 16:43:07.100467 |
+-------------------+------------------------------------------------------+
I noticed that it happens when I have issues with mistral workers, similar to this:
[2017-02-23 15:31:41 +0000] [26755] [INFO] Starting gunicorn 19.6.0
[2017-02-23 15:31:41 +0000] [26755] [INFO] Listening at: http://127.0.0.1:8989 (26755)
[2017-02-23 15:31:41 +0000] [26755] [INFO] Using worker: sync
[2017-02-23 15:31:41 +0000] [26773] [INFO] Booting worker with pid: 26773
[2017-02-23 15:31:41 +0000] [26776] [INFO] Booting worker with pid: 26776
[2017-02-24 09:42:18 +0000] [26755] [CRITICAL] WORKER TIMEOUT (pid:26776)
[2017-02-24 09:42:18 +0000] [26755] [CRITICAL] WORKER TIMEOUT (pid:26773)
[2017-02-24 09:42:18 +0000] [26773] [INFO] Worker exiting (pid: 26773)
[2017-02-24 09:42:18 +0000] [26776] [INFO] Worker exiting (pid: 26776)
[2017-02-24 09:42:18 +0000] [18953] [INFO] Booting worker with pid: 18953
[2017-02-24 09:42:18 +0000] [18954] [INFO] Booting worker with pid: 18954
[2017-02-24 09:42:51 +0000] [26755] [CRITICAL] WORKER TIMEOUT (pid:18954)
[2017-02-24 09:42:51 +0000] [18954] [INFO] Worker exiting (pid: 18954)
[2017-02-24 09:42:51 +0000] [19010] [INFO] Booting worker with pid: 19010
I enabled debug, made some other tweaks to postgresql and gunicorn command line arguments, no issues ever since. Watching the logs, once it happens, hoping to have more debug.
st2 version: 2.1.1
Soon gonna upgrade to the latest.
Today it happened once. Oddly enough, no critical or any errors in the mistral-api.log. In the history of that workflow it successfully finished a task and never proceeded to the next one.
In every log I can see that it posted successful result of the task, but can't find who or what should've schedule the next one. Where should I be looking for it?
Upon further investigation I discovered that the workflow continued to be executed by the mistral server.
And it failed to handle action execution result for one of the next tasks. I am gonna post it here since it doesn't provide any explanation other than traceback:
2017-02-28 12:42:10.498 14854 ERROR mistral.engine.default_engine [-] Failed to handle action execution result [id=5b3a3ab7-8b7e-4b83-a2aa-a62dc24feacd]:
Traceback (most recent call last):
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 261, in on_action_complete
self._on_task_state_change(task_ex, wf_ex, wf_spec)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 178, in _on_task_state_change
cmds = wf_ctrl.continue_workflow()
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/workflow/base.py", line 116, in continue_workflow
return self._find_next_commands(env=env)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/workflow/direct_workflow.py", line 82, in _find_next_commands
cmds.extend(self._find_next_commands_for_task(t_ex))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/workflow/direct_workflow.py", line 106, in _find_next_commands_for_task
for t_n in self._find_next_task_names(task_ex):
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/workflow/direct_workflow.py", line 207, in _find_next_task_names
ctx
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/workflow/direct_workflow.py", line 231, in _find_next_task_names_for_clause
if not condition or expr.evaluate(condition, ctx)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/expressions.py", line 180, in evaluate
return _EVALUATOR.evaluate(expression, context)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/expressions.py", line 146, in evaluate
cls).evaluate(trim_expr, data_context)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/expressions.py", line 90, in evaluate
context=yaql_utils.get_yaql_context(data_context)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 165, in evaluate
return self(utils.NO_VALUE, context, self.engine)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 158, in __call__
six.reraise(type(e.wrapped), e.wrapped, sys.exc_info()[2])
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 156, in __call__
return super(Statement, self).__call__(receiver, context, engine)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 37, in __call__
return context(self.name, engine, receiver, context)(*self.args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/contexts.py", line 65, in <lambda>
data_context, use_convention, function_filter)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 49, in call
name, all_overloads, engine, receiver, data_context, args, kwargs)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 117, in choose_overload
args = tuple(arg_evaluator(i, arg) for i, arg in enumerate(args))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 117, in <genexpr>
args = tuple(arg_evaluator(i, arg) for i, arg in enumerate(args))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 113, in <lambda>
and not isinstance(arg, expressions.Constant))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 37, in __call__
return context(self.name, engine, receiver, context)(*self.args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/contexts.py", line 65, in <lambda>
data_context, use_convention, function_filter)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 49, in call
name, all_overloads, engine, receiver, data_context, args, kwargs)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 117, in choose_overload
args = tuple(arg_evaluator(i, arg) for i, arg in enumerate(args))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 117, in <genexpr>
args = tuple(arg_evaluator(i, arg) for i, arg in enumerate(args))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 113, in <lambda>
and not isinstance(arg, expressions.Constant))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 37, in __call__
return context(self.name, engine, receiver, context)(*self.args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/contexts.py", line 65, in <lambda>
data_context, use_convention, function_filter)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 51, in call
result = delegate()
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 142, in <lambda>
return lambda: delegate()
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/specs.py", line 341, in func
six.iteritems(keyword_args)))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/standard_library/system.py", line 34, in op_dot
return expr(receiver)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/yaqltypes.py", line 269, in func
engine, args, kwargs)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/yaqltypes.py", line 236, in _call
result = value(receiver, context, engine)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/expressions.py", line 37, in __call__
return context(self.name, engine, receiver, context)(*self.args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/contexts.py", line 65, in <lambda>
data_context, use_convention, function_filter)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 58, in call
sys.exc_info()[2])
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 51, in call
result = delegate()
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/runner.py", line 142, in <lambda>
return lambda: delegate()
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/language/specs.py", line 341, in func
six.iteritems(keyword_args)))
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/yaql/standard_library/queries.py", line 227, in first
return six.next(iter(collection))
StopIteration
2017-02-28 12:42:10.535 14854 INFO workflow_trace [-] Execution of workflow 'st2microservices.deploy' [RUNNING -> ERROR] (execution_id=fdfc6230-abb4-4b60-8e7d-5d9f9ae938f0)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server [-] Exception during message handling
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/rpc.py", line 166, in on_action_complete
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server return self._engine.on_action_complete(action_ex_id, result)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/utils/__init__.py", line 112, in _logged
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server return func(*args, **kw)
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 275, in on_action_complete
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server raise e
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server StopIteration
2017-02-28 12:42:10.542 14854 ERROR oslo_messaging.rpc.server
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor [-] Exception occurred when calling engine on_action_complete [action_ex_id=5b3a3ab7-8b7e-4b83-a2aa-a62dc24feacd, action_cls='<class 'mistral.actions.action_factory.NoOpAction'>', attributes='{}', params='{}']
StopIteration:
Traceback (most recent call last):
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
res = self.dispatcher.dispatch(message)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
return self._do_dispatch(endpoint, method, ctxt, args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
result = func(ctxt, **new_args)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/rpc.py", line 166, in on_action_complete
return self._engine.on_action_complete(action_ex_id, result)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/utils/__init__.py", line 112, in _logged
return func(*args, **kw)
File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 275, in on_action_complete
raise e
StopIteration
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor Traceback (most recent call last):
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_executor.py", line 96, in run_action
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor self._engine_client.on_action_complete(action_ex_id, result)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/rpc.py", line 284, in decorator
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor _wrap_exception_and_reraise(e)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/rpc.py", line 259, in _wrap_exception_and_reraise
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor raise exc.MistralException(message)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor MistralException: StopIteration:
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor Traceback (most recent call last):
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor res = self.dispatcher.dispatch(message)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor return self._do_dispatch(endpoint, method, ctxt, args)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor result = func(ctxt, **new_args)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/rpc.py", line 166, in on_action_complete
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor return self._engine.on_action_complete(action_ex_id, result)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/utils/__init__.py", line 112, in _logged
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor return func(*args, **kw)
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 275, in on_action_complete
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor raise e
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor StopIteration
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
2017-02-28 12:42:10.588 14854 ERROR mistral.engine.default_executor
Any ideas what component failed? Any possible reasons for such errors?
@nzlosh I couldn't repro the bug with 2.3dev which sets the state to ERROR and also the execution status in st2 gets set to failed correctly. https://gist.github.com/lakshmi-kannan/dd6aabc80dd864f71fe9d81eeda5a38c. You might be running into the same issue as emptywee. There have been some fixes to the mistral side but I cannot be certain if his issue has been fixed. I'll spend some more time on this tomorrow but meanwhile can you edit the mistral-basic example workflow to what I have and see if the bug shows up?
@emptywee are the errors you posted related to list index out of range
or are they all errors related to tasks remaining in RUNNING
state?
@lakshmi-kannan the use case posted in gist isn't quite the same as the one I posted. Here is a simple workflow to reproduce the error
---
name: yaql_out_of_range
pack: test
description: "Provoke a list index out of range error"
runner_type: mistral-v2
entry_point: workflow/yaql_out_of_range.yaml
enabled: True
parameters:
list_param:
type: array
default: []
description: An empty list.
version: '2.0'
test.yaql_out_of_range:
description: test yaql list index out of range.
type: direct
input:
- list_param
tasks:
read_consul:
action: core.local
input:
cmd: "echo <% list_param %>"
publish:
bad_yaql_ref: <% $.list_param[0][1] %>
on-success: good
on-error: bad
good:
action: core.noop
bad:
action: core.noop
@nzlosh I tested with your workflow (thank you for a repro case) and I couldn't repro. See https://gist.github.com/lakshmi-kannan/287bb6dcdf0316e913ba72d49f42587d. The workflow fails and st2 shows status as failed. The YAQL error doesn't bubble up which is sad. I thought we fixed this but looks like NO. I also posted the output of mistral execution-get. FWIW, I am running 2.3dev which is almost same as 2.2.
@nzlosh nope, my workflows typically just get stuck randomly. I provided the only evidence of something abnormal. Waiting for the stackstorm guys to look into traceback and give me a clue which component failed and because of what possible reason.
@lakshmi-kannan if st2 is show the task as failed, that seems like the correct behaviour, this would suggest that task doesn't remain in a stuck state. I've just upgraded production to 2.2 so run the test case against 2.2 and let you know.
I upgraded st2 and now get and error instead of the task remaining in RUNNING.
st2 2.3dev
However the error, isn't related to the index out of range, like your gist output 287bb6dcdf0316e913ba72d49f42587d.
I get the following error
"error": "HTTPConnectionPool(host='127.0.0.1', port=8989): Max retries exceeded with url: /v2/workflows (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f9213443dd0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))"
Just hit the same condition, good news is "I can reproduce it now". Will keep researching. cc @m4dcoder
Just want to mention that Ron from SQA here also ran into this issue.
Mistral callbacks, etc have changed a lot in the last 12 months. Going to close this since there's no current, consistent way of reproducing it.
A task was marked as RUNNING despite extra information being available indicating an error occurred.
The task that was called is shown below and the error is referring to mistral being unable to publish the variable because YAQL detected the list index out of range.: