StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.07k stars 749 forks source link

Scheduler bounces LiveActions between delayed and policy-delayed #4655

Closed nmaludy closed 5 years ago

nmaludy commented 5 years ago
SUMMARY

I noticed this when debugging another problem and wanted to say something in case this behavior wasn't intended.

When there are executions that are delayed by a policy, the st2scheduler is constantly flipping them between policy-delayed and delayed state. Each state change is accumulated in the LiveAction instance and causes fairly large objects, especially if executions are delayed for a long period of time (large backlog).

ISSUE TYPE
STACKSTORM VERSION
$ st2 --version
st2 3.0, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD
CentOS 7.6
Puppet-st2
STEPS TO REPRODUCE

Create the following policy `/opt/stackstorm/packs/default/policies/core_local_concurrency.yaml

pack: default
name: core.local.concurrency
description: Limits the concurrent executions for core.local
enabled: true
resource_ref: core.local
policy_type: action.concurrency
parameters:
  action: delay
  threshold: 10

Run the following script to kick off a bunch of core.local excutions

#!/bin/bash
for NUM in `seq -w 0 250`; do
  st2 action execute core.local cmd='sleep 120' timeout=180
done

Tail the st2scheduler logs

tail -f /var/log/st2/st2scheduler.log
ACTUAL RESULTS
2019-05-01 15:02:59,673 140648388594096 INFO handler [-] Scheduling liveaction: 5cc9ec569387ef586c0c82c3 (queue_item_id=5cc9ec569387ef58ba380e73) (queue_item_id='5cc9ec569387ef58ba380e73',liveaction_id='5cc9ec569387ef586c0c82c3')
2019-05-01 15:02:59,711 140648388594096 INFO policies [-] Applying policy "default.core.local.concurrency" (action.concurrency) for liveaction "5cc9ec569387ef586c0c82c3".
2019-05-01 15:02:59,848 140648388594096 AUDIT action [-] The status of action execution is changed from delayed to policy-delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4> (liveaction_db={'status': 'policy-delayed', 'runner_info': {}, 'workflow_execution': None, 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'action_is_workflow': False, 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'delay': None, 'callback': {}, 'task_execution': None, 'notify': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'action': u'core.local', 'id': '5cc9ec569387ef586c0c82c3', 'end_timestamp': None},action_execution_db={'status': u'policy-delayed', 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'web_url': u'https://stackstorm.domain.tld#/history/5cc9ec569387ef586c0c82c4/general', 'log': [{u'status': u'requested', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 30, 550000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 199000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 370000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 539000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 781000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 435000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 762000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 182000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 475000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 460000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 955000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 537000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 650000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 703000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 919000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 355000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 639000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 76000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 286000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 627000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 774000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 814000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb4139f090>)}], 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'trigger_type': {}, 'runner': {u'runner_module': u'local_shell_command_runner', u'uid': u'runner_type:local-shell-cmd', u'runner_package': u'local_runner', u'enabled': True, u'name': u'local-shell-cmd', u'output_schema': {}, u'runner_parameters': {u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The command will be executed with sudo.'}, u'sudo_password': {u'default': None, u'secret': '********', u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the host.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'cwd': {u'type': u'string', u'description': u'Working directory where the command will be executed in'}}, u'id': u'59b7d5e6a814c02c97e8a96a', u'description': u'A runner to execute local actions as a fixed user.'}, 'trigger_instance': {}, 'rule': {}, 'children': [], 'delay': None, 'task_execution': None, 'trigger': {}, 'parent': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'workflow_execution': None, 'action': {u'notify': {}, u'name': u'local', u'runner_type': u'local-shell-cmd', u'tags': [], u'description': u'Action that executes an arbitrary Linux command on the localhost.', u'enabled': True, u'entry_point': u'', u'metadata_file': u'actions/local.yaml', u'output_schema': {}, u'uid': u'action:core:local', u'parameters': {u'cmd': {u'required': True, u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the local host.'}, u'sudo': {u'immutable': True}}, u'ref': u'core.local', u'id': u'59b7d5e6a814c02c97e8a97f', u'pack': u'core'}, 'liveaction': {u'runner_info': {}, u'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, u'action_is_workflow': False, u'callback': {}, u'action': u'core.local', u'id': u'5cc9ec569387ef586c0c82c3'}, 'id': '5cc9ec569387ef586c0c82c4', 'end_timestamp': None})
2019-05-01 15:02:59,923 140648388594096 INFO action [-] The status of action execution is changed from delayed to policy-delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4>
2019-05-01 15:02:59,924 140648388594096 INFO handler [-] Liveaction (5cc9ec569387ef586c0c82c3) Status Pre-Run: policy-delayed (5cc9ec569387ef58ba380e73) (liveaction_status='policy-delayed',queue_item_id='5cc9ec569387ef58ba380e73',liveaction_id='5cc9ec569387ef586c0c82c3')
2019-05-01 15:02:59,983 140648388594096 AUDIT action [-] The status of action execution is changed from policy-delayed to delayed. <LiveAction.id=5cc9ec569387ef586c0c82c3, ActionExecution.id=5cc9ec569387ef586c0c82c4> (liveaction_db={'status': 'delayed', 'runner_info': {}, 'workflow_execution': None, 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'action_is_workflow': False, 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'delay': None, 'callback': {}, 'task_execution': None, 'notify': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'action': u'core.local', 'id': '5cc9ec569387ef586c0c82c3', 'end_timestamp': None},action_execution_db={'status': u'delayed', 'start_timestamp': '2019-05-01 18:58:30.534541+00:00', 'web_url': u'https://stackstorm.domain.tld/#/history/5cc9ec569387ef586c0c82c4/general', 'log': [{u'status': u'requested', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 30, 550000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 199000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 40, 370000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 539000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 58, 55, 781000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 435000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 14, 762000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 182000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 18, 59, 38, 475000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 460000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 5, 955000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 537000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 0, 34, 650000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 703000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 3, 919000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 355000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 1, 33, 639000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 76000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 3, 286000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 627000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 31, 774000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'policy-delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 814000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}, {u'status': u'delayed', u'timestamp': datetime.datetime(2019, 5, 1, 19, 2, 59, 965000, tzinfo=<bson.tz_util.FixedOffset object at 0x7feb411bc350>)}], 'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, 'trigger_type': {}, 'runner': {u'runner_module': u'local_shell_command_runner', u'uid': u'runner_type:local-shell-cmd', u'runner_package': u'local_runner', u'enabled': True, u'name': u'local-shell-cmd', u'output_schema': {}, u'runner_parameters': {u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The command will be executed with sudo.'}, u'sudo_password': {u'default': None, u'secret': '********', u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the host.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'cwd': {u'type': u'string', u'description': u'Working directory where the command will be executed in'}}, u'id': u'59b7d5e6a814c02c97e8a96a', u'description': u'A runner to execute local actions as a fixed user.'}, 'trigger_instance': {}, 'rule': {}, 'children': [], 'delay': None, 'task_execution': None, 'trigger': {}, 'parent': None, 'result': {}, 'context': {u'user': u'st2admin', u'pack': u'core'}, 'workflow_execution': None, 'action': {u'notify': {}, u'name': u'local', u'runner_type': u'local-shell-cmd', u'tags': [], u'description': u'Action that executes an arbitrary Linux command on the localhost.', u'enabled': True, u'entry_point': u'', u'metadata_file': u'actions/local.yaml', u'output_schema': {}, u'uid': u'action:core:local', u'parameters': {u'cmd': {u'required': True, u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the local host.'}, u'sudo': {u'immutable': True}}, u'ref': u'core.local', u'id': u'59b7d5e6a814c02c97e8a97f', u'pack': u'core'}, 'liveaction': {u'runner_info': {}, u'parameters': {u'cmd': u'sleep 120', u'timeout': 180}, u'action_is_workflow': False, u'callback': {}, u'action': u'core.local', u'id': u'5cc9ec569387ef586c0c82c3'}, 'id': '5cc9ec569387ef586c0c82c4', 'end_timestamp': None})
Kami commented 5 years ago

Thanks for reporting this.

This indeed looks like a bug / race.

@m4dcoder will you have time to have a look this / next week (since you are the last person who worked on that and probably have the most context , otherwise I can have a look as well).

nmaludy commented 5 years ago

@m4dcoder ^^ (@Kami mistyped your name above)

m4dcoder commented 5 years ago

This is not a race. It's more of an annoying behavior. We introduced policy-delayed when we refactored the scheduler. An action execution goes to policy-delayed when the concurrency policy enforce delay for the execution. When the scheduler sees the policy enforcement, it delays the execution. In this case, the execution are being delayed and rescheduled and delayed again by policy. We can refactor the scheduler/policies to reduce and silence the effect.

m4dcoder commented 5 years ago

This is not a bug either to be precise. Is it performing the way it's suppose to? Did the concurrency policy delayed the action execution? The behavior of having many status change recorded in the object can be categorized as annoying.

nmaludy commented 5 years ago

@m4dcoder i found this behavior when debugging an issue where a bunch of our actions were stuck in a delayed state. I haven't able to create steps to reproduce that issue, but noticed that constantly reading/writing these actions from the database in this "busy loop" was probably not helping anything.

Kami commented 5 years ago

I was able to reproduce this issue, at least the flipping back and forth between policy-delayed and delayed status part.

Even if the end result and policies work correctly, this flipping back and forth is not desired, especially because on each of the state change we need to update execution object (log attribute on it) and that's very wasteful (expensive and unnecessary db write). It also causes unnecessary growing of the Execution object.

Example output:

st2 execution get  5cd14ac40761292469c559aa --attr log
log: - status: requested
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:16.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:18.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:18.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:20.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:20.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:22.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:22.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:24.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:24.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:26.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:26.000000Z'
- status: policy-delayed
  timestamp: '2019-05-07T09:07:28.000000Z'
- status: delayed
  timestamp: '2019-05-07T09:07:28.000000Z'
...

It could cause very massive execution objects for executions which are delayed for longer time periods (it flips status back and forth around every two seconds which is the POLICY_DELAYED_EXECUTION_RESCHEDULE_TIME_MS interval).

And it also generates tons of (unnecessary) log traffic in st2scheduler.

EDIT: Actual policies itself seem to work correctly, but this flipping back and forth still needs to be addressed.