StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.08k stars 747 forks source link

Workflow executions running out of order when delayed. #5587

Open zsmanjot opened 2 years ago

zsmanjot commented 2 years ago

I have a workflow with a concurrency policy that only allows 1 instance of the Action to run at a time. But i could see that the sequence followed here is LIFO(Last in first out) sometimes. The logs states that the trigger happened in the sequence of 6220edd5a3eb83216afd4cd5 first and then 6220ee28a3eb83216afd4ce3 , also the delay sequence is same ,but the execution sequence is 6220ee28a3eb83216afd4ce3 first and 6220edd5a3eb83216afd4cd5 later.

Logs from scheduler: For ID 6220edd5cefb8a6ece90577e

6144 2022-03-03 16:33:25,869 139825213000440 INFO handler [-] [6220edd5a3eb83216afd4cd5] Retrieved item "6220edd5cefb8a6ece90577e" from scheduling queue. 6145 2022-03-03 16:33:25,871 139825213000896 INFO handler [-] [6220edd5a3eb83216afd4cd5] Scheduling Liveaction "6220edd5a3eb83216afd4cd4". (queue_item_id='6220edd5cefb8a6ece90577e') 6146 2022-03-03 16:33:25,874 139825213000896 INFO handler [-] [6220edd5a3eb83216afd4cd5] Liveaction "6220edd5a3eb83216afd4cd4" has status "requested" before applying policies. (queue_item_id='6220edd5c efb8a6ece90577e') 6147 2022-03-03 16:33:25,876 139825213000896 INFO policies [-] Applying policy "zsmanjot.process.concurrency" (action.concurrency) for liveaction "6220edd5a3eb83216afd4cd4".

For ID 6220ee28a3eb83216afd4ce3 6351 2022-03-03 16:34:49,086 139825213000440 INFO handler [-] [6220ee28a3eb83216afd4ce3] Retrieved item "6220ee28cefb8a6ece905780" from scheduling queue. 6352 2022-03-03 16:34:49,088 139825213000896 INFO handler [-] [6220ee28a3eb83216afd4ce3] Scheduling Liveaction "6220ee28a3eb83216afd4ce2". (queue_item_id='6220ee28cefb8a6ece905780') 6353 2022-03-03 16:34:49,092 139825213000896 INFO handler [-] [6220ee28a3eb83216afd4ce3] Liveaction "6220ee28a3eb83216afd4ce2" has status "requested" before applying policies. (queue_item_id='6220ee28c efb8a6ece905780') 6354 2022-03-03 16:34:49,094 139825213000896 INFO policies [-] Applying policy "zsmanjot.process.concurrency" (action.concurrency) for liveaction "6220ee28a3eb83216afd4ce2".

2022-03-03 16:34:57,024 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Processing request for workflow execution. 2022-03-03 16:34:57,045 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Requesting conductor to start running workflow execution. 2022-03-03 16:34:57,050 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Identifying next set (iter 0) of tasks for workflow execution in status "running". 2022-03-03 16:34:57,057 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Identified the following set of tasks to execute next: check_for_inprogress_hosts (route 0) 2022-03-03 16:34:57,061 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Mark task "check_for_inprogress_hosts", route "0", in conductor as running. 2022-03-03 16:34:57,061 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Updating workflow execution from **status "requested" to "running"**.

2022-03-03 16:36:22,267 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Processing request for workflow execution. 2022-03-03 16:36:22,290 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Requesting conductor to start running workflow execution. 2022-03-03 16:36:22,294 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Identifying next set (iter 0) of tasks for workflow execution in status "running". 2022-03-03 16:36:22,300 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Identified the following set of tasks to execute next: check_for_inprogress_hosts (route 0) 2022-03-03 16:36:22,303 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Mark task "check_for_inprogress_hosts", route "0", in conductor as running. 2022-03-03 16:36:22,304 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Updating workflow execution from **status "requested" to "running"**.

minsis commented 2 years ago

Can you share your policy, workflow, and action files? Its not easy to understand this without context of your configurations.

zsmanjot commented 2 years ago

Wont be able to share the workflow though but i have the copy of policy. Here you go: Policy


name: zsi_smavd.concurrency pack: zsnagios description: Limits the concurrent executions of the SMAVD_MEMORY action. enabled: true resource_ref: zsnagios.smavd_mem policy_type: action.concurrency parameters: action: delay threshold: 1

minsis commented 2 years ago

Ok I think I'm misunderstanding the initial issue you're having. Again, its difficult to assist when there's a lack of configuration information.

Anyhow, I'm assuming zsnagios.smavd_mem is the action that references your actual workflow. I don't think there's any specific order to how workflows are executed. The policy just manages how many named actions can execute at any one given time. When an action is created its put into a queue and any one of the action runner threads can pick it up to execute.

If you need a specific order to your workflows you'll need to come up with something else. Like a workflow that executes other workflows in a specific order.

zsmanjot commented 2 years ago

HI @minsis

So basically, it is a single workflow on which concurrency has been applied. And when we get triggers then workflow is invoked. Because we donot want to take actions on all the triggers at the same time for the same workflow , so we have put concurrency. Now at a time only on one trigger the workflow runs and rest all gets queued up because of concurrency (as expected). But it is seen that the one that is queued first is executed later and the one that is queued last is executed first. So, its kind of that it is actually delaying the remediation on already queued up alert. This is something one cant afford in production environment.

Any thoughts about it?

minsis commented 2 years ago

Without digging into code I'm not sure why it would be going LIFO, normally MQ goes FIFO. I dont know what was intended here, if anything at all.

zsmanjot commented 2 years ago

This is what i am wondering. It should be FIFO. But its not the case. The intention is to not interrupt the functioning of the production services. So, basically the workflow is restarting a critical service if we get the alert that it requires restart and because we can not afford to restart all the services at the same time, we are leveraging concurrency to proceed with the executions one at a time.

minsis commented 2 years ago

Your need makes sense, but what doesn't make sense is the need for order of operation by relying on queued messages. You shouldn't be relying on a trigger system to queue a bunch of services to restart and pray that it goes in order. It should be a single trigger to execute a workflow that executes a handful of tasks (in order) of the services that need to be restarted.

Yours:

Trigger "restart service xyz" -> Run restart service workflow
Trigger "restart service abc" -> Run restart service workflow
Trigger "restart service 123" -> Run restart service workflow
Trigger "restart service 987" -> Run restart service workflow

It should be:

Trigger "restart services" -> Run restart services workflow
                            -> Task "restart service xyz"
                            -> Task "restart service abc"
                            -> Task "restart service 123"
                            -> Task "restart service 987"
zsmanjot commented 2 years ago

Yes, makes sense. But the thing is, we never know when any service will require restart. It might be 2 or 3 alerts together or it might be the case that 1 alert came at some time and then the next consecutive alert came 20 minutes later.

Trigger "restart service abc at 10:20" --> Run restart service workflow

Trigger "restart service xyz at 10:20" --> Run restart service workflow

and second case :

Trigger "restart service abc at 10:20" --> Run restart service workflow

Trigger "restart service xyz at 10:35" --> Run restart service workflow

Anyhow whatever approach is followed but the sequence shouldn't vary,. It should be FIFO