zsmanjot opened 2 years ago
Can you share your policy, workflow, and action files? It's not easy to understand this without the context of your configuration.
I won't be able to share the workflow, but I have a copy of the policy. Here you go:
```yaml
name: zsi_smavd.concurrency
pack: zsnagios
description: Limits the concurrent executions of the SMAVD_MEMORY action.
enabled: true
resource_ref: zsnagios.smavd_mem
policy_type: action.concurrency
parameters:
  action: delay
  threshold: 1
```
OK, I think I'm misunderstanding the initial issue you're having. Again, it's difficult to assist when there's a lack of configuration information.
Anyhow, I'm assuming `zsnagios.smavd_mem` is the action that references your actual workflow. I don't think there's any specific order to how workflows are executed. The policy just manages how many instances of the named action can execute at any one time. When an action execution is created it's put into a queue, and any one of the action runner threads can pick it up to execute.
If you need a specific order to your workflows you'll need to come up with something else, like a workflow that executes the other workflows in a specific order.
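For illustration, a minimal Orquesta sketch of that pattern could look like the following. The sub-workflow action names (`mypack.restart_xyz`, etc.) are placeholders, not anything from your setup:

```yaml
version: 1.0

description: Chain existing workflows so they run strictly one after another.

tasks:
  run_xyz:
    action: mypack.restart_xyz        # placeholder sub-workflow action
    next:
      - when: <% succeeded() %>
        do: run_abc
  run_abc:
    action: mypack.restart_abc
    next:
      - when: <% succeeded() %>
        do: run_123
  run_123:
    action: mypack.restart_123
```

Each task only starts after the previous one succeeds, so the ordering is enforced by the workflow definition rather than by whatever order the scheduler happens to pick executions off the queue.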
Hi @minsis
So basically, it is a single workflow on which concurrency has been applied, and the workflow is invoked when we get triggers. Because we do not want to act on all the triggers for the same workflow at the same time, we have put the concurrency policy in place. Now the workflow runs for only one trigger at a time and the rest get queued up because of concurrency (as expected). But we have seen that the one queued first is executed later, and the one queued last is executed first. So it effectively delays the remediation of an alert that is already queued, which is something we cannot afford in a production environment.
Any thoughts about it?
Without digging into the code I'm not sure why it would be going LIFO; normally MQ goes FIFO. I don't know what was intended here, if anything at all.
This is what I am wondering. It should be FIFO, but that's not the case. The intention is to not interrupt the functioning of the production services. So basically, the workflow restarts a critical service when we get an alert that it requires a restart, and because we cannot afford to restart all the services at the same time, we are leveraging concurrency to proceed with the executions one at a time.
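For context, the trigger-to-action wiring is roughly of this shape; the trigger type, criteria, and parameters below are placeholders for illustration, not the real values:

```yaml
# Illustrative rule shape only; the actual trigger type and criteria depend on
# how the Nagios alerts reach StackStorm (e.g. a webhook or a sensor trigger).
name: restart_on_memory_alert
pack: zsnagios
description: Invoke the remediation workflow whenever a restart alert arrives.
enabled: true

trigger:
  type: core.st2.webhook            # assumption: alerts arrive via a webhook
  parameters:
    url: nagios_alerts              # hypothetical webhook name

criteria:
  trigger.body.state:
    type: equals
    pattern: CRITICAL               # hypothetical alert field and value

action:
  ref: zsnagios.smavd_mem           # the action the concurrency policy targets
  parameters:
    host: "{{ trigger.body.host }}" # hypothetical parameter
```

Each matching alert creates a new execution of `zsnagios.smavd_mem`, and the concurrency policy serializes them.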
Your need makes sense, but relying on queued messages for ordering doesn't. You shouldn't be relying on a trigger system to queue up a bunch of service restarts and pray that they go in order. It should be a single trigger that executes a workflow running a handful of tasks, in order, for the services that need to be restarted.
Yours:
Trigger "restart service xyz" -> Run restart service workflow
Trigger "restart service abc" -> Run restart service workflow
Trigger "restart service 123" -> Run restart service workflow
Trigger "restart service 987" -> Run restart service workflow
It should be:
Trigger "restart services" -> Run restart services workflow
-> Task "restart service xyz"
-> Task "restart service abc"
-> Task "restart service 123"
-> Task "restart service 987"
Yes, that makes sense. But the thing is, we never know when any service will require a restart. It might be 2 or 3 alerts arriving together, or it might be that one alert comes at some point and the next one comes 20 minutes later. First case:
Trigger "restart service abc at 10:20" --> Run restart service workflow
Trigger "restart service xyz at 10:20" --> Run restart service workflow
And the second case:
Trigger "restart service abc at 10:20" --> Run restart service workflow
Trigger "restart service xyz at 10:35" --> Run restart service workflow
Anyhow, whatever approach is followed, the sequence shouldn't vary. It should be FIFO.
I have a workflow with a concurrency policy that only allows 1 instance of the action to run at a time, but I can see that the sequence followed here is sometimes LIFO (last in, first out). The logs show that the triggers happened in the sequence 6220edd5a3eb83216afd4cd5 first and then 6220ee28a3eb83216afd4ce3, and the delay sequence is the same, but the execution sequence is 6220ee28a3eb83216afd4ce3 first and 6220edd5a3eb83216afd4cd5 later.
Logs from the scheduler, for ID 6220edd5cefb8a6ece90577e:
```
2022-03-03 16:33:25,869 139825213000440 INFO handler [-] [6220edd5a3eb83216afd4cd5] Retrieved item "6220edd5cefb8a6ece90577e" from scheduling queue.
2022-03-03 16:33:25,871 139825213000896 INFO handler [-] [6220edd5a3eb83216afd4cd5] Scheduling Liveaction "6220edd5a3eb83216afd4cd4". (queue_item_id='6220edd5cefb8a6ece90577e')
2022-03-03 16:33:25,874 139825213000896 INFO handler [-] [6220edd5a3eb83216afd4cd5] Liveaction "6220edd5a3eb83216afd4cd4" has status "requested" before applying policies. (queue_item_id='6220edd5cefb8a6ece90577e')
2022-03-03 16:33:25,876 139825213000896 INFO policies [-] Applying policy "zsmanjot.process.concurrency" (action.concurrency) for liveaction "6220edd5a3eb83216afd4cd4".
```
For ID 6220ee28a3eb83216afd4ce3:
```
2022-03-03 16:34:49,086 139825213000440 INFO handler [-] [6220ee28a3eb83216afd4ce3] Retrieved item "6220ee28cefb8a6ece905780" from scheduling queue.
2022-03-03 16:34:49,088 139825213000896 INFO handler [-] [6220ee28a3eb83216afd4ce3] Scheduling Liveaction "6220ee28a3eb83216afd4ce2". (queue_item_id='6220ee28cefb8a6ece905780')
2022-03-03 16:34:49,092 139825213000896 INFO handler [-] [6220ee28a3eb83216afd4ce3] Liveaction "6220ee28a3eb83216afd4ce2" has status "requested" before applying policies. (queue_item_id='6220ee28cefb8a6ece905780')
2022-03-03 16:34:49,094 139825213000896 INFO policies [-] Applying policy "zsmanjot.process.concurrency" (action.concurrency) for liveaction "6220ee28a3eb83216afd4ce2".
```
The workflow execution for 6220ee28a3eb83216afd4ce3 starts running first, at 16:34:57:

```
2022-03-03 16:34:57,024 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Processing request for workflow execution.
2022-03-03 16:34:57,045 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Requesting conductor to start running workflow execution.
2022-03-03 16:34:57,050 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Identifying next set (iter 0) of tasks for workflow execution in status "running".
2022-03-03 16:34:57,057 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Identified the following set of tasks to execute next: check_for_inprogress_hosts (route 0)
2022-03-03 16:34:57,061 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Mark task "check_for_inprogress_hosts", route "0", in conductor as running.
2022-03-03 16:34:57,061 139653797263968 INFO workflows [-] [6220ee28a3eb83216afd4ce3] Updating workflow execution from **status "requested" to "running"**.
```
The workflow execution for 6220edd5a3eb83216afd4cd5, which was queued first, only starts at 16:36:22:

```
2022-03-03 16:36:22,267 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Processing request for workflow execution.
2022-03-03 16:36:22,290 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Requesting conductor to start running workflow execution.
2022-03-03 16:36:22,294 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Identifying next set (iter 0) of tasks for workflow execution in status "running".
2022-03-03 16:36:22,300 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Identified the following set of tasks to execute next: check_for_inprogress_hosts (route 0)
2022-03-03 16:36:22,303 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Mark task "check_for_inprogress_hosts", route "0", in conductor as running.
2022-03-03 16:36:22,304 139653800653632 INFO workflows [-] [6220edd5a3eb83216afd4cd5] Updating workflow execution from **status "requested" to "running"**.
```