StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more for DevOps and SREs. Includes a rules engine, workflows, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Action executions running out of order when delayed #3225

Closed andrew-regan closed 7 years ago

andrew-regan commented 7 years ago

I have an Action with a concurrency policy that allows only 1 instance of the Action to run at a time. We’re intermittently seeing cases where multiple triggers are dispatched close together, each invoking this Action. Strangely, the Actions are not being invoked in the order I’d expect. I was under the impression that a FIFO queue controls how Actions are dispatched to runners, which would cause the Actions to run in order. The snippet below shows the logs for 2 executions of this Action. You can see that the logs indicate 587dcc8bb4eaa909d5ce0ebf was requested first, but it eventually gets delayed and 587dcc94b4eaa909d8bcc212 gets dispatched to a runner first.

This is seen in 2.1.1.
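For context, a policy like the one described above would look roughly like this. This is a hedged sketch: the pack and action names (my_pack.my_action) are placeholders, not the reporter's actual configuration; only the policy_type and threshold reflect the setup described.

```yaml
# Hypothetical policy file (e.g. policies/my_action_concurrency.yaml in a pack).
# Pack/action names are placeholders; threshold: 1 matches the single-instance
# behavior described in this issue.
---
name: my_pack.my_action.concurrency
description: Allow only one instance of the action to run at a time.
enabled: true
resource_ref: my_pack.my_action
policy_type: action.concurrency
parameters:
  threshold: 1
```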

Action Execution ID: 587dcc8bb4eaa909d5ce0ebf
st2actionrunner.2517.log:2017-01-17 07:49:31,934 139902086060944 AUDIT action [-] The status of action execution is changed from requested to delayed
st2actionrunner.2517.log:2017-01-17 07:49:31,961 139902086060944 INFO action [-] The status of action execution is changed from requested to delayed
st2actionrunner.2517.log:2017-01-17 07:49:31,962 139902086060944 INFO scheduler [-] ActionExecutionScheduler is ignoring <class 'st2common
st2actionrunner.2517.log:2017-01-17 07:49:31,969 139902086059344 AUDIT action [-] Action execution requested
st2actionrunner.2520.log:2017-01-17 07:49:41,708 140505266806288 AUDIT action [-] The status of action execution is changed from requested to delayed
st2actionrunner.2520.log:2017-01-17 07:49:41,756 140505266806288 INFO action [-] The status of action execution is changed from requested to delayed
st2actionrunner.2520.log:2017-01-17 07:49:41,757 140505266806288 INFO scheduler [-] ActionExecutionScheduler is ignoring <class 'st2common
st2actionrunner.2517.log:2017-01-17 07:49:51,954 139902086060464 AUDIT action [-] The status of action execution is changed from requested to scheduled
st2actionrunner.2517.log:2017-01-17 07:49:52,035 139902086060464 INFO action [-] The status of action execution is changed from requested to scheduled
st2actionrunner.2517.log:2017-01-17 07:49:52,208 139902086061904 AUDIT worker [-] Launching action execution
st2actionrunner.2517.log:2017-01-17 07:49:52,263 139902086061904 INFO worker [-] Dispatched {~}action_execution
st2actionrunner.2517.log:2017-01-17 07:49:52,275 139902086061904 INFO base [-] Dispatching Action to a runner
st2actionrunner.2517.log:2017-01-17 07:50:00,453 139902086061904 AUDIT base [-] Liveaction completed
Action Execution ID: 587dcc94b4eaa909d8bcc212
st2actionrunner.2517.log:2017-01-17 07:49:41,413 139902089005968 AUDIT action [-] The status of action execution is changed from requested to scheduled
st2actionrunner.2517.log:2017-01-17 07:49:41,463 139902089005968 INFO action [-] The status of action execution is changed from requested to scheduled
st2actionrunner.2520.log:2017-01-17 07:49:40,838 140505265953680 AUDIT action [-] Action execution requested
st2actionrunner.2520.log:2017-01-17 07:49:41,536 140505265952400 AUDIT worker [-] Launching action execution
st2actionrunner.2520.log:2017-01-17 07:49:41,563 140505265952400 INFO worker [-] Dispatched {~}action_execution: 587dcc94b4eaa909d8bcc212 / {~}live_action: 587dcc94b4eaa909d8bcc211 with "running" status
st2actionrunner.2520.log:2017-01-17 07:49:41,571 140505265952400 INFO base [-] Dispatching Action to a runner
st2actionrunner.2520.log:2017-01-17 07:49:51,707 140505265952400 AUDIT base [-] Liveaction completed
andrew-regan commented 7 years ago

Capturing some comments from Slack related to this from @m4dcoder:

"in current implementation, there's a soft FIFO queue. on completion of the last action instance, if there's any delayed execution, st2 policy engine will look for the earliest delayed from the DB. but from the log entries, here's what i can deduce. there's already an instance of the action running at 2017-01-17 07:49:31 so 587dcc8bb4eaa909d5ce0ebf was put on delay. at 2017-01-17 07:49:41,413, the previous action execution already finished and while the policy engine is still getting ready, 587dcc94b4eaa909d8bcc212 was in the queue and scheduled. at 2017-01-17 07:49:41,708, 587dcc8bb4eaa909d5ce0ebf was evaluated again and it was cut in line by milliseconds..."

"there's a separate queue for actions being requested and those that are delayed atm."
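The race described above can be illustrated with a toy model. This is not actual st2 code, just a minimal sketch of the two paths Winson mentions: new requests arrive on a queue and can grab a free slot immediately, while delayed executions sit in the DB and are only promoted when the policy engine re-evaluates them, so a newer request can cut in line during that window.

```python
# Toy model of the scheduling race (NOT actual StackStorm code).
import heapq

class ToyScheduler:
    def __init__(self, threshold=1):
        self.threshold = threshold
        self.running = set()
        self.delayed = []  # (request_time, exec_id) -- stands in for the DB

    def request(self, t, exec_id):
        """A new execution arrives on the request queue."""
        if len(self.running) >= self.threshold:
            heapq.heappush(self.delayed, (t, exec_id))
            return "delayed"
        self.running.add(exec_id)
        return "scheduled"

    def complete(self, exec_id):
        """A running execution finishes, freeing a slot."""
        self.running.discard(exec_id)

    def reevaluate_delayed(self):
        """Policy engine wakes up and promotes the earliest delayed run."""
        if self.delayed and len(self.running) < self.threshold:
            _, exec_id = heapq.heappop(self.delayed)
            self.running.add(exec_id)
            return exec_id
        return None

s = ToyScheduler(threshold=1)
s.request(0, "AAA")              # an instance is already running
print(s.request(31, "0ebf"))     # -> delayed (slot occupied)
s.complete("AAA")                # slot frees up (~07:49:41)
# Before the policy engine re-evaluates "0ebf", a fresh request
# arrives on the queue and grabs the free slot:
print(s.request(41, "c212"))     # -> scheduled, cutting in line
print(s.reevaluate_delayed())    # -> None, slot already taken
```

Running this shows the older execution staying delayed while the newer one is scheduled, matching the log ordering in the report.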

andrew-regan commented 7 years ago

Any update on this? @bigmstone Any chance the "to be verified" label means there's a fix in the works?

bigmstone commented 7 years ago

@andrew-regan I've not verified this bug and am on another issue at the moment. @lakshmi-kannan is this something you could verify this week while you're on support?

andrew-regan commented 7 years ago

@bigmstone is the fix in a dev branch somewhere? Didn't see a PR. Thanks!

LindsayHill commented 7 years ago

@andrew-regan I think that what he is saying is that "to be verified" means someone needs to check through this and verify that yes, this is indeed a bug, it is related to those initial thoughts from Winson, and is reproducible.

It doesn't mean "there is a fix that needs to be tested and here is some code." If there was a PR, it would be tagged with this issue, and the comments would read something like "here is a potential fix for this problem, it needs to be tested to make sure it does resolve this issue"

andrew-regan commented 7 years ago

Gotcha...just wasn't sure what the label implied. I'm upgrading to 2.2. Will let you know if it reproduces there as well.

bigmstone commented 7 years ago

@andrew-regan any update on whether you could reproduce this in 2.2?

andrew-regan commented 7 years ago

Still waiting on some test results from QA. We're at 2.2.1 so will let you know if we see it there.

bigmstone commented 7 years ago

@andrew-regan thanks. I'm tracking this issue, so I'll help you get it closed.

andrew-regan commented 7 years ago

Just an update. Testing with StackStorm 2.2.1 is starting on our side. Will let you know if we see this reproduce again.

andrew-regan commented 7 years ago

@bigmstone, sorry for the delay...we are no longer seeing this issue with 2.2.1.

LindsayHill commented 7 years ago

Thanks for the update @andrew-regan. Let's close it for now; re-open if it occurs in a current release.

zsmanjot commented 2 years ago

Still seeing the same thing. The workflow is put into a delayed state by the concurrency policy, and the order of execution is newest to oldest. Why does the first queued workflow run later? StackStorm version being used: st2 3.5.0, on Python 3.6.8

nzlosh commented 2 years ago

@zsmanjot This issue is from 2017 and was marked as closed after confirmation that it was no longer a problem in version 2.2.1. Please create a new issue with any relevant logs and details for the issue you're seeing, as the code base has evolved considerably over the last 4.5 years.

zsmanjot commented 2 years ago

Sure @nzlosh I am opening a new issue for this.