StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.05k stars 745 forks source link

Concurrency policy only restricts executions in scheduled & running states #4779

Open trstruth opened 5 years ago

trstruth commented 5 years ago

SUMMARY

Concurrency policies on an action ref only checks for existing executions in scheduled & running states before executing. Paused workflows should be considered as well.

The code that implements the concurrency policy is here

    def _apply_before(self, target):
        # Get the count of scheduled instances of the action.
        scheduled = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_SCHEDULED)

        # Get the count of running instances of the action.
        running = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_RUNNING)

        count = scheduled + running

        # Mark the execution as scheduled if threshold is not reached or delayed otherwise.

I propose that the count should be extended to include some/all of

Consider:

    def _apply_before(self, target):
        # Get the count of non-completed instances of the action.

        # could move this into st2common.constants.action
        valid_concurrent_statuses = [
            action_constants.LIVEACTION_STATUS_REQUESTED,
            action_constants.LIVEACTION_STATUS_SCHEDULED,
            action_constants.LIVEACTION_STATUS_DELAYED,
            action_constants.LIVEACTION_STATUS_RUNNING,
            action_constants.LIVEACTION_STATUS_PENDING,
            action_constants.LIVEACTION_STATUS_PAUSING,
            action_constants.LIVEACTION_STATUS_PAUSED,
            action_constants.LIVEACTION_STATUS_RESUMING,
        ]
        count = sum([action_access.LiveAction.count(action=target.action, status=s) for s in valid_concurrent_statuses])

        # Mark the execution as scheduled if threshold is not reached or delayed otherwise.

I'd be willing to create a PR for this if the st2 team agrees that this is the desired behavior of the concurrency policy.

STACKSTORM VERSION

root@76ffd0a933ef:/# st2 --version
st2 3.1.0, on Python 2.7.6
OS, environment, install method

Docker mppc

Steps to reproduce the problem

Create a concurreny policy on a workflow that has a step which pauses it, create an execution of the workflow and wait until it pauses, create another workflow execution.

Expected Results

The concurrency policy should prevent another execution, as it is possible for both workflows to then be unpaused and running at the same time, which would break the concurrency policy.

Actual Results

Both workflow executions are created.

m4dcoder commented 5 years ago

The original intent of the concurrency policies is to limit the number of action executions currently executing, which if not bounded causes undesirable behavior such as number of concurrent operations on a resource like system, DB, file, etc. Therefore, the concurrency policies is only focused on executions that are either scheduled to run or is currently running. The action executions are not actually executing in states such as paused, pending, etc. There is a case to be made for pausing and canceling.

If we just start adding these states into the existing concurrency policies, this will change the existing behavior of st2 and will impact a lot of other users who may or may not have the same problem as you.

It seems like you have a use case in mind and specific to your need. May I suggest that you describe the use case you're trying to solve? We can renew this discussion when we understand better what problem you are trying to solve.

trstruth commented 5 years ago

Hi @m4dcoder, we have multiple users on our system who can run workflows that touch our production network. The workflows contain steps that, if run concurrently, can potentially cause damage. These workflows also contain inquiries to allow for a human to manually sign off on "checkpoints" in the workflow.

For safety reasons, we want to place a lock on a workflow ref with a given set of input args. That is, we only want one workflow ref/input arg combo to be able to run at one time, and the action.concurrency.attr seems to be targeted to this use case. Unfortunately, it does not consider the case where a workflow is paused, so it is possible to create a second execution while the first is paused. If an inquiry response unpauses the first execution, both executions could be running at the same time.

m4dcoder commented 5 years ago

@trstruth Thank you for describing your use case. So maybe we can look at one of the following options.

  1. Make current concurrency policies configurable to evaluate on given states provided by st2 admins.
  2. Implement a new concurrency policy that handles the state your use case requires.
  3. Refactor policy engine to be able to delayed the paused action execution from resuming.

Option 3 is the more appropriate but more difficult change. Option 1 and 2 are easier to implement if you want to tackle this on your own.