StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more for DevOps and SREs. It includes a rules engine, workflows, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org), and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Mistral: 2 bugs in one: workflow timeout/cancellation and wait-before #3292

Closed: emptywee closed this issue 4 years ago

emptywee commented 7 years ago

Hi,

Discovered two bugs while I was trying to reproduce only one :-)

Running latest st2 v2.2.0.

  1. Something is off with cancellation and timed-out tasks when nested tasks are Mistral workflows. If you cancel the parent one, it doesn't affect the children. Also, sometimes cancelling them causes all sorts of unexpected results, like indefinite chatops notifications being triggered. In my example the parent task times out, but the child one keeps running, and running, and running, and running, over and over. If you remove the wait-before parameter, it will finish after all retries are exhausted. Still, that doesn't make it a valid workaround.

  2. Adding wait-before to a task causes it to re-initialize previously published variables (at least it feels that way).

Providing simple workflows and an alias to reproduce it (sorry for the names I gave them):

wf_cancelation_issue.meta.yaml:

```
---
name: wf_cancelation_issue
parameters:
  skip_notify:
    default:
      - task
      - error
      - success
    type: array
    description: List of tasks to skip notifications for.
  task:
    type: string
    description: The name of the task to run for reverse workflow.
  workflow:
    type: string
    description: The name of the workflow to run if the entry_point is a workbook
      of many workflows. The name should be in the format "<pack_name>.<action_name>.<workflow_name>".
      If entry point is a workflow or a workbook with a single workflow, the runner
      will identify the workflow automatically.
  context:
    default: {}
    type: object
    description: Additional workflow inputs.
tags: []
description: Reproducing a bug with mistral when task is cancelled or timedout
enabled: true
entry_point: workflows/wf_cancelation_issue.yaml
notify: {}
uid: action:c_int:wf_cancelation_issue
pack: c_int
ref: c_int.wf_cancelation_issue
runner_type: mistral-v2
```

workflows/wf_cancelation_issue.yaml:

```
---
version: '2.0'

c_int.wf_cancelation_issue:

  tasks:
    task:
      action: core.noop
      on-success:
        - success

    success:
      action: c_int.wf_cancelation_issue_inner
      timeout: 30
```

wf_cancelation_issue_inner.meta.yaml:

```
---
name: wf_cancelation_issue_inner
parameters:
  skip_notify:
    default:
      - task1
      - increase_attempt_number
      - task3
      - end
    type: array
    description: List of tasks to skip notifications for.
  task:
    type: string
    description: The name of the task to run for reverse workflow.
  workflow:
    type: string
    description: The name of the workflow to run if the entry_point is a workbook
      of many workflows. The name should be in the format "<pack_name>.<action_name>.<workflow_name>".
      If entry point is a workflow or a workbook with a single workflow, the runner
      will identify the workflow automatically.
  context:
    default: {}
    type: object
    description: Additional workflow inputs.
  retries:
    type: integer
    required: false
    default: 5

tags: []
description: Reproducing a bug with mistral when task is cancelled or timedout
enabled: true
entry_point: workflows/wf_cancelation_issue_inner.yaml
notify: {}
uid: action:c_int:wf_cancelation_issue_inner
pack: c_int
ref: c_int.wf_cancelation_issue_inner
runner_type: mistral-v2
```

workflows/wf_cancelation_issue_inner.yaml:

```
---
version: '2.0'

c_int.wf_cancelation_issue_inner:
  type: direct
  input:
    - retries

  tasks:
    task1:
      action: core.noop
      on-success:
        - increase_attempt_number

    increase_attempt_number:
      action: core.noop
      publish:
        attempt: <% ($.get('attempt') or 0) + 1 %>
      on-success:
        - task3

    task3:
      wait-before: 10
      action: core.local
      input:
        cmd: 'echo <% $.attempt %>; exit 1'
      on-success:
        - end
      on-error:
        - increase_attempt_number: <% $.attempt < $.retries %>

    end:
      action: core.noop
```

aliases/wf_cancelation_issue.yaml:

```
---
name: alias_wf_cancelation_issue
enabled: true
action_ref: c_int.wf_cancelation_issue
description: Testing timeout and cancelation issue
formats:
  - display: "wf_cancel_test"
    representation:
      - "wf_cancel_test"
ack:
  enabled: true
  format: 'WF Cancelation and Timeout workflow started...'
  append_url: true
result:
  extra:
    slack:
      color: "{% if execution.result is defined and execution.result.extra is defined and execution.result.extra.state is defined and execution.result.extra.state == 'SUCCESS' %}#219939{% else %}#d80015{% endif %}"
  format: |
    WF cancelation and timeout task is complete. {~}
    {% if execution.result is defined and execution.result.extra is defined and execution.result.extra.state is defined and execution.result.extra.state == 'SUCCESS' %}
    All good.
    {% else %}
    No good.
    {% endif %}
```
m4dcoder commented 7 years ago

> Adding wait-before to a task causes it to re-initialize previously published variables

I'm unable to reproduce this.

m4dcoder commented 7 years ago

> If you cancel the parent one, it doesn't affect the children

To be clear, the parent workflow in this example calls another workflow action in StackStorm. The parent workflow is not a Mistral workbook calling other subworkflows defined within the workbook.

When a Mistral workflow is cancelled in StackStorm and a task is still running, the workflow will be marked cancelled but the task will not be affected; StackStorm and Mistral let the task(s) run to completion gracefully. If the task is another workflow action, you can request cancellation on that task, which will cancel the subworkflow and bubble the cancellation up to the parent.
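For example (a sketch of the CLI flow, not an official procedure): find the child execution's ID via `st2 execution list`, or by inspecting the parent with `st2 execution get <parent-execution-id>`, then request cancellation directly with `st2 execution cancel <child-execution-id>`.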

m4dcoder commented 7 years ago

> In my example the parent task times out, but the child one keeps running, and running, and running, and running, over and over.

The behavior of a timeout is similar to a cancel. Currently, if a task times out, the parent workflow will fail but the task is allowed to complete gracefully. This is by design: Mistral does not know what is being run in the task, and forcefully cancelling or terminating it may cause unintended consequences, so this is left to the user to manage. As for why the child task keeps running over and over: that probably has to do w/ your child task being in a retry loop, which is probably also why there are continuous chatops notifications.

m4dcoder commented 7 years ago

So in summary, this does not look like a bug to me. If you feel otherwise, please reopen the issue and present more details.

emptywee commented 7 years ago

Could you show what is printed by the core.local echo command if you keep the wait-before parameter in the task? I can post my results of running those workflows.

emptywee commented 7 years ago

[screenshot of the workflow execution output]

emptywee commented 7 years ago

If this is not a bug, I can't even imagine what would count as one, or how it could be handled by the end user.

As to handling cancellation of parent tasks, could you give a short example of how users can handle this? Let's say I want to gracefully stop my Mistral workflow if the parent workflow is cancelled, given that I don't know how many levels deep the workflow is nested. I mean, the workflow can be started by another workflow, which in turn was started by yet another Mistral workflow.

How is this supposed to be handled? I'd prefer a cascading cancellation request to be sent down to all running tasks started by the cancelled workflow, including all children, down to the very bottom. At least give us that option, even if it's not on by default.

I am very surprised (to say the least) that none of these flaws were considered worth looking at and/or fixing :-(

m4dcoder commented 7 years ago

@emptywee regarding the publish and wait-before issue you're reporting...

Sample Wait-Before Workflow

```
version: '2.0'

sandbox.pub_wait:
    type: direct
    output:
        vars:
            - <% $.var1 %>
            - <% $.var2 %>
    tasks:
        task1:
            action: core.noop
            publish:
                var1: 'abc'
            on-success:
                - task2
        task2:
            wait-before: 3
            action: core.local
            input:
                cmd: 'echo "<% $.var1 %>"'
            publish:
                var2: <% task(task2).result.stdout %>
```
Result

var1, published after task1, is echo'd correctly in task2 after the wait-before. Please look at the list of "vars" in the result below for confirmation.


    "action": {
        "ref": "sandbox.pub_wait"
    }, 
    "context": {
        "user": "stanley"
    }, 
    "end_timestamp": "2017-03-27T20:49:49.868060Z", 
    "id": "58d97ae41e2e242f04192b05", 
    "liveaction": {
        "action": "sandbox.pub_wait", 
        "action_is_workflow": true, 
        "callback": {}, 
        "id": "58d97ae41e2e242f04192b04", 
        "parameters": {}, 
        "runner_info": {
            "hostname": "arkham", 
            "pid": 11901
        }
    }, 
    "result": {
        "extra": {
            "state": "SUCCESS", 
            "state_info": null
        }, 
        "tasks": [
            {
                "created_at": "2017-03-27 20:49:40", 
                "id": "16784a1e-4614-4917-aa9f-c659df132568", 
                "input": null, 
                "name": "task1", 
                "published": {
                    "var1": "abc"
                }, 
                "result": {
                    "failed": false, 
                    "return_code": 0, 
                    "succeeded": true
                }, 
                "state": "SUCCESS", 
                "state_info": null, 
                "updated_at": "2017-03-27 20:49:41", 
                "workflow_execution_id": "6a251d4e-5f7f-40eb-82a2-f11b28228a34", 
                "workflow_name": "sandbox.pub_wait"
            }, 
            {
                "created_at": "2017-03-27 20:49:41", 
                "id": "f1160cd0-497f-42bb-b1bd-ba7caf83c98f", 
                "input": null, 
                "name": "task2", 
                "published": {
                    "var2": "abc"
                }, 
                "result": {
                    "failed": false, 
                    "return_code": 0, 
                    "stderr": "", 
                    "stdout": "abc", 
                    "succeeded": true
                }, 
                "state": "SUCCESS", 
                "state_info": null, 
                "updated_at": "2017-03-27 20:49:44", 
                "workflow_execution_id": "6a251d4e-5f7f-40eb-82a2-f11b28228a34", 
                "workflow_name": "sandbox.pub_wait"
            }
        ], 
        "vars": [
            "abc", 
            "abc"
        ]
    }, 
    "start_timestamp": "2017-03-27T20:49:40.189381Z", 
    "status": "succeeded"
}
```
m4dcoder commented 7 years ago

@emptywee I think the problem you're having is this YAQL expression: `<% ($.get('attempt') or 0) + 1 %>`. The `$.get('attempt')` lookup always falls back to the default, so the publish evaluates to 1 every time.
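(For reference, on the first pass `$.get('attempt')` yields null, so `(null or 0) + 1` evaluates to 1; a later pass that actually sees the published value should evaluate `(1 or 0) + 1`, i.e. 2, and so on. If every retry echoes the same number, the published value is never making it back into the workflow context.)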

m4dcoder commented 7 years ago

@emptywee So for the workflow below, I did find that it goes into an infinite loop; the published var is not incremented in this case, as you've found. Not sure why yet, but we'll track this as an issue. Thanks for your persistence.

```
version: '2.0'

sandbox.pub_wait:
    type: direct
    vars:
        var1: 0
        retries: 3
    output:
        vars:
            - <% $.var1 %>
            - <% $.var2 %>
    tasks:
        init:
            action: core.noop
            on-success:
                - task1
        task1:
            action: core.noop
            publish:
                var1: <% $.var1 + 1 %>
            on-success:
                - task2
        task2:
            wait-before: 3
            action: core.local
            input:
                cmd: 'echo "<% $.var1 %>"; exit 1'
            publish:
                var2: <% task(task2).result.stdout %>
            on-error:
                - task1: <% $.var1 <= $.retries %>
```

m4dcoder commented 7 years ago

As for the cascading cancellation of child workflow executions, we'll revisit this and look at options. Next time, it'll be more helpful if these are tracked as separate issues.

m4dcoder commented 7 years ago

> I am very surprised (to say the least) that none of these flaws were considered worth looking at and/or fixing :-(

@emptywee If they weren't worth looking at, we would not be having this conversation. I'm very disappointed in this comment, since we've been providing you w/ responsive and excellent support.

emptywee commented 7 years ago

Now we are both upset :( Sorry about that, I didn't mean it.

Still, I don't understand why you ask for sample workflows but then try to reproduce with a "slightly" different case.

Again, I apologize if it sounded wrong, offensive, or in any way inappropriate. I did not really mean anything by it; I just got a bit upset.

m4dcoder commented 7 years ago

Bug report registered in OpenStack @ https://bugs.launchpad.net/mistral/+bug/1681562

m4dcoder commented 7 years ago

The source of the wait-before bug has been identified at https://bugs.launchpad.net/mistral/+bug/1681562. Please follow the link to review the comments. We will need to wait for the rest of the Mistral core team to provide feedback on the use of the cache and how to work around this issue.
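In the meantime, a possible interim workaround (a sketch only, untested against this exact bug) is to drop the wait-before policy and model the delay as an explicit sleep task, so the re-scheduling path that loses published variables is never exercised. Adapting the inner repro workflow:

```
version: '2.0'

c_int.wf_cancelation_issue_inner:
  type: direct
  input:
    - retries

  tasks:
    task1:
      action: core.noop
      on-success:
        - increase_attempt_number

    increase_attempt_number:
      action: core.noop
      publish:
        attempt: <% ($.get('attempt') or 0) + 1 %>
      on-success:
        - delay

    # Explicit sleep task standing in for "wait-before: 10" on the next task.
    delay:
      action: core.local
      input:
        cmd: 'sleep 10'
      on-success:
        - task3

    task3:
      action: core.local
      input:
        cmd: 'echo <% $.attempt %>; exit 1'
      on-error:
        - increase_attempt_number: <% $.attempt < $.retries %>
```

Per the original report, the workflow behaves correctly (finishing after retries are exhausted) once wait-before is removed, which is what this sketch relies on.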

emptywee commented 7 years ago

Awesome. Let's hope for quick resolution. Thanks!

m4dcoder commented 7 years ago

Improvement on 2nd issue w/ canceling subworkflows @ https://github.com/StackStorm/st2/pull/3375

m4dcoder commented 7 years ago

Again, please file separate issues as separate posts next time.