StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Increase execution time for StackStorm workflows #5938

Open garadar opened 1 year ago

garadar commented 1 year ago

Issue Description

When I launch multiple workflows at the same time in StackStorm, I've noticed that the execution time of my workflows increases significantly.

STACKSTORM VERSION

st2 3.7.0, on Python 3.6.8

OS, environment, install method

Post what OS you are running this on, along with any other relevant information.

Steps to Reproduce

1. Launch multiple workflows at the same time in StackStorm.
2. Observe the execution time of each workflow.

Expected Behavior

The execution time of each workflow should remain constant, regardless of how many workflows are launched simultaneously.

Actual Behavior

The execution time of each workflow increases as more workflows are launched simultaneously.

Here is an example of launching the same workflow many times. Under normal behavior the execution time is about 25 seconds, or about 2 seconds when a cancel check makes the run exit early. As the table below shows, the more workflows I run concurrently, the longer each execution takes.

The issue seems to be that the workflow actions are getting stuck in a "pending" state, and we are not seeing any CPU overload in htop.

+--------------------------+-------------------------------+-------------------------------+
| status                   | start_timestamp               | end_timestamp                 |
+--------------------------+-------------------------------+-------------------------------+
| succeeded (25s elapsed)  | Fri, 17 Mar 2023 13:20:32 CET | Fri, 17 Mar 2023 13:20:57 CET |
| succeeded (33s elapsed)  | Fri, 17 Mar 2023 13:28:46 CET | Fri, 17 Mar 2023 13:29:19 CET |
| succeeded (2s elapsed)   | Fri, 17 Mar 2023 13:28:50 CET | Fri, 17 Mar 2023 13:28:52 CET |
| succeeded (3s elapsed)   | Fri, 17 Mar 2023 13:28:53 CET | Fri, 17 Mar 2023 13:28:56 CET |
| succeeded (2s elapsed)   | Fri, 17 Mar 2023 13:28:57 CET | Fri, 17 Mar 2023 13:28:59 CET |
| succeeded (162s elapsed) | Fri, 17 Mar 2023 13:29:01 CET | Fri, 17 Mar 2023 13:31:43 CET |
| succeeded (217s elapsed) | Fri, 17 Mar 2023 13:29:04 CET | Fri, 17 Mar 2023 13:32:41 CET |
| succeeded (258s elapsed) | Fri, 17 Mar 2023 13:29:08 CET | Fri, 17 Mar 2023 13:33:26 CET |
| succeeded (285s elapsed) | Fri, 17 Mar 2023 13:29:12 CET | Fri, 17 Mar 2023 13:33:57 CET |
| succeeded (311s elapsed) | Fri, 17 Mar 2023 13:29:16 CET | Fri, 17 Mar 2023 13:34:27 CET |
| succeeded (335s elapsed) | Fri, 17 Mar 2023 13:29:19 CET | Fri, 17 Mar 2023 13:34:54 CET |
| succeeded (354s elapsed) | Fri, 17 Mar 2023 13:29:23 CET | Fri, 17 Mar 2023 13:35:17 CET |
| succeeded (370s elapsed) | Fri, 17 Mar 2023 13:29:26 CET | Fri, 17 Mar 2023 13:35:36 CET |
| succeeded (383s elapsed) | Fri, 17 Mar 2023 13:29:30 CET | Fri, 17 Mar 2023 13:35:53 CET |
| succeeded (395s elapsed) | Fri, 17 Mar 2023 13:29:34 CET | Fri, 17 Mar 2023 13:36:09 CET |
| succeeded (400s elapsed) | Fri, 17 Mar 2023 13:29:38 CET | Fri, 17 Mar 2023 13:36:18 CET |
| succeeded (406s elapsed) | Fri, 17 Mar 2023 13:29:41 CET | Fri, 17 Mar 2023 13:36:27 CET |
| succeeded (409s elapsed) | Fri, 17 Mar 2023 13:29:45 CET | Fri, 17 Mar 2023 13:36:34 CET |
| succeeded (412s elapsed) | Fri, 17 Mar 2023 13:29:49 CET | Fri, 17 Mar 2023 13:36:41 CET |
| succeeded (415s elapsed) | Fri, 17 Mar 2023 13:29:52 CET | Fri, 17 Mar 2023 13:36:47 CET |
| succeeded (299s elapsed) | Fri, 17 Mar 2023 13:29:56 CET | Fri, 17 Mar 2023 13:34:55 CET |
| succeeded (419s elapsed) | Fri, 17 Mar 2023 13:29:59 CET | Fri, 17 Mar 2023 13:36:58 CET |
| succeeded (419s elapsed) | Fri, 17 Mar 2023 13:30:03 CET | Fri, 17 Mar 2023 13:37:02 CET |
| succeeded (421s elapsed) | Fri, 17 Mar 2023 13:30:07 CET | Fri, 17 Mar 2023 13:37:08 CET |
| succeeded (422s elapsed) | Fri, 17 Mar 2023 13:30:11 CET | Fri, 17 Mar 2023 13:37:13 CET |
| succeeded (424s elapsed) | Fri, 17 Mar 2023 13:30:14 CET | Fri, 17 Mar 2023 13:37:18 CET |
| succeeded (421s elapsed) | Fri, 17 Mar 2023 13:30:18 CET | Fri, 17 Mar 2023 13:37:19 CET |
| succeeded (421s elapsed) | Fri, 17 Mar 2023 13:30:21 CET | Fri, 17 Mar 2023 13:37:22 CET |
| succeeded (420s elapsed) | Fri, 17 Mar 2023 13:30:25 CET | Fri, 17 Mar 2023 13:37:25 CET |
| succeeded (419s elapsed) | Fri, 17 Mar 2023 13:30:29 CET | Fri, 17 Mar 2023 13:37:28 CET |
| succeeded (418s elapsed) | Fri, 17 Mar 2023 13:30:32 CET | Fri, 17 Mar 2023 13:37:30 CET |
| succeeded (418s elapsed) | Fri, 17 Mar 2023 13:30:36 CET | Fri, 17 Mar 2023 13:37:34 CET |
| succeeded (415s elapsed) | Fri, 17 Mar 2023 13:30:40 CET | Fri, 17 Mar 2023 13:37:35 CET |
| succeeded (414s elapsed) | Fri, 17 Mar 2023 13:30:43 CET | Fri, 17 Mar 2023 13:37:37 CET |
| succeeded (417s elapsed) | Fri, 17 Mar 2023 13:30:47 CET | Fri, 17 Mar 2023 13:37:44 CET |
| succeeded (320s elapsed) | Fri, 17 Mar 2023 13:30:51 CET | Fri, 17 Mar 2023 13:36:11 CET |
| succeeded (406s elapsed) | Fri, 17 Mar 2023 13:30:54 CET | Fri, 17 Mar 2023 13:37:40 CET |
| succeeded (29s elapsed)  | Fri, 17 Mar 2023 13:30:58 CET | Fri, 17 Mar 2023 13:31:27 CET |
| succeeded (404s elapsed) | Fri, 17 Mar 2023 13:31:01 CET | Fri, 17 Mar 2023 13:37:45 CET |
| succeeded (401s elapsed) | Fri, 17 Mar 2023 13:31:05 CET | Fri, 17 Mar 2023 13:37:46 CET |
| succeeded (398s elapsed) | Fri, 17 Mar 2023 13:31:09 CET | Fri, 17 Mar 2023 13:37:47 CET |
| succeeded (396s elapsed) | Fri, 17 Mar 2023 13:31:12 CET | Fri, 17 Mar 2023 13:37:48 CET |
| succeeded (393s elapsed) | Fri, 17 Mar 2023 13:31:16 CET | Fri, 17 Mar 2023 13:37:49 CET |
| succeeded (389s elapsed) | Fri, 17 Mar 2023 13:31:20 CET | Fri, 17 Mar 2023 13:37:49 CET |
| succeeded (386s elapsed) | Fri, 17 Mar 2023 13:31:24 CET | Fri, 17 Mar 2023 13:37:50 CET |
+--------------------------+-------------------------------+-------------------------------+
fdrab commented 1 year ago

Are you doing this via re-run of previous runs? Or brand new flow runs?

guzzijones commented 1 year ago

Most likely this is due to MongoDB bottlenecking. Please post your workflow, though, so we can explain in more detail.

garadar commented 1 year ago

Thank you. Here is my workflow:


version: 1.0

description: "Engage the deletion process"

input:
  - username

# Be careful with the filename
vars:
  - tmp_mail_path: '/opt/stackstorm/packs/blablabla/actions/workflows/template'
  - mail_file: '<% ctx(tmp_mail_path) %>/<% ctx(username) %>.mail'

tasks:
  check_key_store:
    action: core.local_sudo
    input:
      cmd: st2 key get deleting_process_<% ctx(username) %>
    next:
      -
        when: <% failed() %>
        do: get_user_info
      - when: <%  succeeded() %>
        do: noop

  # ## start engage deletion normal
  get_user_info:
    action: unige.get_user_info
    input:
      username: <% ctx().username %>
    next:
      -
        publish:
          - user_info: <% result().output.result %>
        when: <% succeeded()  %>
        do:
          - check_lastlog
      -
        when: <% failed() %>
        publish:
          - user_info: <% dict(user => ctx().username, gecos => "", mail => "", lastlog => "", PIname => "", PIgecos => "", PImail => "", quota => "") %>
        do: generate_mail_deletion_direct

  check_lastlog:
    action: core.local
    input:
      cmd: test <% ctx(user_info).lastlog %> -lt $(date -d "now - 11 month" +%Y%m%d)
    next:
      - when: <% succeeded() %>
        do: generate_mail_deletion_direct

  generate_mail_deletion_direct:
    action: core.local_sudo
    input:
      cmd: |
        sed -e "s/XXX_USER_XXX/<% ctx(user_info).user %>/g" \
        -e "s/XXX_USER_MAIL_XXX/<% ctx(user_info).mail %>/g" \
        -e "s/XXX_USER_NAME_XXX/<% ctx(user_info).gecos %>/g" \
        <% ctx(tmp_mail_path) %>/mail_deletion_first_contact.mail > <% ctx(mail_file) %>.firstcontact
    next:
      -
        do: send_mail

  send_mail:
    action: core.local
    input:
      cmd: sendmail -vt < <% ctx(mail_file) %>.firstcontact
    next:
      - do: get_current_date

  get_current_date:
    action: core.local
    input:
      cmd: date +%Y%m%d
    next:
      - publish:
          - current_date: <% result().stdout %>
        do:
          - get_deadline

  get_deadline:
    action: core.local
    input:
      cmd: date -d "today +1 month" +%Y%m%d
    next:
      - publish:
          - deadline: <% result().stdout %>
        do:
          - set_key_store

  set_key_store:
    action: st2.kv.set_object
    input:
      key: deleting_process_<% ctx(username) %>
      value: '{"start_time": <% ctx(current_date) %>, "deadline": <% ctx(deadline) %>}'
## End engage deletion
garadar commented 1 year ago

Are you doing this via re-run of previous runs? Or brand new flow runs?

Each time it is a brand new execution of the workflow.

guzzijones commented 1 year ago

There is only one workflow engine, so if the context of this workflow starts getting large, that will slow the workflow itself down. I don't see any obviously large JSONs here. How many st2 action runners do you have running? Likely you are saturating them. Add more st2 action runners and see what happens.
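As a quick starting point, a couple of generic commands to see how many runner processes are actually up before adding more (this assumes a package-based install; the WORKERS detail below is an assumption to verify against your own unit files):

sudo st2ctl status            # lists the st2 services and their PIDs
pgrep -fc "st2actionrunner"   # counts running action runner processes
# On package installs the runner count is commonly driven by a WORKERS
# variable in /etc/default/st2actionrunner (Debian/Ubuntu) or
# /etc/sysconfig/st2actionrunner (RHEL/CentOS); treat these paths as an
# assumption and check your own setup.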

guzzijones commented 1 year ago

You could also offload the local_sudo commands to a remote VM using core.remote. That way your StackStorm server doesn't have to take the load of formatting and sending the email.
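For illustration, here is roughly what the mail step could look like if it were pushed to a separate host with core.remote. The hostname is a hypothetical placeholder, and the generated .firstcontact file would also have to be created on (or copied to) that host first:

  send_mail:
    action: core.remote
    input:
      # "mail-relay.example.org" is a placeholder utility host reachable over SSH.
      hosts: mail-relay.example.org
      cmd: sendmail -vt < <% ctx(mail_file) %>.firstcontact
    next:
      - do: get_current_date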

fdrab commented 1 year ago

There is only one workflow engine, so if the context of this workflow starts getting large, that will slow the workflow itself down. I don't see any obviously large JSONs here. How many st2 action runners do you have running? Likely you are saturating them. Add more st2 action runners and see what happens.

Isn't this an obvious bottleneck, though? I'm doing some performance testing currently. To illustrate, I'm reading ServiceNow workgroups (16 of them in parallel) to look for incidents to process, and I'm further processing the returned JSONs (ranging from a few tickets to payloads around 0.25 MB) by applying various filters and Python scripts to them. What I've noticed is that the workflow engine fully utilizes a single core at 99% while the rest of the system is relatively idle, resulting in a dramatic performance decrease for workflows even though the server is only about 15% CPU loaded. This also manifests as a workflow consisting of 4 tasks that take 3 seconds each having a runtime of 25+ seconds, due to slow task transitions. I've set the worker amount to 30 and also increased the worker thread pool size and action thread pool size.

The current hardware setup is a ProLiant DL360 with 16 cores and 64 GB of memory. The instance is not running in a container, and both the instance and MongoDB are on a single server.

Can these issues be alleviated in an HA setup with multiple st2workflowengine containers?
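(Referring back to the single-core observation above: a generic way to watch the workflow engine's threads directly, using standard Linux tooling rather than anything st2-specific, is something like the following.)

# Per-thread CPU usage for the workflow engine process(es).
top -H -p "$(pgrep -d, -f st2workflowengine)"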

nzlosh commented 1 year ago

@fdrab From what you've described, I suggest you write a custom sensor to handle the polling and data processing and generate event triggers that would call the appropriate workflow via rule match. This would unburden the workflow engine from all the data processing.
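For illustration, a minimal sketch of the kind of polling sensor being suggested. The pack name, trigger name, payload fields, and the ServiceNow query helper are hypothetical placeholders, not code from this thread, and the trigger would still need to be declared in the sensor's metadata YAML:

from st2reactor.sensor.base import PollingSensor


class IncidentPollingSensor(PollingSensor):
    def setup(self):
        # One-time initialization (API clients, config parsing) goes here.
        self._logger = self.sensor_service.get_logger(__name__)
        self._trigger_ref = "servicenow_demo.incident_ready"  # hypothetical pack.trigger

    def poll(self):
        # Keep failures inside the sensor so one bad poll cycle does not kill it.
        try:
            incidents = self._fetch_open_incidents()
        except Exception as exc:
            self._logger.error("Polling failed: %s", exc)
            return

        for incident in incidents:
            # One trigger per incident; a rule maps the trigger to the workflow,
            # so the workflow engine only ever sees already-filtered work items.
            self.sensor_service.dispatch(
                trigger=self._trigger_ref,
                payload={"number": incident["number"], "priority": incident["priority"]},
            )

    def cleanup(self):
        pass

    def add_trigger(self, trigger):
        pass

    def update_trigger(self, trigger):
        pass

    def remove_trigger(self, trigger):
        pass

    def _fetch_open_incidents(self):
        # Placeholder for the actual ServiceNow query and filtering logic.
        return []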

garadar commented 1 year ago

So it's not recommended to run several workflows by hand? Should we run them via sensor => rule => execution instead?

guzzijones commented 1 year ago

I am working on a PR in Orquesta to remove all the deepcopy calls. We are also working on adding zstandard compression and removing redundant action data. Likely you are hitting the maximum that a single workflow engine can handle.

fdrab commented 1 year ago

@guzzijones This is great, thank you! My use case involves processing of a lot of ServiceNow workgroups and executing a LOT of workflows on tickets, oftentimes simultaneously (due to the polling nature). I've already written a sensor for querying and data processing so that we don't rely on orquesta workflows to do the job, thereby offloading this work to a dedicated sensor process.

However, I'd like to point out that one of the core features of ST2 is the ability to create complex workflows out of simple operations and to use those complex workflows in even more complex flows. Therefore, having to rely on a single st2workflowengine in a non-HA setting, which is relatively easy to slow down to a crawl, seems a bit puzzling. I wish we were able to increase the number of these processes even on a standalone installation, the same as we can with st2actionrunner processes.

guzzijones commented 1 year ago

Be aware that sensors die silently after 2 failures and their state is not saved.

guzzijones commented 1 year ago

You could copy the systemd script to add more workflow engines.
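A rough sketch of what that could look like on a package-based install; the unit path and naming below are assumptions, so verify them against your own system before copying anything:

sudo cp /lib/systemd/system/st2workflowengine.service /etc/systemd/system/st2workflowengine2.service
sudo systemctl daemon-reload
sudo systemctl enable --now st2workflowengine2
# Confirm that more than one engine process is now consuming from the queue:
pgrep -af st2workflowengine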

fdrab commented 1 year ago

Be aware that sensors die silently after 2 failures and their state is not saved.

What do you mean by failures? During development my sensors died when my script threw an exception, so I wrapped a lot of what poll() does in try/except blocks, so the sensor will just log an error message if something fails. I think I can save state in the datastore if I need to. I can perhaps also add some monitoring to check whether the appropriate sensorcontainer.log hasn't been updated in 20 minutes and do something if that's the case.
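In case it helps others, a compact sketch of that pattern inside a PollingSensor subclass; the datastore key, trigger name, and _query_since() helper are hypothetical:

    def poll(self):
        last_seen = self.sensor_service.get_value("last_seen_sys_id")  # None on first run
        try:
            items = self._query_since(last_seen)  # hypothetical ServiceNow query helper
        except Exception as exc:
            self.sensor_service.get_logger(__name__).error(
                "poll() failed, will retry next interval: %s", exc)
            return
        for item in items:
            self.sensor_service.dispatch(trigger="demo.item_found", payload=item)
        if items:
            # Persist a cursor so a restarted sensor can resume where it left off.
            self.sensor_service.set_value("last_seen_sys_id", items[-1]["sys_id"])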

chain312 commented 1 year ago

@guzzijones This is great, thank you! My use case involves processing of a lot of ServiceNow workgroups and executing a LOT of workflows on tickets, oftentimes simultaneously (due to the polling nature). I've already written a sensor for querying and data processing so that we don't rely on orquesta workflows to do the job, thereby offloading this work to a dedicated sensor process.

However, I'd like to point out that one of the core features of ST2 is the ability to create complex workflows out of simple operations and to use those complex workflows in even more complex flows. Therefore, having to rely on a single st2workflowengine in a non-HA setting, which is relatively easy to slow down to a crawl, seems a bit puzzling. I wish we were able to increase the number of these processes even on a standalone installation, the same as we can with st2actionrunner processes.

I deployed a lot of replicated st2 services with the K8s HA setup, and now there is a performance bottleneck. As an initial optimization I changed a slow MongoDB index, which improved the speed somewhat, but now I don't know where else to look to improve performance.

guzzijones commented 1 year ago

How many st2workflowengine processes do you have running? The workflow engine has to write its parameters, then the action runner writes its parameters. Again, I wouldn't use sensors.
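For reference, checking and raising the workflow engine replica count on the stackstorm-ha chart could look roughly like this; the "st2-" prefix assumes a release named "st2", so verify the actual deployment name in your cluster (and prefer setting the corresponding replicas value in your chart values so it survives upgrades):

kubectl get deployment st2-st2workflowengine
kubectl scale deployment st2-st2workflowengine --replicas=4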

chain312 commented 1 year ago

How many st2workflowengine processes do you have running? The workflow engine has to write its parameters, then the action runner writes its parameters. Again, I wouldn't use sensors.

I have opened a new issue: #6003.