StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.05k stars 745 forks source link

Slow (30sec+) transitions between tasks when using loops (`with`, `items`, `concurrency`) #4882

Open GuiTeK opened 4 years ago

GuiTeK commented 4 years ago

SUMMARY

It seems that when we using the with syntax:

  scan_servers:
    with:
      items: server in <% ctx(servers_to_check) %>
      concurrency: 1
    action: core.local
    input:
      cmd: "echo 'Hello <% item(server).id %>'"

transitions between two tasks are slow (30sec+). For comparison, transitions between two tasks when not using with is 0-2 seconds.

This is a problem for us because with a large number of tasks (e.g., if we want to check 500 servers), StackStorm will spend 500 * 30 = 15 000 seconds over 4 hours just waiting and doing nothing.

Questions:

STACKSTORM VERSION

st2 3.0.1, on Python 2.7.12

OS, environment, install method

Ubuntu 16.04, apt-get install stackstorm (added APT repository https://packagecloud.io/StackStorm/stable/ubuntu)

Steps to reproduce the problem

Here is a workflow we use to reproduce the problem. I replaced the real payload of the servers with fake JSON data from https://jsonplaceholder.typicode.com/.

version: 1.0

tasks:
  retrieve_server_list:
    action: core.local
    input:
      cmd: 'curl https://jsonplaceholder.typicode.com/comments'
    next:
      - when: <% succeeded() %>
        publish:
          - servers_to_check: <% result().stdout %>
        do:
          - scan_servers

  scan_servers:
    with:
      items: server in <% ctx(servers_to_check) %>
      concurrency: 1
    action: core.local
    input:
      cmd: "echo 'Hello <% item(server).id %>'"

Here we can see the (~ 30 seconds) in the logs as well:

2020-03-10 15:19:35,950 140241378099760 INFO workflows [-] [5e67af6b2c5a45070737a8dc] Identifying next set (0) of tasks after completion of task “scan_servers”, route “0".
2020-03-10 15:20:00,839 140241378099760 INFO workflows [-] [5e67af6b2c5a45070737a8dc] Identified the following set of tasks to execute next: scan_servers (route 0)

Expected Results

The transitions between tasks should take the same time as when not using with (0-2 seconds).

Actual Results

The transitions between tasks take 30sec+.

m4dcoder commented 4 years ago

@GuiTeK I tested a simple with-items workflow and I cannot reproduce the issue. I'm testing this on 3.2dev. Can you test the following workflow and see if you see this problem on 3.1?

version: 1.0

tasks:
  task1:
    with:
      items: <% range(10) %>
      concurrency: 1
    action: core.echo message="<% str(item()) %>, resistance is futile!"
GuiTeK commented 4 years ago

Hi @m4dcoder !

Indeed, the problem doesn't arise with your example.

We found the cause of the problem: "large" (> 1 MB but < 10 MB) data transfer between the DB and the worker processes. It seems that for each iteration of the loop, the full ctx(servers_to_check) is fetched from the DB. It shouldn't be needed, but it seems that's what happens.

To fix it, we did two things:

The workflow execution time went down from 17 hours to 2 hours.

EDIT: we are using AWS instance db.r4.large for the DocumentDB and t3.medium for the workers.

guzzijones commented 4 years ago

@GuiTeK could you explain this: We configured coordination to run the workflow on several worker machines.

I am also seeing this slowness in with loops, but I am not sure if it is related.

guzzijones commented 4 years ago

I have posted a custom action to the form that resolves this. Eventually I will submit a pack to the st2 exchange once we have tested it out rigorously.

with items action