Open GuiTeK opened 4 years ago
@GuiTeK I tested a simple with-items workflow and I cannot reproduce the issue. I'm testing this on 3.2dev. Can you test the following workflow and see if you see this problem on 3.1?
version: 1.0
tasks:
  task1:
    with:
      items: <% range(10) %>
      concurrency: 1
    action: core.echo message="<% str(item()) %>, resistance is futile!"
Hi @m4dcoder !
Indeed, the problem doesn't arise with your example.
We found the cause of the problem: "large" (> 1 MB but < 10 MB) data transfers between the DB and the worker processes. It seems that for each iteration of the loop, the full ctx(servers_to_check) is fetched from the DB. It shouldn't be needed, but it seems that's what happens.
To fix it, we did two things:
ctx(servers_to_check): we now retrieve only the IDs, and we fetch the other information we need inside the sub-task/sub-workflow. This makes a lot more HTTP requests, but in the end it's a huge time gain. => Maybe we want to fix this part, @m4dcoder?
The workflow execution time went down from 17 hours to 2 hours.
EDIT: we are using AWS instance db.r4.large for the DocumentDB and t3.medium for the workers.
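To illustrate why shrinking ctx(servers_to_check) to a list of IDs can matter so much, here is a rough sketch. It assumes, as the comment above suspects, that the whole context variable is re-read on every iteration of the loop. The record layout, the `make_server` helper, and the 2000-character padding are hypothetical stand-ins for the real server payload, not anything taken from StackStorm.

```python
import json


def make_server(i):
    # Hypothetical server record; the padded "details" field stands in
    # for the real per-server payload.
    return {"id": i, "hostname": f"server-{i}", "details": "x" * 2000}


def total_transfer(context_size, iterations):
    # Assumption from the thread: the full context variable is fetched
    # again on every iteration of the with-items loop.
    return context_size * iterations


servers_to_check = [make_server(i) for i in range(500)]
server_ids = [s["id"] for s in servers_to_check]

# Serialized size of what would be stored in the workflow context.
full_size = len(json.dumps(servers_to_check))
ids_size = len(json.dumps(server_ids))

full_total = total_transfer(full_size, len(servers_to_check))
ids_total = total_transfer(ids_size, len(server_ids))

print(f"full payload: {full_size} bytes per iteration, {full_total} bytes total")
print(f"IDs only:     {ids_size} bytes per iteration, {ids_total} bytes total")
```

Under that assumption the transferred volume grows with (number of items) x (context size), so keeping only IDs in the context reduces it by several orders of magnitude, at the price of one extra lookup per sub-task.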
@GuiTeK could you explain this: "We configured coordination to run the workflow on several worker machines."
I am also seeing this slowness in with loops, but I am not sure if it is related.
I have posted a custom action to the forum that resolves this. Eventually I will submit a pack to the st2 exchange once we have tested it rigorously.
SUMMARY
It seems that when we use the with syntax, transitions between two tasks are slow (30 sec+). For comparison, transitions between two tasks when not using with take 0-2 seconds.
This is a problem for us because with a large number of tasks (e.g., if we want to check 500 servers), StackStorm will spend 500 * 30 = 15 000 seconds (over 4 hours) just waiting and doing nothing.
Questions:
STACKSTORM VERSION
st2 3.0.1, on Python 2.7.12
OS, environment, install method
Ubuntu 16.04, apt-get install stackstorm (added APT repository https://packagecloud.io/StackStorm/stable/ubuntu)
Steps to reproduce the problem
Here is a workflow we use to reproduce the problem. I replaced the real payload of the servers with fake JSON data from https://jsonplaceholder.typicode.com/.
Here we can see the delay (~30 seconds) in the logs as well:
Expected Results
The transitions between tasks should take the same time as when not using with (0-2 seconds).
Actual Results
The transitions between tasks take 30 sec+.
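The waiting-time estimate from the summary can be reproduced with a quick calculation. This sketch assumes the items run one at a time (concurrency: 1), so the per-transition delays simply add up; `transition_overhead` is a hypothetical helper name used only for illustration.

```python
from datetime import timedelta


def transition_overhead(num_items, seconds_per_transition):
    # Total time spent only on transitions, assuming items are processed
    # sequentially so that the delays accumulate.
    return timedelta(seconds=num_items * seconds_per_transition)


slow = transition_overhead(500, 30)  # observed with the with syntax
fast = transition_overhead(500, 2)   # upper bound observed without it

print(slow)  # 4:10:00
print(fast)  # 0:16:40
```

With 500 servers, the observed 30-second transitions add about 4 hours 10 minutes of idle time, against under 17 minutes at the 2-second upper bound seen without with.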