garadar opened this issue 1 year ago
Are you doing this via re-run of previous runs? Or brand new flow runs?
Most likely due to MongoDB bottlenecking. Please post your workflow so we can go into further detail, though.
Thank you. Here is my workflow:
version: 1.0

description: "Engage the deletion process"

input:
  - username

# Be careful with the filename
vars:
  - tmp_mail_path: '/opt/stackstorm/packs/blablabla/actions/workflows/template'
  - mail_file: '<% ctx(tmp_mail_path) %>/<% ctx(username) %>.mail'

tasks:
  check_key_store:
    action: core.local_sudo
    input:
      cmd: st2 key get deleting_process_<% ctx(username) %>
    next:
      - when: <% failed() %>
        do: get_user_info
      - when: <% succeeded() %>
        do: noop

  # ## start engage deletion normal
  get_user_info:
    action: unige.get_user_info
    input:
      username: <% ctx().username %>
    next:
      - publish:
          - user_info: <% result().output.result %>
        when: <% succeeded() %>
        do:
          - check_lastlog
      - when: <% failed() %>
        publish:
          - user_info: <% dict(user => ctx().username, gecos => "", mail => "", lastlog => "", PIname => "", PIgecos => "", PImail => "", quota => "") %>
        do: generate_mail_deletion_direct

  check_lastlog:
    action: core.local
    input:
      cmd: test <% ctx(user_info).lastlog %> -lt $(date -d "now - 11 month" +%Y%m%d)
    next:
      - when: <% succeeded() %>
        do: generate_mail_deletion_direct

  generate_mail_deletion_direct:
    action: core.local_sudo
    input:
      cmd: |
        sed -e "s/XXX_USER_XXX/<% ctx(user_info).user %>/g" \
            -e "s/XXX_USER_MAIL_XXX/<% ctx(user_info).mail %>/g" \
            -e "s/XXX_USER_NAME_XXX/<% ctx(user_info).gecos %>/g" \
            <% ctx(tmp_mail_path) %>/mail_deletion_first_contact.mail > <% ctx(mail_file) %>.firstcontact
    next:
      - do: send_mail

  send_mail:
    action: core.local
    input:
      cmd: sendmail -vt < <% ctx(mail_file) %>.firstcontact
    next:
      - do: get_current_date

  get_current_date:
    action: core.local
    input:
      cmd: date +%Y%m%d
    next:
      - publish:
          - current_date: <% result().stdout %>
        do:
          - get_deadline

  get_deadline:
    action: core.local
    input:
      cmd: date -d "today +1 month" +%Y%m%d
    next:
      - publish:
          - deadline: <% result().stdout %>
        do:
          - set_key_store

  set_key_store:
    action: st2.kv.set_object
    input:
      key: deleting_process_<% ctx(username) %>
      value: '{"start_time": <% ctx(current_date) %>, "deadline": <% ctx(deadline) %>}'

## End engage deletion
Are you doing this via re-run of previous runs? Or brand new flow runs?
Each time it is a new execution of the workflow.
There is only one workflow engine, so if the context of this workflow starts getting large, that will slow the workflow itself down. I don't see any obviously large JSONs here. How many st2actionrunner processes do you have running? Likely you are saturating them. Add more st2 action runners and see what happens.
Also, you could offload the local_sudo to a remote VM using core.remote. That way your StackStorm server doesn't have to take the load of formatting and sending the email.
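For illustration, a minimal sketch of what that could look like for the send_mail task, assuming a placeholder host name and that the generated mail file is reachable on that host:

```yaml
# Sketch only: "mailhost.example.org" is a placeholder, and this assumes the generated
# .firstcontact file exists on that host (or is also generated there).
send_mail:
  action: core.remote
  input:
    hosts: mailhost.example.org
    cmd: sendmail -vt < <% ctx(mail_file) %>.firstcontact
  next:
    - do: get_current_date
```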
There is only one workflow engine, so if the context of this workflow starts getting large, that will slow the workflow itself down. I don't see any obviously large JSONs here. How many st2actionrunner processes do you have running? Likely you are saturating them. Add more st2 action runners and see what happens.
Isn't this an obvious bottleneck though? I'm doing some performance testing currently. To illustrate, I'm reading ServiceNow workgroups (16 of them in parallel) to look for incidents to process, and I'm further processing the returned JSONs (ranging from a few tickets to JSONs around 0.25 MB in size) by applying various filters and Python scripts to them. What I've noticed is that the workflow engine fully utilizes a single core at 99% while the rest of the system is relatively idle, resulting in a dramatic performance decrease for workflows even though the server is only at about 15% CPU load. This also manifests as a workflow consisting of 4 tasks that take 3 seconds each having a runtime of 25+ seconds due to slow task transitions. I've set the worker count to 30 and also increased the worker thread pool size and action thread pool size.
Current HW setup is a ProLiant DL360 with 16 cores and 64 GB of memory; the instance is not running in a container, and both the instance and MongoDB are on a single server.
Can these issues be alleviated in an HA setup with multiple st2workflowengine containers?
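In an HA / Kubernetes deployment this largely comes down to raising the replica counts of the relevant services. With the stackstorm-ha Helm chart the override would look roughly like the sketch below; the exact key names may differ between chart versions, so verify them against the chart's values.yaml:

```yaml
# Hypothetical values.yaml override for the stackstorm-ha Helm chart -- confirm the
# keys against the chart version you actually deploy.
st2workflowengine:
  replicas: 3    # more engines to handle task transitions in parallel
st2actionrunner:
  replicas: 10   # more runners to execute the actions themselves
```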
@fdrab From what you've described, I suggest you write a custom sensor to handle the polling and data processing and generate event triggers that call the appropriate workflow via a rule match, as sketched below. This would unburden the workflow engine from all the data processing.
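For reference, the sensor-to-workflow wiring is just a rule. A minimal sketch, with all pack, trigger, and workflow names being hypothetical placeholders:

```yaml
---
# All names below are placeholders -- substitute your own pack, trigger, and workflow.
name: "on_incident_run_workflow"
pack: "snow_custom"
description: "Run the processing workflow whenever the sensor emits an incident trigger."
enabled: true
trigger:
  type: "snow_custom.incident_ready"    # trigger registered and emitted by the custom sensor
criteria: {}                            # optionally filter on trigger payload fields here
action:
  ref: "snow_custom.process_incident"   # the orquesta workflow to execute
  parameters:
    incident_number: "{{ trigger.incident_number }}"
```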
So it's not recommended to run several workflows by hand? Should we run them via sensor => rule => execution instead?
I am working on a PR in Orquesta to remove all the deepcopy calls. We are also working to add zstandard compression and remove redundant action data. Likely you are hitting the maximum that a single workflow engine can handle.
@guzzijones This is great, thank you! My use case involves processing of a lot of ServiceNow workgroups and executing a LOT of workflows on tickets, oftentimes simultaneously (due to the polling nature). I've already written a sensor for querying and data processing so that we don't rely on orquesta workflows to do the job, thereby offloading this work to a dedicated sensor process.
However, I'd like to point out that one of the core features of ST2 is the ability to create complex workflows out of simple operations and to use those complex workflows in even more complex flows. Therefore, having to rely on a single st2workflowengine in a non-HA setting, which is relatively easily slowed to a crawl, seems a bit puzzling. I wish we were able to increase the number of these processes even on a standalone installation, the same as we can with st2actionrunner processes.
Be aware that sensors die silently after 2 failures and their state is not saved.
You could copy the systemd script to add more workflow engines.
Be aware that sensors die silently after 2 failures and their state is not saved.
What do you mean by failures? During development my sensors died when my script threw an exception, so I wrapped a lot of what poll() does in try/except blocks; now the sensor just logs an error message if something fails. I can save state in the datastore if I need to, I think. I could perhaps also add some monitoring that checks whether the appropriate sensorcontainer.log hasn't been updated in 20 minutes and acts on that.
I deployed a lot of replicated st2 services with K8s HA, and there is now a performance bottleneck. As an initial optimization I fixed a slow MongoDB index, which improved speed somewhat, but now I don't know where else to gain performance.
How many st2workflowengine processes do you have running? The workflow engine has to write its parameters, then the action runner writes its parameters. Again, I wouldn't use sensors.
How many st2workflowengine processes do you have running? The workflow engine has to write its parameters, then the action runner writes its parameters. Again, I wouldn't use sensors.

I created a new issue: #6003
Issue Description
When I launch multiple workflows at the same time in StackStorm, I've noticed that the execution time of my workflows increases significantly.
STACKSTORM VERSION
st2 3.7.0, on Python 3.6.8
OS, environment, install method
Steps to Reproduce
1. Launch multiple workflows at the same time in StackStorm.
2. Observe the execution time of each workflow.
Expected Behavior
The execution time of each workflow should remain constant, regardless of how many workflows are launched simultaneously.
Actual Behavior
The execution time of each workflow increases as more workflows are launched simultaneously.
Here is an example of multiple launches of the same workflow: the exec time is almost 25 sec, versus about 2 sec (due to a cancel check) for normal behavior. We can see that the more workflows I run, the more the exec time increases.
The issue seems to be that the workflow actions are getting stuck in a "pending" state, and we are not seeing any CPU overload in htop.