nmaludy opened this issue 6 years ago
Maybe this has something to do with it? https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-comapred-to-native-pymongo-usage
I'm sure the parent workflow is very large, with all of the results stored in one big workflow object.
And here I was wondering what I was going to do in the morning. ;)
I am also hitting this issue with hundreds of items in a list. Let me know if there is anything I can provide to help troubleshoot. @nmaludy was kind enough to share his strategy for dealing with this, but I would love to get a bump on getting a fix into st2.
Did you encounter this issue when hitting that API endpoint using the WebUI (or some other method)?
https://github.com/StackStorm/st2/pull/4300 should help with WebUI performance a lot once that PR is merged and the WebUI starts utilizing it (should be in the next release).
Besides that, I hope to have a look at this exact issue when I get a chance (likely late next week).
@Kami I hit this issue by running a Mistral workflow with a lot of with-items.
This isn't so much a "UI rendering is slow" issue as a "my workflow slows down as the number of executions increases" issue. We use with-items to handle deployment, so with 5-10 items running subworkflows we have great performance; with 50 items we have bad performance :-)
Here is a good quantification of the issue. Notice how, in a 50-item with-items execution, each iteration takes longer and longer; it gets worse at 100, 200, and so on. As @nmaludy said, the st2api calls can be seen increasing in latency in the logs. The larger the task output in the workflow, the longer it takes.
st2 execution get 5b8442b6cd0fbf5236e3ad00 | grep nested_workflow_1
result_task: nested_workflow_1
| + 5b8442b7cd0fbf5236e3ad03 | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:28:07 UTC |
| + 5b8442c4cd0fbf5236e3ad17 | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:28:20 UTC |
| + 5b8442d2cd0fbf5236e3ad2b | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:28:34 UTC |
| + 5b8442dfcd0fbf5236e3ad3f | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:28:47 UTC |
| + 5b8442eccd0fbf5236e3ad53 | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:29:00 UTC |
| + 5b8442fbcd0fbf5236e3ad67 | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:29:15 UTC |
| + 5b844308cd0fbf5236e3ad7b | succeeded (13s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:29:28 UTC |
| + 5b844316cd0fbf5236e3ad8f | succeeded (13s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:29:42 UTC |
| + 5b844325cd0fbf5236e3ada3 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:29:56 UTC |
| + 5b844334cd0fbf5236e3adb7 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:30:12 UTC |
| + 5b844344cd0fbf5236e3adcb | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:30:28 UTC |
| + 5b844353cd0fbf5236e3addf | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:30:43 UTC |
| + 5b844363cd0fbf5236e3adf3 | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:30:59 UTC |
| + 5b844373cd0fbf5236e3ae07 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:31:14 UTC |
| + 5b844382cd0fbf5236e3ae1b | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:31:30 UTC |
| + 5b844391cd0fbf5236e3ae2f | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:31:45 UTC |
| + 5b8443a1cd0fbf5236e3ae43 | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:32:01 UTC |
| + 5b8443b0cd0fbf5236e3ae57 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:32:16 UTC |
| + 5b8443c0cd0fbf5236e3ae6b | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:32:32 UTC |
| + 5b8443cfcd0fbf5236e3ae7f | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:32:47 UTC |
| + 5b8443e0cd0fbf5236e3ae93 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:33:04 UTC |
| + 5b8443f0cd0fbf5236e3aea7 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:33:20 UTC |
| + 5b844400cd0fbf5236e3aebb | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:33:36 UTC |
| + 5b84440fcd0fbf5236e3aecf | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:33:51 UTC |
| + 5b84441fcd0fbf5236e3aee3 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:34:06 UTC |
| + 5b84442ecd0fbf5236e3aef7 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:34:22 UTC |
| + 5b84443fcd0fbf5236e3af0b | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:34:38 UTC |
| + 5b84444ecd0fbf5236e3af1f | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:34:54 UTC |
| + 5b84445dcd0fbf5236e3af33 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:35:09 UTC |
| + 5b84446dcd0fbf5236e3af47 | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:35:25 UTC |
| + 5b84447dcd0fbf5236e3af5b | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:35:40 UTC |
| + 5b84448dcd0fbf5236e3af6f | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:35:57 UTC |
| + 5b84449dcd0fbf5236e3af83 | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:36:13 UTC |
| + 5b8444aecd0fbf5236e3af97 | succeeded (19s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:36:29 UTC |
| + 5b8444c1cd0fbf5236e3afab | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:36:49 UTC |
| + 5b8444d2cd0fbf5236e3afbf | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:37:06 UTC |
| + 5b8444e4cd0fbf5236e3afd3 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:37:23 UTC |
| + 5b8444f5cd0fbf5236e3afe7 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:37:41 UTC |
| + 5b844507cd0fbf5236e3affb | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:37:59 UTC |
| + 5b844518cd0fbf5236e3b00f | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:38:15 UTC |
| + 5b844529cd0fbf5236e3b023 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:38:33 UTC |
| + 5b84453bcd0fbf5236e3b037 | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:38:51 UTC |
| + 5b84454ccd0fbf5236e3b04b | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:39:08 UTC |
| + 5b84455ecd0fbf5236e3b05f | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:39:26 UTC |
| + 5b844570cd0fbf5236e3b073 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:39:43 UTC |
| + 5b844581cd0fbf5236e3b087 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:40:01 UTC |
| + 5b844593cd0fbf5236e3b09b | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:40:19 UTC |
| + 5b8445a6cd0fbf5236e3b0af | succeeded (18s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:40:37 UTC |
| + 5b8445b8cd0fbf5236e3b0c3 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:40:56 UTC |
| + 5b8445cacd0fbf5236e3b0d7 | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja- | Mon, 27 Aug 2018 18:41:14 UTC |
root@8ade7d214c04:/tmp/st2/contrib/examples/actions/workflows#
I'll try to reproduce this issue today.
The weird thing is that the POST operation takes a long time - "[6397] [INFO] [POST] URI /v1/actionexecutions Total time: 100.877137899".
The POST (schedule action execution) operation is quite simple, so there is no reason it should take that long (it inserts a couple of objects into the database and publishes a message on the message bus).
It's likely that some other API operation (e.g. GET) is bogging it down, or that there is some weird contention going on on the message bus or similar.
After some digging in - I wonder if the issue lies somewhere else and not in the st2api.
If I run the workflow above, which uses Mistral to run a lot of actions in a short time-frame, I notice high CPU utilization by mistral-server and, to a lesser extent, st2rulesengine.
st2rulesengine utilization is expected, because each execution results in at least one trigger instance.
So could this issue simply be due to resource exhaustion caused by mistral-server hogging most of the available CPU?
@nmaludy @jbrownsc
I tried to replicate this issue on a single server setup.
After some more instrumentation and digging in, I noticed the same behavior as reported above.
When scheduling a lot of executions in a short time-frame through Mistral workflow, mistral-server starts hogging a lot of CPU and as such, everything else (including st2api) slows down due to the lack of resources.
10-20 tasks work relatively fine, but as soon as I move above 30 or so, things start to get very slow (and I noticed very high server CPU utilization, i.e. the server basically being bogged down). Things get even worse if I change the with-items action from core.local to some Mistral workflow action, because it causes additional load on Mistral.
If I concurrently schedule a lot of runs of the core.local action directly, I don't see this issue, which seems to back up the claim that Mistral is hogging the CPU.
In short - I don't think there is any quick or easy code change we can make at the moment. The root cause seems to be related to inefficiencies in Mistral, and there is little we can do besides advising users to dedicate more resources to Mistral and the other StackStorm components (ideally running StackStorm services across multiple correctly sized servers).
One thing which may also help (but I didn't have time to verify it) is splitting one large Mistral workflow with a lot of child tasks / actions into multiple Mistral workflows with fewer child tasks.
Going forward, we will ensure with-items with a lot of items (e.g. scheduling a lot of actions through Orquesta workflow) has good performance in Orquesta, our new workflow engine.
@nmaludy Are you running all of the StackStorm components on a single server?
@m4dcoder See above.
One of our "benchmarks" for Orquesta should also be an Orquesta workflow with a lot of with-items operations, verifying that it has good / reasonable performance.
I don't think this is with-items only. This seems like a known issue with the scheduler in Mistral. If you are executing a lot of concurrent workflows through Mistral, Mistral will try to schedule all the workflows and tasks at once. As tasks are completed, it will schedule the next tasks immediately. Soon there are a lot of tasks waiting to be executed, and each workflow takes longer and longer to complete because there are not enough st2 action runners to execute them. At this point in the lifecycle for Mistral, since it's going to be deprecated soon, we don't intend to put a lot of time into fixing the scheduler. However, @Kami we have a line item on our roadmap to address this in Orquesta, our new workflow engine.
The workaround on Mistral here is to define a concurrency policy in st2 @ https://docs.stackstorm.com/reference/policies.html#concurrency for the workflow being executed by the with-items task. Also, please note that if you average the completion time of the workflow executions in the with-items, the average time to completion is more or less the same. The same amount of work (or tasks) gets completed over time as a function of the system resources and the number of st2 action runners.
@m4dcoder I have tested this with a concurrency policy, and the same issue exists. I'm OK with saying this is a won't-fix for Mistral, but the core issue of st2api being slow seems like it will exist with Orquesta too, right?
@Kami with regards to the performance of st2api, the POST calls to st2api should not be taking 3-5 seconds, correct?
2018-09-04 19:18:40,894 140660106998480 INFO logging [-] 08eb4db0-523f-4d10-a2f7-ee6c6ffe6e66 - 201 3988 4573.698ms (content_length=3988,request_id='08eb4db0-523f-4d10-a2f7-ee6c6ffe6e66',runtime=4573.698,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:43,374 140660124260944 INFO logging [-] 7a137ae4-47ee-44a9-b64e-f0c7cc12dc1e - 201 3988 4626.373ms (content_length=3988,request_id='7a137ae4-47ee-44a9-b64e-f0c7cc12dc1e',runtime=4626.373,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,462 140660107001360 INFO logging [-] 7a121281-20a5-4071-9440-6e9e6eda5d19 - 201 3988 6002.316ms (content_length=3988,request_id='7a121281-20a5-4071-9440-6e9e6eda5d19',runtime=6002.316,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,491 140660106999760 INFO logging [-] 10c27e29-7ea8-4279-b02f-ef933a6883ef - 201 3988 6562.452ms (content_length=3988,request_id='10c27e29-7ea8-4279-b02f-ef933a6883ef',runtime=6562.452,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,634 140660124261104 INFO logging [-] 444e88f0-8b5e-474f-a5a2-cac5961689ec - 201 3988 5301.957ms (content_length=3988,request_id='444e88f0-8b5e-474f-a5a2-cac5961689ec',runtime=5301.957,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,720 140660106998480 INFO logging [-] 527f6af9-9bed-450a-ad8e-f8009376fb68 - 201 3988 3363.479ms (content_length=3988,request_id='527f6af9-9bed-450a-ad8e-f8009376fb68',runtime=3363.479,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:48,806 140660107001040 INFO logging [-] 789239ac-277f-47ab-87fb-60289625e112 - 201 3988 4801.581ms (content_length=3988,request_id='789239ac-277f-47ab-87fb-60289625e112',runtime=4801.581,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,820 140660106999920 INFO logging [-] 654f3164-5860-4406-932b-d659edd18fba - 201 3988 5800.771ms (content_length=3988,request_id='654f3164-5860-4406-932b-d659edd18fba',runtime=5800.771,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,855 140660124260944 INFO logging [-] f4e36f80-3762-4bf3-8363-1ea9e08f5e8f - 201 3988 5836.541ms (content_length=3988,request_id='f4e36f80-3762-4bf3-8363-1ea9e08f5e8f',runtime=5836.541,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,999 140660100791344 INFO logging [-] 12a4ff02-3a55-471b-988e-98bf72b56ef8 - 201 3988 4796.937ms (content_length=3988,request_id='12a4ff02-3a55-471b-988e-98bf72b56ef8',runtime=4796.937,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:50,028 140660100791664 INFO logging [-] 5e0d0812-7911-4b77-938d-dec01df0a7f7 - 201 3988 4824.118ms (content_length=3988,request_id='5e0d0812-7911-4b77-938d-dec01df0a7f7',runtime=4824.118,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:53,794 140660106998000 INFO logging [-] be911c6a-e888-4a8f-95e5-53146d983ee6 - 201 3988 3271.721ms (content_length=3988,request_id='be911c6a-e888-4a8f-95e5-53146d983ee6',runtime=3271.721,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:54,913 140660100788624 INFO logging [-] 9258e4bc-d473-42c4-b637-90c94ceb104e - 201 3988 5142.16ms (content_length=3988,request_id='9258e4bc-d473-42c4-b637-90c94ceb104e',runtime=5142.16,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:55,988 140660100791504 INFO logging [-] bdc16a90-3dc6-4719-a930-5bc260d72337 - 201 3988 6218.387ms (content_length=3988,request_id='bdc16a90-3dc6-4719-a930-5bc260d72337',runtime=6218.387,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:56,035 140660107000240 INFO logging [-] 9785f5fb-493d-44d6-910b-12c6aeaa0e4c - 201 3988 5516.225ms (content_length=3988,request_id='9785f5fb-493d-44d6-910b-12c6aeaa0e4c',runtime=5516.225,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
@jbrownsc I'm only commenting on the performance relating to Mistral workflows and with-items. I'll let @Kami take the question on st2api performance. As for trying it out w/ a concurrency policy, what's the actual policy spec?
EDITED... WORKFLOW REPLACED WITH https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122
@jbrownsc we don't support concurrency policies on an inline subworkflow (i.e. a subworkflow in a Mistral workbook). Try to make another_workflow into a separate st2 action, then write a concurrency policy on another_workflow that limits it to 5 executions at a time as a baseline. The concurrency attribute defined within the Mistral workbook only has limited scope. If you use the st2 concurrency policy (a separate policy file), it will limit the concurrency on that action system-wide. Anyway, you can give that a try - a rough sketch is shown below.
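For illustration only, such a policy might look roughly like this. This is a sketch, not something from this thread: the pack name, action ref, and file path are placeholders, and the exact st2ctl flag may differ between versions.

```shell
# Hypothetical concurrency policy for the extracted subworkflow action.
# Pack name, action ref, and path are placeholders.
cat > /opt/stackstorm/packs/examples/policies/another_workflow_concurrency.yaml <<'EOF'
---
name: another_workflow.concurrency
description: Limit concurrent executions of the extracted subworkflow action.
enabled: true
resource_ref: examples.another_workflow
policy_type: action.concurrency
parameters:
  threshold: 5
EOF

# Register the new policy (assuming this st2ctl flag is available on your version).
st2ctl reload --register-policies
```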
Here is the new workflow.
with-items-test.yaml
---
name: "with-items-test"
description: "For testing things in Mistral"
runner_type: mistral-v2
entry_point: workflows/with-items-test.yaml
enabled: true
parameters:
  workflow:
    default: voltron.with-items-test.main
with-items-test-subworkflow.yaml
---
name: "with-items-test-subworkflow"
description: "For testing things in Mistral"
runner_type: mistral-v2
entry_point: workflows/with-items-test-subworkflow.yaml
enabled: true
parameters:
  workflow:
    default: voltron.with-items-test-subworkflow.main
workflows/with-items-test.yaml
version: '2.0'
name: 'voltron.with-items-test'
description: Noop
workflows:
  main:
    type: direct
    tasks:
      with_items_test:
        with-items: "blah in {{ range(0,250) }}"
        concurrency: 5
        action: voltron.with-items-test-subworkflow
workflows/with-items-test-subworkflow.yaml
version: '2.0'
name: 'voltron.with-items-test-subworkflow'
description: Easy-peasy subworkflow
workflows:
  main:
    type: direct
    tasks:
      test_1:
        action: core.local
        input:
          cmd: echo test_1
        on-success:
          - test_2
      test_2:
        action: core.local
        input:
          cmd: echo test_2
        on-success:
          - test_3
      test_3:
        action: core.local
        input:
          cmd: echo test_3
        on-success:
          - test_4
      test_4:
        action: core.local
        input:
          cmd: echo test_4
        on-success:
          - test_5
      test_5:
        action: core.local
        input:
          cmd: echo test_5
From tcpdump
Host: 127.0.0.1:9101
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.19.1
st2-context: {"parent": {"parent": {"user": "admin", "execution_id": "5b8ee0b6fa214313ae6b77ea", "pack": "voltron"}, "pack": "voltron", "user": "admin", "execution_id": "5b8ee2cafa214313ae6b7cfa", "mistral": {"action_execution_id": "35197061-55a2-483e-bd3b-2bf6e5a9d922", "workflow_name": "voltron.with-items-test.main", "task_execution_id": "2df8478b-fe37-4e5a-98ee-132f1e69093b", "workflow_execution_id": "653d5d03-a769-491d-9416-8e037bf095de", "task_name": "with_items_test", "callback_url": "/v2/action_executions/35197061-55a2-483e-bd3b-2bf6e5a9d922"}}, "mistral": {"action_execution_id": "aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2", "workflow_name": "voltron.with-items-test-subworkflow.main", "task_execution_id": "dbf5f312-6674-4c0b-8b6e-793ec2695ae4", "workflow_execution_id": "7eb62944-0900-4d0b-9473-0bfec9307d31", "task_name": "test_5", "callback_url": "/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2"}}
Content-Type: application/json
Content-Length: 202
{"action": "core.local", "callback": {"url": "http://0.0.0.0:8989/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2", "source": "mistral_v2"}, "notify": {}, "parameters": {"cmd": "echo test_5"}}
HTTP/1.1 201 Created
Server: gunicorn/19.8.1
Date: Tue, 04 Sep 2018 19:54:19 GMT
Connection: keep-alive
Content-Type: application/json
Content-Length: 4637
Set-Cookie: auth-token=03c0a58ef94a4330b344bdfa3e9061fb; Max-Age=86376; Path=/; expires=Wed, 05-Sep-2018 19:53:49 GMT; HttpOnly
Access-Control-Allow-Origin: http://127.0.0.1:3000
Access-Control-Allow-Methods: GET,POST,PUT,DELETE,OPTIONS
Access-Control-Allow-Headers: Content-Type,Authorization,X-Auth-Token,St2-Api-Key,X-Request-ID
Access-Control-Allow-Credentials: true
Access-Control-Expose-Headers: Content-Type,X-Limit,X-Total-Count,X-Request-ID
X-Request-ID: c701ac2e-fa42-49ec-a0af-5238d8180b19
{
"status": "requested",
"start_timestamp": "2018-09-04T19:54:13.433146Z",
"parent": "5b8ee2cafa214313ae6b7cfa",
"parameters": {
"cmd": "echo test_5"
},
"runner": {
"runner_module": "local_shell_command_runner",
"uid": "runner_type:local-shell-cmd",
"runner_package": "local_runner",
"description": "A runner to execute local actions as a fixed user.",
"enabled": true,
"runner_parameters": {
"sudo": {
"default": false,
"type": "boolean",
"description": "The command will be executed with sudo."
},
"sudo_password": {
"default": null,
"secret": true,
"required": false,
"type": "string",
"description": "Sudo password. To be used when paswordless sudo is not allowed."
},
"env": {
"type": "object",
"description": "Environment variables which will be available to the command(e.g. key1=val1,key2=val2)"
},
"cmd": {
"type": "string",
"description": "Arbitrary Linux command to be executed on the host."
},
"kwarg_op": {
"default": "--",
"type": "string",
"description": "Operator to use in front of keyword args i.e. \"--\" or \"-\"."
},
"timeout": {
"default": 60,
"type": "integer",
"description": "Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."
},
"cwd": {
"type": "string",
"description": "Working directory where the command will be executed in"
}
},
"id": "5b610d7ab2a78503a0c1e95a",
"name": "local-shell-cmd"
},
"web_url": "https://2f7ce1346250/#/history/5b8ee2e8fa214313ae6b7d29/general",
"context": {
"pack": "core",
"user": "admin",
"parent": {
"parent": {
"user": "admin",
"execution_id": "5b8ee0b6fa214313ae6b77ea",
"pack": "voltron"
},
"pack": "voltron",
"user": "admin",
"execution_id": "5b8ee2cafa214313ae6b7cfa",
"mistral": {
"action_execution_id": "35197061-55a2-483e-bd3b-2bf6e5a9d922",
"workflow_name": "voltron.with-items-test.main",
"task_execution_id": "2df8478b-fe37-4e5a-98ee-132f1e69093b",
"workflow_execution_id": "653d5d03-a769-491d-9416-8e037bf095de",
"task_name": "with_items_test",
"callback_url": "/v2/action_executions/35197061-55a2-483e-bd3b-2bf6e5a9d922"
}
},
"mistral": {
"action_execution_id": "aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2",
"workflow_name": "voltron.with-items-test-subworkflow.main",
"task_execution_id": "dbf5f312-6674-4c0b-8b6e-793ec2695ae4",
"workflow_execution_id": "7eb62944-0900-4d0b-9473-0bfec9307d31",
"task_name": "test_5",
"callback_url": "/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2"
}
},
"action": {
"description": "Action that executes an arbitrary Linux command on the localhost.",
"runner_type": "local-shell-cmd",
"tags": [],
"enabled": true,
"pack": "core",
"entry_point": "",
"notify": {},
"uid": "action:core:local",
"parameters": {
"cmd": {
"required": true,
"type": "string",
"description": "Arbitrary Linux command to be executed on the local host."
},
"sudo": {
"immutable": true
}
},
"ref": "core.local",
"id": "5b610d7ab2a78503a0c1e963",
"name": "local"
},
"liveaction": {
"runner_info": {},
"parameters": {
"cmd": "echo test_5"
},
"action_is_workflow": false,
"callback": {
"url": "http://0.0.0.0:8989/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2",
"source": "mistral_v2"
},
"action": "core.local",
"id": "5b8ee2e5fa214313ae6b7d25"
},
"id": "5b8ee2e8fa214313ae6b7d29",
"log": [
{
"status": "requested",
"timestamp": "2018-09-04T19:54:14.544513Z"
}
]
}
I turned the MongoDB profiling level up to 2 (https://docs.mongodb.com/manual/reference/method/db.setProfilingLevel/) and I see a lot of createIndexes calls as well. When I cancel the execution for the workflow above, those calls go away. So I'm guessing st2api is making those calls to Mongo.
2018-09-04T20:15:00.017+0000 I COMMAND [conn552] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.021+0000 I COMMAND [conn31] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "trace_tag_1", key: { trace_tag: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.021+0000 I COMMAND [conn551] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "start_timestamp_1", key: { start_timestamp: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.022+0000 I COMMAND [conn372] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "action_executions.object_id_1", key: { action_executions.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.023+0000 I COMMAND [conn568] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "trigger_instances.object_id_1", key: { trigger_instances.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.024+0000 I COMMAND [conn369] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "rules.object_id_1", key: { rules.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.025+0000 I COMMAND [conn552] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "start_timestamp_-1_trace_tag_1", key: { start_timestamp: -1, trace_tag: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.032+0000 I COMMAND [conn31] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_1", key: { occurrence_time: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.034+0000 I COMMAND [conn551] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "trigger_1", key: { trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.038+0000 I COMMAND [conn372] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_-1_trigger_1", key: { occurrence_time: -1, trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.040+0000 I COMMAND [conn568] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.052+0000 I COMMAND [conn551] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_1", key: { occurrence_time: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.053+0000 I COMMAND [conn372] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "trigger_1", key: { trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.053+0000 I COMMAND [conn568] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_-1_trigger_1", key: { occurrence_time: -1, trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.054+0000 I COMMAND [conn369] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
Also looking at the MongoDB query log, I see about 399 queries in 1 second.
cat /tmp/mongo_queries_1_second | perl -ne 'print "$1\n" if /command: (.*?)\s/' | sort | uniq -c | sort -rn
188 createIndexes
160 find
17 update
11 insert
4 count
1 delete
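For anyone wanting to reproduce the profiling above, something along these lines should work (a sketch only; it assumes a local mongod and the default st2 database name):

```shell
# Profile every operation (level 2), then inspect the most recent profiler entries.
mongo st2 --eval 'db.setProfilingLevel(2)'
mongo st2 --eval 'db.system.profile.find().sort({ ts: -1 }).limit(10).forEach(printjson)'
```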
I still think @nmaludy pointed out the DictField serialization issue: https://github.com/MongoEngine/mongoengine/issues/1230. The comment at https://github.com/MongoEngine/mongoengine/issues/1230#issuecomment-316308308 also encapsulates the index issue.
What does the server and service load look like (CPU and IO usage) when you experience those issues?
I was able to reproduce slow API responses the other day, but that was due to mistral-server hogging most of the CPU (i.e. the server being overloaded) and nothing in particular to do with st2api.
Again, I would need more information (specifically server and service CPU usage) so we can get a clearer picture of what is going on.
When I've seen this happen, st2api is consuming 1 full CPU (100%), mistral-server is consuming most of a CPU (75-100%, back and forth), and the rest of the box has minimal usage. These hosts have 8 vCPUs and 16 GB of memory, so there are plenty of other resources on the box for other services to consume, including the kernel.
Are all of those services running on the same box? Did you try pinning the st2api and mistral-server processes to different cores / CPUs? Both services are single-threaded.
On a related note - what does MongoDB resource utilization look like? As mentioned before, the "schedule execution" API operation is relatively simple; it does the following:
None of those operations are particularly complex. Only 1) and 2) could get somewhat slower if you are doing something crazy with the config / action parameters. So it's unlikely that the operation itself is the culprit.
It looks like I will need to try to reproduce this again on a server with more resources. Last time I reproduced it, mistral-server was the bottleneck and the cause of slow st2api responses and high CPU usage.
On a related note - I'm also working on some improvements to our metrics framework (a lot of those changes have already been made and are targeted for the v2.9.0 release).
It will allow StackStorm services to easily send CPU and memory usage to the metrics backend (statsd at the moment). This should make troubleshooting issues like this much easier in the future.
st2api pegs 1 CPU. Mongo, Postgres, Redis, Rabbitmq all seem like they're doing something, but nothing major. Disk access is around 60/100 write iops on Mongo.
ST2 Process Usage
ps auxf | sort -rnk 3 | head -n5
root 9595 36.3 0.8 201224 99036 pts/5 R+ 14:56 4:33 \_ /opt/stackstorm/st2/bin/python /opt/stackstorm/st2/bin/gunicorn st2api.wsgi:application -k eventlet -b 127.0.0.1:9101 --workers 1 --threads 1 --graceful-timeout 10 --timeout 30
mistral 396 3.4 1.0 295400 132836 ? Ss Sep03 93:22 /opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/mistral-server --server engine,executor,notifier --config-file /etc/mistral/mistral.conf --log-file /var/log/mistral/mistral-server.log
st2 52 3.3 0.7 191560 93488 ? Ss Sep03 89:19 /opt/stackstorm/st2/bin/python /opt/stackstorm/st2/bin/st2notifier --config-file /etc/st2/st2.conf
root 5120 1.8 0.6 106688 85376 ? Ss 09:01 6:57 tmux -2 -f /usr/share/byobu/profiles/tmuxrc new-session -n - /usr/bin/byobu-shell
mistral 9592 1.5 0.7 259672 96236 ? S 14:56 0:11 \_ /opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 2 mistral.api.wsgi --graceful-timeout 10
Mongo IO
avg-cpu: %user %nice %system %iowait %steal %idle
44.49 0.00 6.53 0.20 0.00 48.78
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
scd1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 82.00 0.00 54.00 0.00 616.00 22.81 0.01 0.19 0.00 0.19 0.19 1.00
avg-cpu: %user %nice %system %iowait %steal %idle
42.39 0.00 11.11 0.21 0.00 46.30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
scd1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 118.00 0.00 83.00 0.00 924.00 22.27 0.02 0.24 0.00 0.24 0.24 2.00
avg-cpu: %user %nice %system %iowait %steal %idle
41.53 0.00 6.05 0.20 0.00 52.22
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
scd1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 58.00 0.00 54.00 0.00 512.00 18.96 0.00 0.00 0.00 0.00 0.00 0.00
Mongodb CPU
ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mongodb 1 1.7 3.0 1340620 368672 ? Ssl Sep03 47:36 mongod
RabbitMQ
ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 28876 0.0 0.0 18204 3192 pts/0 Ss 09:31 0:00 bash
root 33734 0.0 0.0 36632 2764 pts/0 R+ 15:11 0:00 \_ ps auxf
rabbitmq 1 0.0 0.0 4288 1576 ? Ss Sep03 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 182 0.0 0.0 30540 248 ? S Sep03 0:02 /usr/lib/erlang/erts-8.2.1/bin/epmd -daemon
rabbitmq 314 1.6 1.5 4454428 184392 ? Sl Sep03 44:06 /usr/lib/erlang/erts-8.2.1/bin/beam.smp -W w -A 80 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -K true -B i
rabbitmq 429 0.0 0.0 4180 1176 ? Ss Sep03 0:08 \_ erl_child_setup 1048576
rabbitmq 444 0.0 0.0 11424 960 ? Ss Sep03 0:00 \_ inet_gethost 4
rabbitmq 445 0.0 0.0 13544 1508 ? S Sep03 0:00 \_ inet_gethost 4
Postgresql
ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2768 1.2 0.0 19868 3704 pts/0 Ss 15:12 0:00 bash
root 2777 0.0 0.0 38304 3336 pts/0 R+ 15:12 0:00 \_ ps auxf
postgres 1 0.0 0.1 287492 23700 ? Ss Sep03 0:07 postgres
postgres 26 0.0 0.8 287612 104888 ? Ss Sep03 0:07 postgres: checkpointer process
postgres 27 0.0 0.0 287492 3796 ? Ss Sep03 0:05 postgres: writer process
postgres 28 0.0 0.0 287492 8464 ? Ss Sep03 0:07 postgres: wal writer process
postgres 29 0.0 0.0 287892 6884 ? Ss Sep03 0:02 postgres: autovacuum launcher process
postgres 30 0.0 0.0 142624 4540 ? Ss Sep03 0:34 postgres: stats collector process
postgres 2660 0.1 0.9 289584 119612 ? Ss 07:12 0:57 postgres: mistral-user mistral 192.168.208.6(44226) idle
postgres 2661 0.1 0.9 289796 119636 ? Ss 07:12 0:51 postgres: mistral-user mistral 192.168.208.6(44228) idle
postgres 2662 0.1 0.9 289556 119416 ? Ss 07:12 0:51 postgres: mistral-user mistral 192.168.208.6(44248) idle in transaction
postgres 2663 0.1 0.9 289528 119684 ? Ss 07:12 0:55 postgres: mistral-user mistral 192.168.208.6(44254) idle
postgres 2755 0.8 0.8 289428 106000 ? Ss 14:51 0:10 postgres: mistral-user mistral 192.168.208.6(35828) idle
postgres 2756 0.7 0.8 289432 105376 ? Ss 14:51 0:09 postgres: mistral-user mistral 192.168.208.6(35848) idle
postgres 2765 0.0 0.1 289316 14092 ? Ss 14:56 0:00 postgres: mistral-user mistral 192.168.208.6(36606) idle
postgres 2766 0.1 0.1 289324 14092 ? Ss 14:56 0:01 postgres: mistral-user mistral 192.168.208.6(36608) idle
postgres 2767 0.0 0.1 289192 14544 ? Ss 15:04 0:00 postgres: mistral-user mistral 192.168.208.6(38164) idle
We have provided a pretty usable test case in https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122
On a related note - which version of StackStorm are you using? I tested it with the latest version (Mistral callbacks are used and not the query service).
In any case, I / we need to approach this issue more systematically. There are a lot of potential causes mentioned in this issue, and some of them could be unrelated.
The first step should be trying to replicate the st2api execution-scheduling issue while taking Mistral out of the loop (using apachebench directly on POST /v1/executions - I will look into that).
It's unlikely that the mongoengine issue mentioned above (the large-dict issue) affects this, because, as mentioned, the POST to /v1/executions doesn't include any large data and only a small, simple object is inserted into the database.
I was able to reproduce the POST /v1/executions API endpoint being slow under multiple concurrent HTTP requests using apachebench.
There is a lot going on in that controller, so it will be a slow process to find the root cause (it could be a lot of small, short blocking operations, such as rendering live params, blocking the main eventlet loop).
From what I've seen, the "get" queries in that API endpoint don't seem to be an issue. There are quite a few of them, but all of them return in under a millisecond and use an index, and none of them return a lot of data.
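For reference, reproducing the concurrent POSTs outside of Mistral can be done with something like the following (a sketch; the payload is a simplified version of the core.local request captured earlier in this thread, and ST2_AUTH_TOKEN is a placeholder you need to obtain first, e.g. via st2 auth):

```shell
# Simplified schedule-execution payload (placeholder values).
cat > /tmp/execution.json <<'EOF'
{"action": "core.local", "parameters": {"cmd": "echo test"}}
EOF

# 200 requests, 10 at a time, straight at the st2api endpoint.
ab -n 200 -c 10 \
   -T 'application/json' \
   -H "X-Auth-Token: ${ST2_AUTH_TOKEN}" \
   -p /tmp/execution.json \
   http://127.0.0.1:9101/v1/executions
```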
@Kami Thanks for looking further. Perhaps the payload of the POST is not the issue, but the concurrency, as you mentioned. I don't know how st2api > mongoengine > mongo handles the createIndexes operations, but I can only assume they are causing some form of blocking. Is there any way we can run your apachebench test with "auto_create_index": False?
Version
root@bb05c9f45aac:/# st2 --version
st2 2.8.1, on Python 2.7.6
Some more updates - I added a lot of instrumentation, and most of the operations performed inside that API controller (including the Mongo ones) finish relatively fast.
The problem is that this controller does a lot of things (some of them duplicated), and they simply add up and result in poor performance.
I still don't think it's specifically related to the mongoengine issue mentioned above, because that endpoint works with small documents, and even if I comment out most of the "heavy" MongoDB operations, performance is still bad.
Having said that, the only option we may have to make the performance of that API endpoint somewhat decent (compared to scheduling actions directly via the rules engine) is re-creating it from the ground up, keeping the list of steps it performs to a minimum.
In the mean time, a not so great, but still a work-around would be to run more st2api processes and load-balance between them.
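For example, based on the gunicorn invocation shown earlier in this thread, that roughly means raising the worker count. How the argument is wired up depends on how st2api is deployed (systemd unit, environment/defaults file, etc.), so treat this as a sketch:

```shell
# Sketch only: run st2api with more gunicorn workers (behind a load balancer).
# On package installs the worker count is normally set via the service's
# environment/defaults file rather than by invoking gunicorn by hand.
/opt/stackstorm/st2/bin/gunicorn st2api.wsgi:application \
    -k eventlet -b 127.0.0.1:9101 \
    --workers 10 --threads 1 \
    --graceful-timeout 10 --timeout 30
```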
I'm 100% ok with
In the mean time, a not so great, but still a work-around would be to run more st2api processes and load-balance between them.
Going to test now.
Scaling the workers to 10, using a concurrency policy of 5, seems to have given a pretty sizable increase in performance. None of the requests take over 2.5 seconds, whereas previously they would climb to 12 seconds as time went on.
2018-09-05 17:17:52,311 140528773867472 INFO logging [-] f59d8708-999e-4dfd-a9ff-48a17499cd5d - 201 4637 1616.369ms (content_length=4637,request_id='f59d8708-999e-4dfd-a9ff-48a17499cd5d',runtime=1616.369,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:52,821 140528798506352 INFO logging [-] e29c1d43-6f56-49d6-9928-db647dae9f72 - 201 4637 1622.164ms (content_length=4637,request_id='e29c1d43-6f56-49d6-9928-db647dae9f72',runtime=1622.164,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:53,440 140528798485392 INFO logging [-] d7ffba69-ffae-4cf5-a678-1fe164b73bcb - 201 4637 1436.057ms (content_length=4637,request_id='d7ffba69-ffae-4cf5-a678-1fe164b73bcb',runtime=1436.057,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:54,628 140528773871088 INFO logging [-] 4a53007f-2a45-4148-ba3e-7054419ce30e - 201 3338 1243.3ms (content_length=3338,request_id='4a53007f-2a45-4148-ba3e-7054419ce30e',runtime=1243.3,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:55,012 140528798483792 INFO logging [-] 72d48943-9846-4fc8-b6e9-9e7b7122bd87 - 201 4637 1289.648ms (content_length=4637,request_id='72d48943-9846-4fc8-b6e9-9e7b7122bd87',runtime=1289.648,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,261 140528798488688 INFO logging [-] 006cb59d-44a0-4ba0-8669-6673f14eb49a - 201 3338 1436.544ms (content_length=3338,request_id='006cb59d-44a0-4ba0-8669-6673f14eb49a',runtime=1436.544,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,345 140528773864016 INFO logging [-] 61ad23ad-199d-48c8-91bc-15d768368cea - 201 3338 1524.373ms (content_length=3338,request_id='61ad23ad-199d-48c8-91bc-15d768368cea',runtime=1524.373,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,569 140528773871088 INFO logging [-] fda71d93-657c-4481-9a95-3eaeefd77e7f - 201 4637 1628.84ms (content_length=4637,request_id='fda71d93-657c-4481-9a95-3eaeefd77e7f',runtime=1628.84,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,901 140528773867472 INFO logging [-] f7421eb5-c465-4c9b-a09d-39729c345ede - 201 4637 1609.835ms (content_length=4637,request_id='f7421eb5-c465-4c9b-a09d-39729c345ede',runtime=1609.835,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
Previous runtime with the st2api worker process (1 worker + 1 thread):
| + 5b8ed90cfa214313ae6b6e1a | voltron.with-items-test | admin | succeeded | Tue, 04 Sep 2018 19:12:12 | Tue, 04 Sep 2018 19:37:54 |
| | | | (1542s elapsed) | UTC | UTC |
Current runtime with st2api worker processes (10 workers + 1 thread):
| + 5b900d9b54f6151614f34514 | voltron.with-items-test | admin | succeeded (739s elapsed) | Wed, 05 Sep 2018 17:08:43 | Wed, 05 Sep 2018 17:21:02 |
| | | | | UTC | UTC |
@jbrownsc
I'm 100% ok with "In the mean time, a not so great, but still a work-around would be to run more st2api processes and load-balance between them." Going to test now.
Yeah this should help. Single process performance is not great, but it scales horizontally so running more workers (until saturating all the available CPU cycles) should definitely help.
If you can easily do it, it would also be great if you could test the changes from https://github.com/StackStorm/st2/pull/4331. I wonder how much they increase performance in your specific scenario.
I can let you know once the PR is merged into master and new v2.9dev packages are built.
Today I tested the changes from #4331 using the latest v2.9dev packages and the workflows from https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122 on an AWS t2.medium instance.
I initially tested with range 250, but then changed it to 50 so things complete faster (on v2.8.1 and a small AWS instance, range 250 simply takes way too long).
ubuntu@ip-172-31-24-123:/opt/stackstorm/packs/default$ st2 --version
st2 2.8.1, on Python 2.7.12
| + 5b96348bbe2a4332e2a2dca1 | default.with-items-test | st2admin | succeeded | Mon, 10 Sep 2018 09:08:27 | Mon, 10 Sep 2018 10:34:20 UTC |
| | | | (5153s elapsed) | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+
5153 seconds
| + 5b964a19be2a43669a8498d6 | default.with-items-test | st2admin | succeeded (619s elapsed) | Mon, 10 Sep 2018 10:40:25 | Mon, 10 Sep 2018 10:50:44 UTC |
| | | | | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+
619 seconds
ubuntu@ip-172-31-18-251:~$ st2 --version
st2 2.9dev (c4db536), on Python 2.7.12
| + 5b96364c88b07d3318ad9041 | default.with-items-test | st2admin | succeeded | Mon, 10 Sep 2018 09:15:56 | Mon, 10 Sep 2018 09:51:44 UTC |
| | | | (2148s elapsed) | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+
2148 seconds
| + 5b96414d88b07d3318ad9bfc | default.with-items-test | st2admin | succeeded (157s elapsed) | Mon, 10 Sep 2018 10:02:53 | Mon, 10 Sep 2018 10:05:30 UTC |
| | | | | UTC | |
157 seconds
From those numbers it looks like the speed up is more than 50% for this particular use case (scheduling actions via Mistral / StackStorm API) and workflow.
A couple of things:
Keep in mind that v2.8.1 uses the Mistral callbacks approach for getting Mistral results back into StackStorm. If you are using an older version which uses the polling approach, the results will differ.
st2api should scale horizontally so the throughput should also scale horizontally if there are spare CPU cycles and / or cores available (it's probably not perfect linear scaling, but should be close to it).
In my setup, scaling st2api worker processes beyond 1 didn't make sense because I used a small AWS instance and the CPU was already fully utilized by all the StackStorm services with a single st2api worker process (in fact, as expected, performance was worse, because it just saturated the already over-saturated CPU even more).
Is there any update on this? We still see rather slow performance with MongoDB for large datasets.
Working recently with a bunch of large workflows and seeing the st2api service using 100% CPU. I turned on a WSGI profiler (https://medium.com/@maxmaxmaxmax/measuring-performance-of-python-based-apps-using-gunicorn-and-cprofile-ee4027b2be41) and am seeing profile stack traces like so:
And general times reaching 100+ seconds as I let it go for longer and longer.
This is reproduced using the following action workflow: