StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

st2api being bogged down by MongoEngine #4030

Open nmaludy opened 6 years ago

nmaludy commented 6 years ago

I've been working recently with a bunch of large workflows and am seeing the st2api service using 100% CPU.

I turned on a WSGI profiler (https://medium.com/@maxmaxmaxmax/measuring-performance-of-python-based-apps-using-gunicorn-and-cprofile-ee4027b2be41) and am seeing profiler output like this:

2018-03-08 19:59:43,333 100167760 ERROR wsgi_profiler [-] 
[4525] [INFO] [POST] URI /v1/actionexecutions
2018-03-08 19:59:43,333 100167760 ERROR wsgi_profiler [-] [4525] [INFO]          21554881 function calls (20361663 primitive calls) in 60.828 seconds

   Ordered by: internal time, cumulative time
   List reduced from 1272 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  3888837    4.593    0.000    4.651    0.000 {isinstance}
    45770    3.272    0.000   25.510    0.001 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/document.py:36(__init__)
   374817    2.745    0.000    8.031    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/fields.py:123(__set__)
874662/830081    2.720    0.000   12.477    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/document.py:144(__setattr__)
53587/1799    2.446    0.000   11.942    0.007 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/dereference.py:163(_attach_objects)
1452191/1255348    2.137    0.000    6.932    0.000 {getattr}
  1052620    2.072    0.000    4.654    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:382(__getitem__)
   874913    1.789    0.000    2.500    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/common.py:5(_import_class)
879819/506309    1.747    0.000   12.198    0.000 {setattr}
   640806    1.711    0.000    5.291    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:396(get)
  1901626    1.603    0.000    1.603    0.000 {method 'get' of 'dict' objects}
   466160    1.531    0.000    4.158    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:410(iteritems)
45637/1056    1.477    0.000   48.524    0.046 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/document.py:672(_from_son)
53587/1799    1.432    0.000    4.081    0.002 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/dereference.py:80(_find_references)
     59/0    1.337    0.023    0.000          /opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/auth.py:482(authenticate)
   187230    1.191    0.000    1.191    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:454(__getattr__)
    45770    1.177    0.000    1.417    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:435(create)
155175/1491    1.032    0.000   30.454    0.020 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/fields.py:299(to_python)
   878632    1.032    0.000    1.032    0.000 {hasattr}
   421229    0.874    0.000    0.874    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:463(__setattr__)
   421229    0.863    0.000    2.633    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:389(__setitem__)
   281293    0.810    0.000    4.409    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/fields.py:113(__get__)
    88774    0.808    0.000    1.567    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:113(__getitem__)
    55696    0.711    0.000    4.900    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:426(__len__)
   222834    0.693    0.000    0.693    0.000 {method 'replace' of 'datetime.datetime' objects}
   466160    0.668    0.000    1.014    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/datastructures.py:424(<genexpr>)
    47312    0.647    0.000   18.743    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/base/fields.py:252(__get__)
114474/82442    0.637    0.000    1.464    0.000 /usr/lib64/python2.7/copy.py:145(deepcopy)
    44602    0.622    0.000    1.341    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/st2common/fields.py:67(_datetime_to_microseconds_since_epoch)
    89078    0.610    0.000    1.621    0.000 /opt/stackstorm/st2/lib/python2.7/site-packages/st2common/fields.py:52(_microseconds_since_epoch_to_datetime)

Overall request times reach 100+ seconds as I let it run longer and longer:

[6397] [INFO] [POST] URI /v1/actionexecutions Total time: 100.877137899

This can be reproduced using the following action metadata and workflow definition:

---
description: Run a local linux command
enabled: true
runner_type: mistral-v2
entry_point: workflows/mistral-nested.yaml
name: mistral-nested
pack: examples
parameters:
  cmd:
    required: true
    type: string
    default: "ls"
  timeout:
    type: integer
    default: 60

workflows/mistral-nested.yaml

version: '2.0'

examples.mistral-nested:
    description: A basic workflow that runs an arbitrary linux command.
    type: direct
    input:
        - cmd
        - timeout
        - task_names:
            - 'snmp'

    output:
      stdout: "{{ _.stdout }}"

    tasks:
      main:
        action: core.local
        input:
          cmd: "{{ _.cmd }}"
        publish:
          stdout: "main {{ task('main').result.stdout }}"
        on-complete:
          - nested_workflow_1

      nested_workflow_1:
        with-items: "x in {{ range(0, 250, 1) }}"
        action: examples.mistral-jinja-workbook-complex
        input:
          vm_name: "{{ _.cmd }}_{{ _.x }}"
        publish:
          stdout: "{{ task('nested_workflow_1').result }}"
nmaludy commented 6 years ago

Maybe this has something to do with it? https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-comapred-to-native-pymongo-usage

I'm sure the parent workflow document is very large, with all of the results stored in one big workflow object.

bigmstone commented 6 years ago

And here I was wondering what I was going to do in the morning. ;)

jbrownsc commented 6 years ago

I am also hitting this issue with 100s of items in a list. Let me know if there is anything I can provide to help troubleshoot. @nmaludy was kind enough to share his strategy for dealing with this, but I would love to get a bump on getting a fix into st2.

Kami commented 6 years ago

Did you encounter this issue when hitting that API endpoint using the WebUI (or some other method)?

https://github.com/StackStorm/st2/pull/4300 should help with WebUI performance a lot once that PR is merged and the WebUI starts utilizing it (that should be in the next release).

Besides that, I hope to have a look at this exact issue when I get a chance (likely late next week).

nmaludy commented 6 years ago

@Kami I hit this issue by running a Mistral workflow with a lot of with-items.

jbrownsc commented 6 years ago

This isn't so much a "UI rendering is slow" issue as a "my workflow slows down as the number of executions increases" issue. We use with-items to handle deployment, so with 5-10 items running subworkflows we have great performance; with 50 items we have bad performance :-)

jbrownsc commented 6 years ago

Here is a good quantification of the issue. Notice, in a 50-item with-items execution, how each iteration takes longer and longer; it only gets worse at 100, 200, etc. As @nmaludy said, the st2api calls can be seen increasing in latency in the logs. The larger each task's result (the output specifically), the longer the workflow takes.

st2 execution get 5b8442b6cd0fbf5236e3ad00 | grep nested_workflow_1
result_task: nested_workflow_1
| + 5b8442b7cd0fbf5236e3ad03  | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:28:07 UTC |
| + 5b8442c4cd0fbf5236e3ad17  | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:28:20 UTC |
| + 5b8442d2cd0fbf5236e3ad2b  | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:28:34 UTC |
| + 5b8442dfcd0fbf5236e3ad3f  | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:28:47 UTC |
| + 5b8442eccd0fbf5236e3ad53  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:29:00 UTC |
| + 5b8442fbcd0fbf5236e3ad67  | succeeded (12s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:29:15 UTC |
| + 5b844308cd0fbf5236e3ad7b  | succeeded (13s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:29:28 UTC |
| + 5b844316cd0fbf5236e3ad8f  | succeeded (13s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:29:42 UTC |
| + 5b844325cd0fbf5236e3ada3  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:29:56 UTC |
| + 5b844334cd0fbf5236e3adb7  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:30:12 UTC |
| + 5b844344cd0fbf5236e3adcb  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:30:28 UTC |
| + 5b844353cd0fbf5236e3addf  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:30:43 UTC |
| + 5b844363cd0fbf5236e3adf3  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:30:59 UTC |
| + 5b844373cd0fbf5236e3ae07  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:31:14 UTC |
| + 5b844382cd0fbf5236e3ae1b  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:31:30 UTC |
| + 5b844391cd0fbf5236e3ae2f  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:31:45 UTC |
| + 5b8443a1cd0fbf5236e3ae43  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:32:01 UTC |
| + 5b8443b0cd0fbf5236e3ae57  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:32:16 UTC |
| + 5b8443c0cd0fbf5236e3ae6b  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:32:32 UTC |
| + 5b8443cfcd0fbf5236e3ae7f  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:32:47 UTC |
| + 5b8443e0cd0fbf5236e3ae93  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:33:04 UTC |
| + 5b8443f0cd0fbf5236e3aea7  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:33:20 UTC |
| + 5b844400cd0fbf5236e3aebb  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:33:36 UTC |
| + 5b84440fcd0fbf5236e3aecf  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:33:51 UTC |
| + 5b84441fcd0fbf5236e3aee3  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:34:06 UTC |
| + 5b84442ecd0fbf5236e3aef7  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:34:22 UTC |
| + 5b84443fcd0fbf5236e3af0b  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:34:38 UTC |
| + 5b84444ecd0fbf5236e3af1f  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:34:54 UTC |
| + 5b84445dcd0fbf5236e3af33  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:35:09 UTC |
| + 5b84446dcd0fbf5236e3af47  | succeeded (14s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:35:25 UTC |
| + 5b84447dcd0fbf5236e3af5b  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:35:40 UTC |
| + 5b84448dcd0fbf5236e3af6f  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:35:57 UTC |
| + 5b84449dcd0fbf5236e3af83  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:36:13 UTC |
| + 5b8444aecd0fbf5236e3af97  | succeeded (19s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:36:29 UTC |
| + 5b8444c1cd0fbf5236e3afab  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:36:49 UTC |
| + 5b8444d2cd0fbf5236e3afbf  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:37:06 UTC |
| + 5b8444e4cd0fbf5236e3afd3  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:37:23 UTC |
| + 5b8444f5cd0fbf5236e3afe7  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:37:41 UTC |
| + 5b844507cd0fbf5236e3affb  | succeeded (15s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:37:59 UTC |
| + 5b844518cd0fbf5236e3b00f  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:38:15 UTC |
| + 5b844529cd0fbf5236e3b023  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:38:33 UTC |
| + 5b84453bcd0fbf5236e3b037  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:38:51 UTC |
| + 5b84454ccd0fbf5236e3b04b  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:39:08 UTC |
| + 5b84455ecd0fbf5236e3b05f  | succeeded (16s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:39:26 UTC |
| + 5b844570cd0fbf5236e3b073  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:39:43 UTC |
| + 5b844581cd0fbf5236e3b087  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:40:01 UTC |
| + 5b844593cd0fbf5236e3b09b  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:40:19 UTC |
| + 5b8445a6cd0fbf5236e3b0af  | succeeded (18s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:40:37 UTC |
| + 5b8445b8cd0fbf5236e3b0c3  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:40:56 UTC |
| + 5b8445cacd0fbf5236e3b0d7  | succeeded (17s elapsed) | nested_workflow_1 | examples.mistral-jinja-      | Mon, 27 Aug 2018 18:41:14 UTC |
Kami commented 6 years ago

I'll try to reproduce this issue today.

The weird thing is that the POST operation takes a long time - "[6397] [INFO] [POST] URI /v1/actionexecutions Total time: 100.877137899".

The POST (aka "schedule action execution") operation is quite simple, so there is no reason it should take that long (it inserts a couple of objects into the database and publishes a message on the message bus).

It's likely that some other API operation (e.g. GET) is bogging it down, or that there is some weird contention going on on the message bus or similar.
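
For reference, this is roughly what a single schedule request looks like when sent straight to st2api; a minimal sketch, assuming a default single-box install (host, port, and credentials are placeholders):

# Grab an auth token and schedule one core.local execution directly against
# st2api, bypassing Mistral entirely, to get a baseline for an isolated request.
TOKEN=$(st2 auth st2admin -p 'st2password' -t)
curl -s -X POST http://127.0.0.1:9101/v1/actionexecutions \
  -H "X-Auth-Token: ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"action": "core.local", "parameters": {"cmd": "ls"}}'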

Kami commented 6 years ago

After some digging in - I wonder if the issue lies somewhere else and not in st2api.

If I run the workflow above, which uses Mistral to run a lot of actions in a short time-frame, I notice large CPU utilization by mistral-server and, to a lesser extent, st2rulesengine.

The st2rulesengine utilization is expected, because each execution results in at least one trigger instance.

So could this issue simply be due to resource exhaustion caused by mistral-server hogging most of the available CPU?

Kami commented 6 years ago

@nmaludy @jbrownsc

I tried to replicate this issue on a single server setup.

After some more instrumentation and digging in, I noticed the same behavior as reported above.

When scheduling a lot of executions in a short time-frame through a Mistral workflow, mistral-server starts hogging a lot of CPU, and as such everything else (including st2api) slows down due to the lack of resources.

10-20 tasks work relatively fine, but as soon as I move above 30 or so, things start to get very slow (and I noticed very high server CPU utilization, aka the server basically being bogged down). Things get even worse if I change the with-items action from core.local to some Mistral workflow action, because that causes additional load on Mistral.

If I concurrently schedule a lot of runs of the core.local action, for example, I don't see this issue, which seems to back up my claim that Mistral is hogging the CPU.

In short - I don't think there is any quick or easy code change we can make at the moment. The root cause seems to be related to inefficiencies in Mistral, and there is nothing we can do besides advising users to dedicate more resources to Mistral and other StackStorm components (ideally running StackStorm services across multiple correctly sized servers).

One thing which may also help (but I didn't have time to verify it) is splitting one large Mistral workflow with a lot of child tasks / actions into multiple Mistral workflows with fewer child tasks.

Going forward, we will ensure that with-items with a lot of items (e.g. scheduling a lot of actions through an Orquesta workflow) performs well in Orquesta, our new workflow engine.

@nmaludy Are you running all of the StackStorm components on a single server?

@m4dcoder See above.

One of our "benchmarks" for Orquesta should also be an Orquesta workflow with a lot of with-items operations, verifying that it has good / reasonable performance.

m4dcoder commented 6 years ago

I don't think this is with-items only. This seems like a known issue with the scheduler in Mistral. If you are executing a lot of concurrent workflows through Mistral, Mistral will try to schedule all of the workflows and tasks at once. As tasks complete, it will schedule the next tasks immediately. Soon there are a lot of tasks waiting to be executed, and each workflow takes longer and longer to complete because there are not enough st2 action runners to execute them. At this point in Mistral's lifecycle, since it is going to be deprecated soon, we don't intend to put a lot of time into fixing the scheduler. However, @Kami we have a line item on our roadmap to address this in Orquesta, our new workflow engine.

The workaround on Mistral here is to define a concurrency policy in st2 (https://docs.stackstorm.com/reference/policies.html#concurrency) for the workflow being executed by the with-items task. Also, please note that if you average the completion time of the workflow executions in the with-items, the average time to completion is more or less the same. The same amount of work (or tasks) gets completed over time as a function of the available system resources and st2 action runners.

jbrownsc commented 6 years ago

@m4dcoder I have tested this with a concurrency policy, and the same issue exists. I'm OK with saying this is a won't-fix for Mistral, but the core issue of st2api being slow seems like it will exist with Orquesta too, right?

@Kami with regard to the performance of st2api, the POST calls should not be taking 3-5 seconds, correct?

2018-09-04 19:18:40,894 140660106998480 INFO logging [-] 08eb4db0-523f-4d10-a2f7-ee6c6ffe6e66 - 201 3988 4573.698ms (content_length=3988,request_id='08eb4db0-523f-4d10-a2f7-ee6c6ffe6e66',runtime=4573.698,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:43,374 140660124260944 INFO logging [-] 7a137ae4-47ee-44a9-b64e-f0c7cc12dc1e - 201 3988 4626.373ms (content_length=3988,request_id='7a137ae4-47ee-44a9-b64e-f0c7cc12dc1e',runtime=4626.373,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,462 140660107001360 INFO logging [-] 7a121281-20a5-4071-9440-6e9e6eda5d19 - 201 3988 6002.316ms (content_length=3988,request_id='7a121281-20a5-4071-9440-6e9e6eda5d19',runtime=6002.316,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,491 140660106999760 INFO logging [-] 10c27e29-7ea8-4279-b02f-ef933a6883ef - 201 3988 6562.452ms (content_length=3988,request_id='10c27e29-7ea8-4279-b02f-ef933a6883ef',runtime=6562.452,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,634 140660124261104 INFO logging [-] 444e88f0-8b5e-474f-a5a2-cac5961689ec - 201 3988 5301.957ms (content_length=3988,request_id='444e88f0-8b5e-474f-a5a2-cac5961689ec',runtime=5301.957,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:44,720 140660106998480 INFO logging [-] 527f6af9-9bed-450a-ad8e-f8009376fb68 - 201 3988 3363.479ms (content_length=3988,request_id='527f6af9-9bed-450a-ad8e-f8009376fb68',runtime=3363.479,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:48,806 140660107001040 INFO logging [-] 789239ac-277f-47ab-87fb-60289625e112 - 201 3988 4801.581ms (content_length=3988,request_id='789239ac-277f-47ab-87fb-60289625e112',runtime=4801.581,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,820 140660106999920 INFO logging [-] 654f3164-5860-4406-932b-d659edd18fba - 201 3988 5800.771ms (content_length=3988,request_id='654f3164-5860-4406-932b-d659edd18fba',runtime=5800.771,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,855 140660124260944 INFO logging [-] f4e36f80-3762-4bf3-8363-1ea9e08f5e8f - 201 3988 5836.541ms (content_length=3988,request_id='f4e36f80-3762-4bf3-8363-1ea9e08f5e8f',runtime=5836.541,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:49,999 140660100791344 INFO logging [-] 12a4ff02-3a55-471b-988e-98bf72b56ef8 - 201 3988 4796.937ms (content_length=3988,request_id='12a4ff02-3a55-471b-988e-98bf72b56ef8',runtime=4796.937,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:50,028 140660100791664 INFO logging [-] 5e0d0812-7911-4b77-938d-dec01df0a7f7 - 201 3988 4824.118ms (content_length=3988,request_id='5e0d0812-7911-4b77-938d-dec01df0a7f7',runtime=4824.118,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:53,794 140660106998000 INFO logging [-] be911c6a-e888-4a8f-95e5-53146d983ee6 - 201 3988 3271.721ms (content_length=3988,request_id='be911c6a-e888-4a8f-95e5-53146d983ee6',runtime=3271.721,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:54,913 140660100788624 INFO logging [-] 9258e4bc-d473-42c4-b637-90c94ceb104e - 201 3988 5142.16ms (content_length=3988,request_id='9258e4bc-d473-42c4-b637-90c94ceb104e',runtime=5142.16,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:55,988 140660100791504 INFO logging [-] bdc16a90-3dc6-4719-a930-5bc260d72337 - 201 3988 6218.387ms (content_length=3988,request_id='bdc16a90-3dc6-4719-a930-5bc260d72337',runtime=6218.387,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-04 19:18:56,035 140660107000240 INFO logging [-] 9785f5fb-493d-44d6-910b-12c6aeaa0e4c - 201 3988 5516.225ms (content_length=3988,request_id='9785f5fb-493d-44d6-910b-12c6aeaa0e4c',runtime=5516.225,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
m4dcoder commented 6 years ago

@jbrownsc I'm only commenting on the performance relating to Mistral workflows and with-items. I'll let @Kami take the question on st2api performance. As for trying it out w/ a concurrency policy, what's the actual policy spec?

jbrownsc commented 6 years ago

EDITED... WORKFLOW REPLACED WITH https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122

m4dcoder commented 6 years ago

@jbrownsc we don't support concurrency policies on inline subworkflows (i.e. subworkflows in a Mistral workbook). Try to make another_workflow into another st2 action. Then write a concurrency policy on another_workflow that limits it to 5 executions at a time as a baseline. The concurrency attribute defined within the Mistral workbook only has limited scope. If you use the st2 concurrency policy (a separate policy file), it will limit the concurrency on that action system-wide. Anyway, you can give that a try.
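
For reference, a minimal sketch of what such a policy file might look like (mypack.another_workflow is the hypothetical action from the comment above; the threshold matches the suggested baseline of 5):

---
# Hypothetical st2 concurrency policy limiting another_workflow to 5 concurrent executions.
name: mypack.another_workflow.concurrency
description: Limit another_workflow to 5 concurrent executions
enabled: true
policy_type: action.concurrency
resource_ref: mypack.another_workflow
parameters:
  threshold: 5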

jbrownsc commented 6 years ago

Here is the new workflow.

with-items-test.yaml

---
name: "with-items-test"
description: "For testing things in Mistral"
runner_type: mistral-v2
entry_point: workflows/with-items-test.yaml
enabled: true
parameters:
  workflow:
    default: voltron.with-items-test.main

with-items-test-subworkflow.yaml

---
name: "with-items-test-subworkflow"
description: "For testing things in Mistral"
runner_type: mistral-v2
entry_point: workflows/with-items-test-subworkflow.yaml
enabled: true
parameters:
  workflow:
    default: voltron.with-items-test-subworkflow.main

workflows/with-items-test.yaml

version: '2.0'
name: 'voltron.with-items-test'
description: Noop

workflows:
  main:
    type: direct
    tasks:
      with_items_test:
        with-items: "blah in {{ range(0,250) }}"
        concurrency: 5
        action: voltron.with-items-test-subworkflow

workflows/with-items-test-subworkflow.yaml

version: '2.0'
name: 'voltron.with-items-test-subworkflow'
description: Easy-peasy subworkflow

workflows:
  main:
    type: direct
    tasks:
      test_1:
        action: core.local
        input:
          cmd: echo test_1
        on-success:
          - test_2

      test_2:
        action: core.local
        input:
          cmd: echo test_2
        on-success:
          - test_3

      test_3:
        action: core.local
        input:
          cmd: echo test_3
        on-success:
          - test_4

      test_4:
        action: core.local
        input:
          cmd: echo test_4
        on-success:
          - test_5

      test_5:
        action: core.local
        input:
          cmd: echo test_5
jbrownsc commented 6 years ago

From tcpdump

Host: 127.0.0.1:9101
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.19.1
st2-context: {"parent": {"parent": {"user": "admin", "execution_id": "5b8ee0b6fa214313ae6b77ea", "pack": "voltron"}, "pack": "voltron", "user": "admin", "execution_id": "5b8ee2cafa214313ae6b7cfa", "mistral": {"action_execution_id": "35197061-55a2-483e-bd3b-2bf6e5a9d922", "workflow_name": "voltron.with-items-test.main", "task_execution_id": "2df8478b-fe37-4e5a-98ee-132f1e69093b", "workflow_execution_id": "653d5d03-a769-491d-9416-8e037bf095de", "task_name": "with_items_test", "callback_url": "/v2/action_executions/35197061-55a2-483e-bd3b-2bf6e5a9d922"}}, "mistral": {"action_execution_id": "aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2", "workflow_name": "voltron.with-items-test-subworkflow.main", "task_execution_id": "dbf5f312-6674-4c0b-8b6e-793ec2695ae4", "workflow_execution_id": "7eb62944-0900-4d0b-9473-0bfec9307d31", "task_name": "test_5", "callback_url": "/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2"}}
Content-Type: application/json
Content-Length: 202

{"action": "core.local", "callback": {"url": "http://0.0.0.0:8989/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2", "source": "mistral_v2"}, "notify": {}, "parameters": {"cmd": "echo test_5"}}

HTTP/1.1 201 Created
Server: gunicorn/19.8.1
Date: Tue, 04 Sep 2018 19:54:19 GMT
Connection: keep-alive
Content-Type: application/json
Content-Length: 4637
Set-Cookie: auth-token=03c0a58ef94a4330b344bdfa3e9061fb; Max-Age=86376; Path=/; expires=Wed, 05-Sep-2018 19:53:49 GMT; HttpOnly
Access-Control-Allow-Origin: http://127.0.0.1:3000
Access-Control-Allow-Methods: GET,POST,PUT,DELETE,OPTIONS
Access-Control-Allow-Headers: Content-Type,Authorization,X-Auth-Token,St2-Api-Key,X-Request-ID
Access-Control-Allow-Credentials: true
Access-Control-Expose-Headers: Content-Type,X-Limit,X-Total-Count,X-Request-ID
X-Request-ID: c701ac2e-fa42-49ec-a0af-5238d8180b19

{
    "status": "requested",
    "start_timestamp": "2018-09-04T19:54:13.433146Z",
    "parent": "5b8ee2cafa214313ae6b7cfa",
    "parameters": {
        "cmd": "echo test_5"
    },
    "runner": {
        "runner_module": "local_shell_command_runner",
        "uid": "runner_type:local-shell-cmd",
        "runner_package": "local_runner",
        "description": "A runner to execute local actions as a fixed user.",
        "enabled": true,
        "runner_parameters": {
            "sudo": {
                "default": false,
                "type": "boolean",
                "description": "The command will be executed with sudo."
            },
            "sudo_password": {
                "default": null,
                "secret": true,
                "required": false,
                "type": "string",
                "description": "Sudo password. To be used when paswordless sudo is not allowed."
            },
            "env": {
                "type": "object",
                "description": "Environment variables which will be available to the command(e.g. key1=val1,key2=val2)"
            },
            "cmd": {
                "type": "string",
                "description": "Arbitrary Linux command to be executed on the host."
            },
            "kwarg_op": {
                "default": "--",
                "type": "string",
                "description": "Operator to use in front of keyword args i.e. \"--\" or \"-\"."
            },
            "timeout": {
                "default": 60,
                "type": "integer",
                "description": "Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."
            },
            "cwd": {
                "type": "string",
                "description": "Working directory where the command will be executed in"
            }
        },
        "id": "5b610d7ab2a78503a0c1e95a",
        "name": "local-shell-cmd"
    },
    "web_url": "https://2f7ce1346250/#/history/5b8ee2e8fa214313ae6b7d29/general",
    "context": {
        "pack": "core",
        "user": "admin",
        "parent": {
            "parent": {
                "user": "admin",
                "execution_id": "5b8ee0b6fa214313ae6b77ea",
                "pack": "voltron"
            },
            "pack": "voltron",
            "user": "admin",
            "execution_id": "5b8ee2cafa214313ae6b7cfa",
            "mistral": {
                "action_execution_id": "35197061-55a2-483e-bd3b-2bf6e5a9d922",
                "workflow_name": "voltron.with-items-test.main",
                "task_execution_id": "2df8478b-fe37-4e5a-98ee-132f1e69093b",
                "workflow_execution_id": "653d5d03-a769-491d-9416-8e037bf095de",
                "task_name": "with_items_test",
                "callback_url": "/v2/action_executions/35197061-55a2-483e-bd3b-2bf6e5a9d922"
            }
        },
        "mistral": {
            "action_execution_id": "aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2",
            "workflow_name": "voltron.with-items-test-subworkflow.main",
            "task_execution_id": "dbf5f312-6674-4c0b-8b6e-793ec2695ae4",
            "workflow_execution_id": "7eb62944-0900-4d0b-9473-0bfec9307d31",
            "task_name": "test_5",
            "callback_url": "/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2"
        }
    },
    "action": {
        "description": "Action that executes an arbitrary Linux command on the localhost.",
        "runner_type": "local-shell-cmd",
        "tags": [],
        "enabled": true,
        "pack": "core",
        "entry_point": "",
        "notify": {},
        "uid": "action:core:local",
        "parameters": {
            "cmd": {
                "required": true,
                "type": "string",
                "description": "Arbitrary Linux command to be executed on the local host."
            },
            "sudo": {
                "immutable": true
            }
        },
        "ref": "core.local",
        "id": "5b610d7ab2a78503a0c1e963",
        "name": "local"
    },
    "liveaction": {
        "runner_info": {},
        "parameters": {
            "cmd": "echo test_5"
        },
        "action_is_workflow": false,
        "callback": {
            "url": "http://0.0.0.0:8989/v2/action_executions/aa5264a4-7e32-4350-b1c7-6b37ac9aa3e2",
            "source": "mistral_v2"
        },
        "action": "core.local",
        "id": "5b8ee2e5fa214313ae6b7d25"
    },
    "id": "5b8ee2e8fa214313ae6b7d29",
    "log": [
        {
            "status": "requested",
            "timestamp": "2018-09-04T19:54:14.544513Z"
        }
    ]
}
jbrownsc commented 6 years ago

I turned the MongoDB profiling level up to 2 (https://docs.mongodb.com/manual/reference/method/db.setProfilingLevel/) and I see a lot of createIndexes calls as well. When I cancel the execution for the workflow above, those calls go away, so I'm guessing st2api is making those calls to Mongo.

2018-09-04T20:15:00.017+0000 I COMMAND  [conn552] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.021+0000 I COMMAND  [conn31] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "trace_tag_1", key: { trace_tag: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.021+0000 I COMMAND  [conn551] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "start_timestamp_1", key: { start_timestamp: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.022+0000 I COMMAND  [conn372] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "action_executions.object_id_1", key: { action_executions.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.023+0000 I COMMAND  [conn568] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "trigger_instances.object_id_1", key: { trigger_instances.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.024+0000 I COMMAND  [conn369] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "rules.object_id_1", key: { rules.object_id: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.025+0000 I COMMAND  [conn552] command st2.$cmd command: createIndexes { createIndexes: "trace_d_b", indexes: [ { background: false, name: "start_timestamp_-1_trace_tag_1", key: { start_timestamp: -1, trace_tag: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.032+0000 I COMMAND  [conn31] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_1", key: { occurrence_time: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.034+0000 I COMMAND  [conn551] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "trigger_1", key: { trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.038+0000 I COMMAND  [conn372] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_-1_trigger_1", key: { occurrence_time: -1, trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.040+0000 I COMMAND  [conn568] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.052+0000 I COMMAND  [conn551] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_1", key: { occurrence_time: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.053+0000 I COMMAND  [conn372] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "trigger_1", key: { trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.053+0000 I COMMAND  [conn568] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "occurrence_time_-1_trigger_1", key: { occurrence_time: -1, trigger: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
2018-09-04T20:15:00.054+0000 I COMMAND  [conn369] command st2.$cmd command: createIndexes { createIndexes: "trigger_instance_d_b", indexes: [ { background: false, name: "status_1", key: { status: 1 } } ], writeConcern: {} } numYields:0 reslen:149 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_query 0ms
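
For anyone else reproducing this, a minimal sketch of toggling profiling from the command line, assuming the default st2 database name and a local mongod reachable without auth (add -u/-p if your install has MongoDB auth enabled):

# Level 2 = profile every operation; entries are written to the system.profile collection.
mongo st2 --eval 'db.setProfilingLevel(2)'

# Inspect the most recent profiled operations.
mongo st2 --eval 'db.system.profile.find().sort({ts: -1}).limit(5).forEach(printjson)'

# Turn profiling back off when done.
mongo st2 --eval 'db.setProfilingLevel(0)'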
jbrownsc commented 6 years ago

Also looking at the MongoDB query log, I see about 399 queries in 1 second.

cat /tmp/mongo_queries_1_second | perl -ne 'print "$1\n" if /command: (.*?)\s/' | sort | uniq -c | sort -rn
 188 createIndexes
 160 find
  17 update
  11 insert
   4 count
   1 delete
jbrownsc commented 6 years ago

I still think @nmaludy pointed out the DictField serialization issue in all its glory: https://github.com/MongoEngine/mongoengine/issues/1230. This also references https://github.com/MongoEngine/mongoengine/issues/1230#issuecomment-316308308, which encapsulates the index issue.

Kami commented 6 years ago

What does the server and service load look like (CPU and I/O usage) when you experience those issues?

I was able to reproduce slow API responses the other day, but that was due to mistral-server hogging most of the CPU (aka the server being overloaded) and nothing in particular to do with st2api.

Again, I would need more information (specifically server and service CPU usage) so we can get a clearer picture of what is going on.

nmaludy commented 6 years ago

When I've seen this happen, st2api is consuming 1 full CPU (100%), mistral-server is consuming most of a CPU (75-100%, back and forth), and the rest of the box has minimal usage. These hosts have 8 vCPUs and 16 GB of memory, so there are plenty of other resources on the box for other services to consume, including the kernel.

Kami commented 6 years ago

Are all of those services running on the same box? Did you try pinning the st2api and mistral-server processes to different cores / CPUs? Both services are single-threaded.

On a related note - what does MongoDB resource utilization look like? As mentioned before, the "schedule execution" API operation is relatively simple; it does the following:

  1. Renders config (no-op if there is no config for a pack)
  2. Renders action parameters (shouldn't take long for common use case)
  3. Insert new execution object (simple small object)
  4. Dispatches some messages on the messages bus

None of those operations are particularly complex. Only 1) and 2) could get somewhat slower if you are doing something crazy with the config / action parameters. So it's unlikely that the operation itself is the culprit.

It looks like I will need to try to reproduce this again on a server with more resources. Last time I reproduced it, mistral-server was the bottleneck and the cause of the slow st2api responses and large CPU usage.

On a related note - I'm also working on some improvements to our metrics framework (I've already made a lot of them, targeted for the v2.9.0 release).

It will allow StackStorm services to easily send CPU and memory usage to the metrics backend (statsd at the moment). This should make troubleshooting issues like this much easier in the future.

jbrownsc commented 6 years ago

st2api pegs 1 CPU. Mongo, Postgres, Redis, Rabbitmq all seem like they're doing something, but nothing major. Disk access is around 60/100 write iops on Mongo.

ST2 Process Usage

ps auxf | sort -rnk 3 | head -n5
root      9595 36.3  0.8 201224 99036 pts/5    R+   14:56   4:33          \_ /opt/stackstorm/st2/bin/python /opt/stackstorm/st2/bin/gunicorn st2api.wsgi:application -k eventlet -b 127.0.0.1:9101 --workers 1 --threads 1 --graceful-timeout 10 --timeout 30
mistral    396  3.4  1.0 295400 132836 ?       Ss   Sep03  93:22 /opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/mistral-server --server engine,executor,notifier --config-file /etc/mistral/mistral.conf --log-file /var/log/mistral/mistral-server.log
st2         52  3.3  0.7 191560 93488 ?        Ss   Sep03  89:19 /opt/stackstorm/st2/bin/python /opt/stackstorm/st2/bin/st2notifier --config-file /etc/st2/st2.conf
root      5120  1.8  0.6 106688 85376 ?        Ss   09:01   6:57 tmux -2 -f /usr/share/byobu/profiles/tmuxrc new-session -n - /usr/bin/byobu-shell
mistral   9592  1.5  0.7 259672 96236 ?        S    14:56   0:11  \_ /opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 2 mistral.api.wsgi --graceful-timeout 10

Mongo IO

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44.49    0.00    6.53    0.20    0.00   48.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00    82.00    0.00   54.00     0.00   616.00    22.81     0.01    0.19    0.00    0.19   0.19   1.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.39    0.00   11.11    0.21    0.00   46.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00   118.00    0.00   83.00     0.00   924.00    22.27     0.02    0.24    0.00    0.24   0.24   2.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41.53    0.00    6.05    0.20    0.00   52.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00    58.00    0.00   54.00     0.00   512.00    18.96     0.00    0.00    0.00    0.00   0.00   0.00

Mongodb CPU

ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb      1  1.7  3.0 1340620 368672 ?      Ssl  Sep03  47:36 mongod

RabbitMQ

ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     28876  0.0  0.0  18204  3192 pts/0    Ss   09:31   0:00 bash
root     33734  0.0  0.0  36632  2764 pts/0    R+   15:11   0:00  \_ ps auxf
rabbitmq     1  0.0  0.0   4288  1576 ?        Ss   Sep03   0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq   182  0.0  0.0  30540   248 ?        S    Sep03   0:02 /usr/lib/erlang/erts-8.2.1/bin/epmd -daemon
rabbitmq   314  1.6  1.5 4454428 184392 ?      Sl   Sep03  44:06 /usr/lib/erlang/erts-8.2.1/bin/beam.smp -W w -A 80 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -K true -B i
rabbitmq   429  0.0  0.0   4180  1176 ?        Ss   Sep03   0:08  \_ erl_child_setup 1048576
rabbitmq   444  0.0  0.0  11424   960 ?        Ss   Sep03   0:00      \_ inet_gethost 4
rabbitmq   445  0.0  0.0  13544  1508 ?        S    Sep03   0:00          \_ inet_gethost 4

Postgresql

ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2768  1.2  0.0  19868  3704 pts/0    Ss   15:12   0:00 bash
root      2777  0.0  0.0  38304  3336 pts/0    R+   15:12   0:00  \_ ps auxf
postgres     1  0.0  0.1 287492 23700 ?        Ss   Sep03   0:07 postgres
postgres    26  0.0  0.8 287612 104888 ?       Ss   Sep03   0:07 postgres: checkpointer process
postgres    27  0.0  0.0 287492  3796 ?        Ss   Sep03   0:05 postgres: writer process
postgres    28  0.0  0.0 287492  8464 ?        Ss   Sep03   0:07 postgres: wal writer process
postgres    29  0.0  0.0 287892  6884 ?        Ss   Sep03   0:02 postgres: autovacuum launcher process
postgres    30  0.0  0.0 142624  4540 ?        Ss   Sep03   0:34 postgres: stats collector process
postgres  2660  0.1  0.9 289584 119612 ?       Ss   07:12   0:57 postgres: mistral-user mistral 192.168.208.6(44226) idle
postgres  2661  0.1  0.9 289796 119636 ?       Ss   07:12   0:51 postgres: mistral-user mistral 192.168.208.6(44228) idle
postgres  2662  0.1  0.9 289556 119416 ?       Ss   07:12   0:51 postgres: mistral-user mistral 192.168.208.6(44248) idle in transaction
postgres  2663  0.1  0.9 289528 119684 ?       Ss   07:12   0:55 postgres: mistral-user mistral 192.168.208.6(44254) idle
postgres  2755  0.8  0.8 289428 106000 ?       Ss   14:51   0:10 postgres: mistral-user mistral 192.168.208.6(35828) idle
postgres  2756  0.7  0.8 289432 105376 ?       Ss   14:51   0:09 postgres: mistral-user mistral 192.168.208.6(35848) idle
postgres  2765  0.0  0.1 289316 14092 ?        Ss   14:56   0:00 postgres: mistral-user mistral 192.168.208.6(36606) idle
postgres  2766  0.1  0.1 289324 14092 ?        Ss   14:56   0:01 postgres: mistral-user mistral 192.168.208.6(36608) idle
postgres  2767  0.0  0.1 289192 14544 ?        Ss   15:04   0:00 postgres: mistral-user mistral 192.168.208.6(38164) idle
jbrownsc commented 6 years ago

We have provided a pretty usable test case in https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122

  1. Can we try to disable auto indexes on the documents?
  2. Can we try another dictionary serialization method (like dict as_pymongo)?
  3. Can we profile with the changes above?
Kami commented 6 years ago

On a related note - which version of StackStorm are you using? I tested it with the latest version (Mistral callbacks are used and not the query service).

In any case, I / we need to approach this issue more systematically. There are a lot of potential causes mentioned in this issue, but they could be unrelated to the actual problem.

The first step should be trying to replicate the st2api execution-scheduling issue by taking Mistral out of the loop (using apachebench directly on POST /v1/executions - I will look into that).

It's unlikely that the mongoengine issue mentioned above (the large-dict issue) affects this, because, as mentioned, the POST to /v1/executions doesn't include any large data and only a small, simple object is inserted into the database.
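
For reference, a sketch of such an apachebench run against the schedule endpoint (the payload file, token variable, and request / concurrency counts are illustrative, not a prescribed benchmark):

# Minimal request body, the same shape as the payloads shown earlier in this thread.
cat > /tmp/execution.json <<'EOF'
{"action": "core.local", "parameters": {"cmd": "echo test"}}
EOF

# 200 requests, 10 concurrent, straight at st2api with Mistral out of the loop.
ab -n 200 -c 10 \
   -p /tmp/execution.json -T 'application/json' \
   -H "X-Auth-Token: ${ST2_AUTH_TOKEN}" \
   http://127.0.0.1:9101/v1/executions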

Kami commented 6 years ago

I was able to reproduce the POST /v1/executions API endpoint being slow under multiple concurrent HTTP requests using apachebench.

There is a lot going on in that controller, so it will be a slow process to find the root cause (it could be a lot of small, short blocking operations, such as rendering live params, blocking the main eventlet loop).

From what I've seen, the "get" queries in that API endpoint don't seem to be an issue. There are quite a few of them, but all of them return in under a millisecond and use an index. None of them return a lot of data either.

jbrownsc commented 6 years ago

@Kami Thanks for looking further. Perhaps the payload of the POST is not the issue, but the concurrency, as you mentioned. I don't know how st2api > mongoengine > mongo handles the createIndexes operations, but I can only assume they're causing some form of blocking. Is there any way we can run your apache bench test with "auto_create_index": False?

jbrownsc commented 6 years ago

Version

root@bb05c9f45aac:/# st2 --version
st2 2.8.1, on Python 2.7.6
Kami commented 6 years ago

Some more updates - I added a lot of instrumentation, and most of the operations performed inside that API controller (including the mongo stuff) finish relatively fast.

The problem is that this controller does a lot of things (some of them duplicated), and they simply add up and result in poor performance.

I still don't think it's specifically related to the mongoengine issue mentioned above, because that endpoint works with small documents, and even if I comment out most of the "heavy" MongoDB operations, performance is still bad.

Having said that, the only option we might have to make the performance of that API endpoint somewhat decent (compared to scheduling actions directly via the rules engine) might be re-creating it from the ground up, trying to keep the list of steps it performs to a minimum.

In the meantime, a not-so-great but still viable workaround would be to run more st2api processes and load-balance between them.
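
For reference, this is essentially the gunicorn invocation visible in the ps output earlier in this thread, just with more worker processes; how you actually override the worker count depends on how your init system / package launches st2api, so treat this as a sketch:

# Each gunicorn worker is a separate single-threaded st2api process; gunicorn
# balances incoming requests across workers on the shared listen socket.
/opt/stackstorm/st2/bin/gunicorn st2api.wsgi:application \
    -k eventlet -b 127.0.0.1:9101 \
    --workers 10 --threads 1 \
    --graceful-timeout 10 --timeout 30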

jbrownsc commented 6 years ago

I'm 100% ok with

In the meantime, a not-so-great but still viable workaround would be to run more st2api processes and load-balance between them.

Going to test now.

jbrownsc commented 6 years ago

Scaling the workers to 10, using a concurrency policy of 5, seems to have given a pretty sizable increase in performance. None of the requests are over 2.5 seconds, whereas previously they would climb to as much as 12 seconds as time went on.

2018-09-05 17:17:52,311 140528773867472 INFO logging [-] f59d8708-999e-4dfd-a9ff-48a17499cd5d - 201 4637 1616.369ms (content_length=4637,request_id='f59d8708-999e-4dfd-a9ff-48a
17499cd5d',runtime=1616.369,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:52,821 140528798506352 INFO logging [-] e29c1d43-6f56-49d6-9928-db647dae9f72 - 201 4637 1622.164ms (content_length=4637,request_id='e29c1d43-6f56-49d6-9928-db6
47dae9f72',runtime=1622.164,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:53,440 140528798485392 INFO logging [-] d7ffba69-ffae-4cf5-a678-1fe164b73bcb - 201 4637 1436.057ms (content_length=4637,request_id='d7ffba69-ffae-4cf5-a678-1fe
164b73bcb',runtime=1436.057,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:54,628 140528773871088 INFO logging [-] 4a53007f-2a45-4148-ba3e-7054419ce30e - 201 3338 1243.3ms (content_length=3338,request_id='4a53007f-2a45-4148-ba3e-70544
19ce30e',runtime=1243.3,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:55,012 140528798483792 INFO logging [-] 72d48943-9846-4fc8-b6e9-9e7b7122bd87 - 201 4637 1289.648ms (content_length=4637,request_id='72d48943-9846-4fc8-b6e9-9e7
b7122bd87',runtime=1289.648,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,261 140528798488688 INFO logging [-] 006cb59d-44a0-4ba0-8669-6673f14eb49a - 201 3338 1436.544ms (content_length=3338,request_id='006cb59d-44a0-4ba0-8669-667
3f14eb49a',runtime=1436.544,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,345 140528773864016 INFO logging [-] 61ad23ad-199d-48c8-91bc-15d768368cea - 201 3338 1524.373ms (content_length=3338,request_id='61ad23ad-199d-48c8-91bc-15d
768368cea',runtime=1524.373,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,569 140528773871088 INFO logging [-] fda71d93-657c-4481-9a95-3eaeefd77e7f - 201 4637 1628.84ms (content_length=4637,request_id='fda71d93-657c-4481-9a95-3eae
efd77e7f',runtime=1628.84,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')
2018-09-05 17:17:56,901 140528773867472 INFO logging [-] f7421eb5-c465-4c9b-a09d-39729c345ede - 201 4637 1609.835ms (content_length=4637,request_id='f7421eb5-c465-4c9b-a09d-397
29c345ede',runtime=1609.835,remote_addr='127.0.0.1',status=201,method='POST',path='/v1/executions')

Previous runtime with the default st2api worker configuration (1 worker + 1 thread):

| + 5b8ed90cfa214313ae6b6e1a | voltron.with-items-test   | admin        | succeeded                | Tue, 04 Sep 2018 19:12:12 | Tue, 04 Sep 2018 19:37:54  |
|                            |                           |              | (1542s elapsed)          | UTC                       | UTC                        |

Current runtime with the scaled st2api worker configuration (10 workers + 1 thread):

| + 5b900d9b54f6151614f34514 | voltron.with-items-test   | admin        | succeeded (739s elapsed) | Wed, 05 Sep 2018 17:08:43 | Wed, 05 Sep 2018 17:21:02  |
|                            |                           |              |                          | UTC                       | UTC                        |
Kami commented 6 years ago

@jbrownsc

I'm 100% OK with "In the meantime, a not-so-great but still viable workaround would be to run more st2api processes and load-balance between them." Going to test now.

Yeah, this should help. Single-process performance is not great, but it scales horizontally, so running more workers (until you saturate all the available CPU cycles) should definitely help.

If you can easily do it, it would also be great if you could test the changes from https://github.com/StackStorm/st2/pull/4331. I wonder how much they increase performance in your specific scenario.

I can let you know once the PR is merged into master and new v2.9dev packages are built.

Kami commented 6 years ago

Today I tested the changes from #4331 using the latest v2.9dev packages and the workflows from https://github.com/StackStorm/st2/issues/4030#issuecomment-418495122 on an AWS t2.medium instance.

I initially tested with range 250, but then changed it to 50 so things complete faster (on v2.8.1 and a small AWS instance, range 250 simply takes way too long).
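
In other words, the with_items_test task from the workflow in that comment was run with a smaller range, e.g.:

      with_items_test:
        # reduced from range(0,250) so runs finish in a reasonable time on a t2.medium
        with-items: "blah in {{ range(0,50) }}"
        concurrency: 5
        action: voltron.with-items-test-subworkflow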

v2.8.1 stable

ubuntu@ip-172-31-24-123:/opt/stackstorm/packs/default$ st2 --version
st2 2.8.1, on Python 2.7.12

1) 1 st2api gunicorn worker process (default setup) - range 250

| + 5b96348bbe2a4332e2a2dca1 | default.with-items-test | st2admin     | succeeded                | Mon, 10 Sep 2018 09:08:27   | Mon, 10 Sep 2018 10:34:20 UTC |
|                            |                         |              | (5153s elapsed)          | UTC                         |                               |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

5153 seconds

2) 1 st2api process gunicorn worker (default setup) - range 50

|                            |                         |              | (5153s elapsed)          | UTC                         |                               |
| + 5b964a19be2a43669a8498d6 | default.with-items-test | st2admin     | succeeded (619s elapsed) | Mon, 10 Sep 2018 10:40:25   | Mon, 10 Sep 2018 10:50:44 UTC |
|                            |                         |              |                          | UTC                         |                               |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

619 seconds

v2.9dev

ubuntu@ip-172-31-18-251:~$ st2 --version
st2 2.9dev (c4db536), on Python 2.7.12

1) 1 st2api gunicorn worker process (default setup) - range 250

| + 5b96364c88b07d3318ad9041 | default.with-items-test | st2admin     | succeeded                | Mon, 10 Sep 2018 09:15:56   | Mon, 10 Sep 2018 09:51:44 UTC |
|                            |                         |              | (2148s elapsed)          | UTC                         |                               |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

2148 seconds

2) 1 st2api process gunicorn worker (default setup) - range 50

| + 5b96414d88b07d3318ad9bfc | default.with-items-test | st2admin     | succeeded (157s elapsed) | Mon, 10 Sep 2018 10:02:53   | Mon, 10 Sep 2018 10:05:30 UTC |
|                            |                         |              |                          | UTC                         |                               |

157 seconds

From those numbers it looks like the speed up is more than 50% for this particular use case (scheduling actions via Mistral / StackStorm API) and workflow.

A couple of things:

  1. Keep in mind that v2.8.1 uses the Mistral callbacks approach for getting Mistral results back into StackStorm. If you are using an older version which uses the polling approach, the results will differ.

  2. st2api should scale horizontally, so the throughput should also scale horizontally if there are spare CPU cycles and / or cores available (it's probably not perfectly linear scaling, but it should be close to it).

In my setup, scaling the st2api worker processes beyond 1 didn't make sense, because I used a small AWS instance and the CPU was already fully utilized by all the StackStorm services with a single st2api worker process (in fact, as expected, performance was worse, because it just saturated the already over-saturated CPU even more).

techdiverdown commented 3 years ago

Is there any update on this? We still see rather slow performance with MongoDB for large datasets.

Kami commented 3 years ago

#4846 should help substantially with this issue.