StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Writes to MongoDB in actionrunner are slow and CPU intensive with large documents #4798

Status: Closed (jdmeyer3 closed this 3 years ago)

jdmeyer3 commented 4 years ago

SUMMARY

When an action returns ~4 MB of data, the actionrunner takes ~20 seconds to write each document to MongoDB. During the writes, CPU utilization spikes to 100%.

STACKSTORM VERSION

st2 v3.1.0

OS: CentOS 7.7.1908
Kernel: 3.10.0-957.el7.x86_64
Kubernetes: v1.14.1
Docker: 19.03.2
Base Docker Image: CentOS 7.6.1810 (custom image)

Steps to reproduce the problem

In a Python 3 pack, create a Python action with the following code:

#!/usr/bin/env python3

import json

import requests

from st2common.runners.base_action import Action


class RequestLargeDataset(Action):
    """
    Retrieves ~4 MB of JSON data and returns it as the action result.
    """

    def run(self, **kwargs):
        response = requests.get(
            'https://data.ct.gov/api/views/rybz-nyjw/rows.json?accessType=DOWNLOAD')
        data = json.loads(response.text)
        return data
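
As a quick sanity check outside StackStorm, a standalone sketch like the one below (my addition, not part of the pack; it assumes network access to data.ct.gov) can confirm the payload really is in the ~4 MB range before looking at the database layer:

#!/usr/bin/env python3

import requests

# Fetch the same dataset the action uses and report its size.
response = requests.get(
    'https://data.ct.gov/api/views/rybz-nyjw/rows.json?accessType=DOWNLOAD')
print('HTTP status:', response.status_code)
print('Payload size: %.1f MB' % (len(response.content) / 1024.0 / 1024.0))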

Expected Results

Low CPU utilization, with the result returned within a relatively short time (<5 seconds)

Actual Results

CPU spiked to 100% for 20-30 seconds

[Screenshot: CPU utilization graph showing the spike]

Adding some logging around `st2actions/container/base.py:296`:


    def _update_status(self, liveaction_id, status, result, context):
        try:
            LOG.error("updating liveaction")
            LOG.debug('Setting status: %s for liveaction: %s', status, liveaction_id)
            liveaction_db, state_changed = self._update_live_action_db(
                liveaction_id, status, result, context)
            LOG.error("done updating liveaction")
        except Exception as e:
            LOG.exception(
                'Cannot update liveaction '
                '(id: %s, status: %s, result: %s).' % (
                    liveaction_id, status, result)
            )
            raise e

        try:
            LOG.error('updating execution')
            executions.update_execution(liveaction_db, publish=state_changed)
            LOG.error('done updating execution')
            extra = {'liveaction_db': liveaction_db}
            LOG.debug('Updated liveaction after run', extra=extra)
        except Exception as e:
            LOG.exception(
                'Cannot update action execution for liveaction '
                '(id: %s, status: %s, result: %s).' % (
                    liveaction_id, status, result)
            )
            raise e

        return liveaction_db

I'm getting logs like:

2019-10-02 18:17:23,854 ERROR [-] updating liveaction
2019-10-02 18:17:28,963 ERROR [-] done updating liveaction
2019-10-02 18:17:28,964 ERROR [-] updating execution
2019-10-02 18:17:33,268 ERROR [-] done updating execution

In the run above, updating the liveaction takes ~5 seconds and updating the execution takes another ~4 seconds.

Additionally, I've run cProfile against the actionrunner. [Screenshot: cProfile output]

It looks like the CPU is spending most of its time building the mongoengine document object. This may be related to https://github.com/MongoEngine/mongoengine/issues/1230
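
To make that hypothesis measurable, here is a minimal micro-benchmark sketch (my addition, not StackStorm code; LiveActionStub is a made-up stand-in model with a single DictField, and the pymongo insert assumes a local mongod on the default port) that separates mongoengine's per-field document conversion from a raw pymongo write:

import time

import mongoengine
import pymongo

# Made-up stand-in for st2's liveaction model: one large dict field.
class LiveActionStub(mongoengine.Document):
    result = mongoengine.DictField()

# Roughly 4 MB of nested data, mimicking a large action result.
big_result = {'rows': [{'id': i, 'value': 'x' * 100} for i in range(40000)]}

start = time.time()
LiveActionStub(result=big_result).to_mongo()  # per-field conversion only, no I/O
print('mongoengine conversion: %.2fs' % (time.time() - start))

start = time.time()
pymongo.MongoClient().bench_db.raw.insert_one({'result': big_result})
print('raw pymongo insert:     %.2fs' % (time.time() - start))

If the conversion step dominates the raw insert, the bottleneck is in the ODM layer rather than in MongoDB itself, which is consistent with the cProfile output above.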

Kami commented 4 years ago

Thanks for reporting this.

There were indeed multiple mongoengine issues related to writing large documents in the past.

Over the releases, performance has improved in mongoengine itself and we've also done some performance optimizations on our side, but there could be more issues hiding (it would need digging into).

In the past, we also discussed skipping the mongoengine serialization / conversion layer altogether and using pymongo directly in critical code paths, but that would require quite a few code changes.
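
For illustration, bypassing the ODM on a hot path might look something like this sketch (my addition, not actual st2 code; the collection and field names are hypothetical):

import pymongo

def update_liveaction_result_fast(client, liveaction_id, status, result):
    # Hypothetical hot-path helper: write the result directly with pymongo,
    # skipping mongoengine's per-field document conversion entirely.
    collection = client['st2']['live_actions']  # collection name is illustrative
    collection.update_one(
        {'_id': liveaction_id},
        {'$set': {'status': status, 'result': result}},
    )

# Usage sketch:
# client = pymongo.MongoClient()
# update_liveaction_result_fast(client, some_object_id, 'succeeded', large_result)

The trade-off is that raw pymongo writes give up the validation and schema handling the ODM provides, which is part of why this would require substantial code changes.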

Kami commented 3 years ago

I know this issue was opened quite a long time ago, but the problem hasn't gone away, and I finally had a chance to work on #4846 and have made good progress on it.

Initial micro-benchmarks and end-to-end load tests are very promising: we see database write speed improvements of up to 10x, and up to 5x for reads, when writing / reading execution results. All of that, of course, also translates to much faster action execution and workflow completion times for executions / workflows that operate on larger datasets.

Some more data and numbers are available here: https://github.com/StackStorm/st2/pull/4846#issuecomment-781629337, https://github.com/StackStorm/st2/pull/4846#issuecomment-782131035, https://github.com/StackStorm/st2/pull/4846#issuecomment-782837693.

I hope there will be no last-minute surprises and we will be able to include those changes in the v3.5.0 release.

Kami commented 3 years ago

Oh, and when I tested the fix end to end with a Python runner action that returns 4 MB of data (very similar to your example scenario), the whole action execution with the new code now takes ~1 second on my computer vs ~12 seconds with current master.

This includes the whole action execution flow end to end: the execution being scheduled, the action run by the Python runner, and all the state persisted to the database (action execution, live action objects, etc.). The actual database writes for those two models are now in the ~200 ms range.

amanda11 commented 3 years ago

As per the comments above, this should be much improved in 3.5.0.

arm4b commented 3 years ago

I really think we can close this now, given the https://github.com/StackStorm/st2/pull/4846 performance improvements by @Kami.

@jdmeyer3 Feel free to re-open it if you're still seeing unacceptable performance with st2 v3.5.0.