StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more, built for DevOps and SRE teams. It includes a rules engine, workflows, ChatOps, and 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org). Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

MongoDB sort memory limit exceeded when retrieving execution history #5947

Open machneit opened 1 year ago

machneit commented 1 year ago

SUMMARY

Hi. After updating StackStorm from 3.6 to 3.8, I ran into a sort memory limit issue with MongoDB when retrieving the execution history.

STACKSTORM VERSION

st2 3.8.0, on Python 3.8.13

OS, environment, install method

Single-box install: StackStorm 3.8, MongoDB 5.0.6, and RabbitMQ 3.8.8 are installed on a single Red Hat 8 server.

Steps to reproduce the problem

You can reproduce the problem with the two following commands: open mongosh, then run db.adminCommand({setParameter: 1, internalQueryMaxBlockingSortMemoryUsageBytes: 1}), and load the execution history.
This limits the allowed sort memory to 1 byte, which generates the exact same error I encountered.
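
For anyone who wants to script the repro, here is a minimal pymongo sketch of the same steps. The database and collection names ("st2", "action_execution_d_b") and the sort fields are assumptions based on a default single-box install and the discussion further down this thread, not on the actual st2 query code.

# Reproduce the sort-memory failure by capping the blocking-sort limit at 1 byte.
import pymongo
from pymongo.errors import OperationFailure

client = pymongo.MongoClient("mongodb://localhost:27017")

# Same as the mongosh adminCommand above.
client.admin.command({"setParameter": 1,
                      "internalQueryMaxBlockingSortMemoryUsageBytes": 1})

try:
    # Any sorted query that cannot be satisfied from an index now fails the
    # same way the history page does.
    cursor = client["st2"]["action_execution_d_b"].find().sort(
        [("start_timestamp", pymongo.DESCENDING), ("action.ref", pymongo.ASCENDING)]
    )
    list(cursor)
except OperationFailure as exc:
    # Expect code 292 / QueryExceededMemoryLimitNoDiskUseAllowed.
    print(exc.code, exc.details.get("codeName"))

# Restore the 100 MB default afterwards.
client.admin.command({"setParameter": 1,
                      "internalQueryMaxBlockingSortMemoryUsageBytes": 104857600})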

Expected Results

I expected to see the history.

Actual Results

Error encountered when accessing the history (see the detailed error message section of this issue). This is probably due to the way StackStorm gathers the execution history (actionexecutions.py, line 1135, in _get_action_executions -> return super(ActionExecutionsController, self)._get_all()).
Could this be fixed/optimized in the next StackStorm release? I found a temporary fix by tripling the value of the MongoDB internalQueryMaxBlockingSortMemoryUsageBytes parameter, but the history now takes more than a minute to load versus a couple of seconds before. This error never appeared in StackStorm 3.6, so it's probably related to a change in how StackStorm gathers the history.
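
For reference, here is a sketch of that temporary fix applied at runtime with pymongo (assuming MongoDB 4.4+, where the parameter carries this name). A runtime setParameter does not survive a mongod restart; a persistent mongod.conf approach comes up later in this thread.

# Read the current blocking-sort memory limit and triple it at runtime.
import pymongo

admin = pymongo.MongoClient("mongodb://localhost:27017").admin

current = admin.command({"getParameter": 1,
                         "internalQueryMaxBlockingSortMemoryUsageBytes": 1})
limit = current["internalQueryMaxBlockingSortMemoryUsageBytes"]  # default is 104857600 (100 MB)

admin.command({"setParameter": 1,
               "internalQueryMaxBlockingSortMemoryUsageBytes": limit * 3})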

Detailed error message:

2023-02-23 08:06:22,140 140164375924048 ERROR router [-] Failed to call controller function "get_all" for operation "st2api.controllers.v1.actionexecutions:action_executions_controller.get_all": Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/router.py", line 632, in call
    resp = func(**kw)
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/v1/actionexecutions.py", line 739, in get_all
    return self._get_action_executions(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/v1/actionexecutions.py", line 1135, in _get_action_executions
    return super(ActionExecutionsController, self)._get_all(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 244, in _get_all
    result = self.resources_model_filter(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 570, in resources_model_filter
    result = super(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 275, in resources_model_filter
    for instance in instances[offset:eop]:
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 110, in _iter_results
    self._populate_cache()
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 129, in _populate_cache
    self._result_cache.append(next(self))
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/base.py", line 1574, in next
    raw_doc = next(self._cursor)
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 1207, in next
    if len(self.data) or self._refresh():
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 1124, in _refresh
    self.send_message(q)
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 999, in send_message
    response = client._run_operation_with_response(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1368, in _run_operation_with_response
    return self._retryable_read(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1471, in _retryable_read
    return func(session, server, sock_info, slave_ok)
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1360, in _cmd
    return server.run_operation_with_response(
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/server.py", line 135, in run_operation_with_response
    _check_command_response(first, sock_info.max_wire_version)
  File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/helpers.py", line 164, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'} (_exception_class='OperationFailure', _exception_message="Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}", _exception_data={'_message': "Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}", '_error_labels': set(), '_OperationFailurecode': 292, '_OperationFailure__details': {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}, '_OperationFailure__max_wire_version': 13})

nzlosh commented 1 year ago

Hi,

Thank you for reporting this issue. At this time, MongoDB v5 is not supported by StackStorm. You will need to use MongoDB v4.4.

machneit commented 1 year ago

Hi, do you know when MongoDB v5 will be supported?

usulkies commented 1 year ago

I'm experiencing the same issue. Is there any possible fix or workaround? Is there a way to add the allowDiskUse:true option to the history query?
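
st2 itself does not appear to expose such an option, but for illustration this is roughly what allowDiskUse looks like at the driver level (PyMongo 3.11+ against MongoDB 4.4+; the collection name and sort fields are assumptions, not the actual st2 query):

# Illustrative only: a sorted find that opts in to spilling the sort to disk.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
executions = client["st2"]["action_execution_d_b"]

cursor = (
    executions.find({}, {"status": 1, "start_timestamp": 1, "action": 1})
    .sort([("start_timestamp", pymongo.DESCENDING), ("action.ref", pymongo.ASCENDING)])
    .allow_disk_use(True)  # lets the server use temporary files for the blocking sort
)
for doc in cursor.limit(10):
    print(doc)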

bertyah commented 1 year ago

Also experiencing this issue on Mongo 4.4. Seems like I ran an action with a large dataset, or I ran it too many times? Not sure. I can filter other actions and see them fine, but when I choose this specific action (or all actions) I get the error. I noticed that the stdout of these workflows is extremely large, so that's definitely what's causing the errors: the total result of multiple executions surpasses the MongoDB memory limit.

rubyjedi commented 1 year ago

Also seeing this with StackStorm 3.8 on Mongo 4.4. I had to roll back to StackStorm 3.6, which does not manifest this issue.

Is there a workaround or fix expected to resolve this?

Y4N4EL commented 11 months ago

Hi, I'm facing the same problem after an update from 3.6 to 3.8. Is there any fix planned?

arm4b commented 11 months ago

Thanks for the report.

Could this be a regression? If someone could dig deeper into the v3.8.0 changelog (https://github.com/StackStorm/st2/releases/tag/v3.8.0), identify the PR that caused it, and provide a PR to fix it, that would get it into the next release.

guzzijones commented 11 months ago

This happens on the history page with action executions that have very large output/input. As a temporary fix I set:

setParameter:
  internalQueryExecMaxBlockingSortBytes: 103554432

in /etc/mongod.conf
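
Worth noting: internalQueryExecMaxBlockingSortBytes is the pre-4.4 name of this parameter. On MongoDB 4.4 and later, the equivalent mongod.conf entry uses the name that appears in the error message above, for example (314572800 is simply three times the 104857600-byte default, mirroring the tripling mentioned earlier; pick a value that suits your deployment):

setParameter:
  internalQueryMaxBlockingSortMemoryUsageBytes: 314572800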

nzlosh commented 11 months ago

This might be useful to add to the troubleshooting documentation https://docs.stackstorm.com/troubleshooting/database.html

guzzijones commented 11 months ago

The default sort for the history page is -start_timestamp plus action.ref, yet there is no compound index covering it. The fix for the history page is to add a compound index to the ActionExecution model.
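
For illustration, the database-level equivalent of such a compound index might look like the pymongo sketch below (collection name assumed; this is not necessarily how the index would be declared on the st2 model itself):

# Build a compound index matching the default history sort, so MongoDB can
# return documents in (-start_timestamp, +action.ref) order (or its exact
# inverse) without an in-memory blocking sort.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
client["st2"]["action_execution_d_b"].create_index(
    [("start_timestamp", pymongo.DESCENDING), ("action.ref", pymongo.ASCENDING)]
)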

guzzijones commented 11 months ago

Here is the default history page sort code

RoyalBosS-Ayush commented 2 months ago

Facing same problem. Any solution?

usulkies commented 2 months ago

Facing same problem. Any solution?

The workaround for now is to enable the garbage collector by providing configuration for it. I set 3 days of retention for the job history, and so far it seems to be enough.
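
For anyone looking for that configuration, here is a sketch of the relevant section in /etc/st2/st2.conf (the TTL is expressed in days, the value below is only illustrative, and the st2garbagecollector service must be running for the purging to happen):

[garbagecollector]
# Purge action execution history older than the given number of days.
action_executions_ttl = 30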

ghost commented 2 months ago

happens with very large output/input action executions on

After appending these lines to that file, my MongoDB connection was lost.

fdrab commented 2 months ago

Facing same problem. Any solution?

The workaround for now is to enable the garbage collector by providing configuration for it. I set 3 days of retention for the job history, and so far it seems to be enough.

What's your setup / number of action or workflow executions per day? 3 days seems like a very short retention time.