StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Storing large files in action results appears to be limited by MongoDB's 16 MB document limit #3600

Open · dead10ck opened 7 years ago

dead10ck commented 7 years ago

I am evaluating StackStorm as a replacement for an internal tool we use. Our primary motivation is being able to distribute actions across multiple machines by running several action runners. Our use case requires that certain actions be able to access the data that other actions generated, so I hoped that we could store the action data directly in the action result that StackStorm stores in MongoDB; however, it looks like action results are restricted by MongoDB's 16 MB limit on documents. For example, if you run the core.http action against a URL that returns a large file, the action will fail:

root@b0024465368b:/# st2 execution get 59691e1266941d00f54a7283
id: 59691e1266941d00f54a7283
status: failed
parameters: 
  url: http://foo.com/some_large_file
result: 
  error: command document too large
  traceback: "  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/worker.py", line 132, in _run_action
    result = self.container.dispatch(liveaction_db)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/container/base.py", line 68, in dispatch
    action_db=action_db, liveaction_db=liveaction_db)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/container/base.py", line 131, in _do_run
    raise e
"

Since our use case also involves running actions on any of several machines, I'm not sure if it would work to just store the file somewhere on the action runner's file system, since other action runners would need access to it.

Some workaround ideas include storing the data in an external shared store and passing a reference to it in the action result, or indexing it in a separate database keyed by execution (a sketch of the reference-passing pattern follows below).

Both of these workarounds require actions to make assumptions about where and how shared data gets stored, and require additional actions that delete old data when action executions age out.
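
For illustration, here is a minimal sketch of the reference-passing workaround, assuming every action runner mounts the same shared filesystem. The mount path, action name, and result shape are hypothetical, not st2 conventions:

import os
import uuid

import requests

from st2common.runners.base_action import Action

# Assumption: an NFS share (or similar) mounted at the same path on every runner.
SHARED_ROOT = "/mnt/st2-shared"

class FetchLargeFile(Action):
    def run(self, url):
        dest = os.path.join(SHARED_ROOT, str(uuid.uuid4()))
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        # Only this small reference lands in the MongoDB result document;
        # downstream actions read the data from the shared path themselves.
        return {"path": dest}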

It would be great if StackStorm could handle large action payloads transparently. MongoDB's GridFS was introduced specifically to handle files larger than 16 MB, so maybe that is a viable option.
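
To make the idea concrete, a minimal GridFS sketch using pymongo might look like this; the database name and the result_ref field are assumptions for illustration, not part of the existing st2 schema:

import gridfs
from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["st2"]  # assumed database name
fs = gridfs.GridFS(db)

def save_large_result(execution_id, payload_bytes):
    # GridFS splits the payload into ~255 kB chunks under the hood,
    # so it is not subject to the 16 MB per-document limit.
    file_id = fs.put(payload_bytes, filename=str(execution_id))
    # The execution document would then hold only this small reference.
    return {"result_ref": str(file_id)}

def load_large_result(result_ref):
    return fs.get(ObjectId(result_ref)).read()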

vincent-legoll commented 6 years ago

I've hit the same limitation. Here is a similar traceback from my log files:

'result': '{"traceback": "  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/worker.py", line 159, in _run_action
    result = self.container.dispatch(liveaction_db)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/container/base.py", line 88, in dispatch
    liveaction_db=liveaction_db
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/container/base.py", line 144, in _do_run
    liveaction_db = self._update_status(liveaction_db.id, status, result, context)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/container/base.py", line 299, in _update_status
    raise e
", "error": "command document too large"}'

This is a very annoying limitation: we have long-running commands that produce lots of output, and we want that output stored in StackStorm's execution history...

bigmstone commented 6 years ago

It's possible to store stdout and stderr as binary data, but this wouldn't fix all cases. I'm not currently convinced ST2 should handle this case.

My current view is that if an ST2 action needs to store something large, it should utilize its own store for this. I could be convinced otherwise, but I don't see it at the moment.

Tapping @armab and @Kami for their opinion.

dead10ck commented 6 years ago

@bigmstone I talked about it a bit in the issue description, but the rationale for handling this is that any use case involving data larger than 16 MB (which could be as simple as downloading a text log file from a server) requires the user to work around the limit with strategies like external indexing, which pollutes every action in your workflow and adds a lot of cruft. Without handling this case, ST2 is simply not a good fit for any use case that involves a non-trivial amount of data.

arm4b commented 6 years ago

It's also interesting that MongoDB's previous hard limit was 4 MB (https://jira.mongodb.org/browse/SERVER-431) before it was raised to 16 MB.

More background on the limitation: https://docs.mongodb.com/manual/reference/limits/#BSON-Document-Size. One possible solution is GridFS, which can store more than 16 MB: https://docs.mongodb.com/manual/core/gridfs/#when-to-use-gridfs. Another way is to split the document ourselves and store it in several chunks if it exceeds the size limit (a rough sketch of that idea follows). Obviously this needs research and an understanding of how such a change would affect st2, and whether it's acceptable or not.
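
A rough sketch of the manual-chunking idea, assuming pymongo and a hypothetical result_chunks collection (not part of the current st2 schema):

import math

CHUNK_SIZE = 15 * 1024 * 1024  # stay safely under the 16 MB BSON limit

def store_result(db, execution_id, result_bytes):
    # Write the oversized result as several ordered documents.
    total = math.ceil(len(result_bytes) / CHUNK_SIZE)
    for seq in range(total):
        db.result_chunks.insert_one({
            "execution_id": execution_id,
            "seq": seq,
            "data": result_bytes[seq * CHUNK_SIZE:(seq + 1) * CHUNK_SIZE],
        })
    return total

def load_result(db, execution_id):
    # Reassemble the chunks in order.
    cursor = db.result_chunks.find({"execution_id": execution_id}).sort("seq", 1)
    return b"".join(doc["data"] for doc in cursor)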

We're, however, open to open-source contributions for "corner case" issues like this, while we work on fixing other critical bugs and adding new StackStorm features per our roadmap.

dfsutherland commented 8 months ago

This is a huge limitation. My use case includes large log files, HTTP responses containing many hundreds of thousands of data items, and other large-ish data. Nothing large enough to be considered "Big Data," but easily large enough to exceed 16 MB. Even a not-too-smart implementation using GridFS would be a great help.