StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Set hard limit for maximum API response size #4550

Open Kami opened 5 years ago

Kami commented 5 years ago

Right now we have no hard limits for API response size.

This means that if a user requests a lot of data (e.g. a `GET /v1/executions` API call with many very large executions in the database), it will take a long time for the API to convert the results from pymongo to mongoengine objects and then serialize those objects as JSON for end-user consumption.

It's very likely that this will also block the API gunicorn worker (DB reads are async, but object conversion and JSON serialization are not) and possibly cause a timeout. This is problematic because our default st2api configuration file starts the st2api server with a single gunicorn worker process.

In the short term we should do the following:

  1. Document this issue and its workarounds (spin up more st2api services, increase the number of gunicorn workers per service, increase the gunicorn worker timeout, switch to StackStorm >= v2.9.0 which includes the `?include_attributes` improvements so only the data which is needed is retrieved).
  2. Increase gunicorn worker timeout in a default st2api service manager file
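As a concrete illustration of those workarounds (the exact paths and values below are illustrative sketches, not verbatim from our packaging):

```shell
# Workaround: ask the API for only the fields you need
# (the ?include_attributes improvements require StackStorm >= v2.9.0)
curl -H "X-Auth-Token: $TOKEN" \
  "https://stackstorm.example.com/api/v1/executions?include_attributes=id,status&limit=10"

# Workaround: raise the gunicorn worker timeout for st2api.
# Wherever your service manager invokes gunicorn, pass a larger
# --timeout (the 120 below is an illustrative value):
#   gunicorn ... --timeout 120 ...
```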

In the slightly longer term, I propose the following changes:

  1. Enforce hard "maximum response" size limit for API responses.

This way we will prevent the gunicorn API process from ever being blocked for too long.

There are multiple ways to achieve that, but ideally we should enforce the limit before retrieving DB models from the database and before serializing them to JSON.

Doing it before serializing to JSON is easy (after we retrieve DB objects from the database). Doing it before pymongo does its internal dict conversion will be a bit more involved.

We will likely need to utilize MongoDB's `Object.bsonsize(db.<collection name>.find({query}))` operation to retrieve the number of bytes for a particular set of MongoDB documents in a collection.

Then we need to determine a "reasonable" limit which the API on an average-sized server can handle. If the resulting set is larger than this limit (perhaps also configurable in st2.conf), we should return `400 Bad Request` with a message along the lines of "Requested data set is too large to be processed, please limit the request size by using ?limit=<n>, ?include_attributes=<attributes> and other API query param filters."

This way we will ensure the StackStorm API process is never blocked for too long (you can also think of it as DDoS / rate-limiting protection).

To be on the safe side, we should also add a ?force=true or similar query parameter which admins can use to skip that check (at their own risk, of course, with a big warning in the docs).
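The limit-plus-override idea above can be sketched as follows. The option name (`max_response_size_bytes`), the exception type, and the error wording are all hypothetical; in a real implementation the byte count would ideally come from the database side (e.g. `Object.bsonsize()`) before any model conversion, not from serializing the result first.

```python
# Illustrative default; in StackStorm this would come from st2.conf,
# e.g. a hypothetical [api] max_response_size_bytes option.
MAX_RESPONSE_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB


class ResponseTooLargeError(Exception):
    """Would map to an HTTP 400 Bad Request in the API layer."""


def check_response_size(estimated_size_bytes, force=False,
                        limit=MAX_RESPONSE_SIZE_BYTES):
    """Raise if the estimated result set exceeds the configured limit.

    ``estimated_size_bytes`` would ideally be obtained before pymongo /
    mongoengine conversion, e.g. via MongoDB's Object.bsonsize() on the
    matching documents. ``force=True`` lets admins bypass the check at
    their own risk; a limit <= 0 means the check is disabled entirely.
    """
    if force or limit <= 0:
        return
    if estimated_size_bytes > limit:
        raise ResponseTooLargeError(
            "Requested data set is too large to be processed "
            "(%d > %d bytes). Please limit the request size by using "
            "?limit=<n>, ?include_attributes=<attributes> and other "
            "API query param filters, or pass ?force=true to override."
            % (estimated_size_bytes, limit)
        )
```

Making the limit configurable (and fully disableable) addresses the concern below about unpredictable execution sizes.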

  2. More WebUI improvements

There is still a lot of room in the WebUI to reduce the load on the StackStorm API and only retrieve the data it needs in a lazy / on-demand fashion.

  3. Try to offload (blocking) response JSON serialization to the eventlet thread pool

Serializing large objects to JSON can block. One way to prevent the main event loop from being blocked would be to offload those potentially blocking operations to the eventlet thread pool.

I tried this in the past (as a quick improvement), but it's sadly quite involved and I'm not even sure it's possible at all with our current code base (we would need to make the code thread-safe, which is very hard to do with all the mongoengine objects that cross-reference each other, etc.).
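A minimal sketch of the offloading idea, using the stdlib thread pool as a stand-in; in st2api itself the equivalent call would be `eventlet.tpool.execute(json.dumps, obj)`, which runs the blocking serialization in eventlet's native thread pool so the green-thread event loop can keep serving other requests:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stand-in for eventlet's tpool: run blocking CPU-bound serialization
# in a worker thread instead of on the main loop.
_pool = ThreadPoolExecutor(max_workers=4)


def serialize_offloaded(obj):
    """Serialize ``obj`` to JSON in a worker thread.

    Note the thread-safety caveat from the comment above: the object
    graph being serialized must not be mutated concurrently, which is
    hard to guarantee with cross-referencing mongoengine documents.
    """
    return _pool.submit(json.dumps, obj).result()
```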

nmaludy commented 5 years ago

I'm a little concerned about this from an unpredictability standpoint. Many times I don't know how big my execution data is (and it can vary).

On our instances we disable (or set to unlimited) the size checking for both NGINX and Mistral to avoid this problem; see https://forum.stackstorm.com/t/allowing-large-results-in-mistral-workflows/139

As long as there is a way to disable the size checking in the config, we'll be OK.

Kami commented 5 years ago

Yeah, there will likely be a way to disable it in the config and change the limit.

This will also be less of an issue in the future for Orquesta workflows since results won't be sent back to StackStorm via the API anymore (same goes for retrieving data which is done directly via DB layer in workflow engine service).

So it will really just be protecting against potentially malicious users and bad clients which don't utilize ?include_attributes, etc.

nmaludy commented 5 years ago

Perfect, then I'm good with it!

I'm also thinking from a "debugging" perspective via the st2 CLI, when looking at execution history and inspecting workflows... that stuff may come through the API. I think clear error messages, plus a command-line flag to disable the size checks at runtime, would also help:

Example / mock-up of an error message:

$ st2 execution get xyz123
ERROR: Size of the execution results is larger than the configured maximum API response size (10MB). To change this, please edit [api] max_response_size_bytes in the StackStorm config /etc/st2/st2.conf, or pass --no-size-limit to the st2 CLI to disable it for a single command.

Kami commented 5 years ago

Yeah the error definitely needs to be very explicit and user-friendly :)

On a related note - I'm also researching some related performance improvements.

arm4b commented 5 years ago

I'm :-1: on the approach itself of setting the limits.

Adding workarounds like that signals that there is another root cause we have to solve in a different way. From @Kami's detailed description it looks like there are many solutions to the problem. I just think that a limit is the most draconian approach for users.

bigmstone commented 5 years ago

The biggest problem I see is exceeding the gunicorn timeout. That is completely silent to the user (from a log perspective); you must run gunicorn in the foreground to get the notice that it has timed out. Even if all of our code is reasonably performant, it is still possible to craft a request that exceeds that limit, e.g. on a large deployment, requesting a large number of objects with no filtering.

So failing with proper feedback is necessary there, i.e. returning a 408 instead of a 500 or a client disconnect as it does now.

One fear I have is how long it will take Mongo to respond to bsonsize - it might be negligible, but it's still a concern for me at the moment.

In the end I agree w/ @armab to a point. I think we should get to the bottom of why a reasonably sized request takes so long to serve. Right now this is most acutely felt with Mistral in the mix, but I suspect mongoengine is playing a role as well. I don't know that hard size limits matter much, since we are already bound by a timeout... which is another way of measuring a request's size. I just think that when we exceed that time limit we have to handle it better.

I tried to verify the above statement... from the gunicorn docs:

> Generally set to thirty seconds. Only set this noticeably higher if you're sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.

So perhaps we're not as "bound" by that timeout as it seems, but it's hard to know for sure without some experimentation.
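For reference, the timeout being discussed is gunicorn's `--timeout` flag; a minimal sketch of how it is set (the module path and values are illustrative, not taken from st2's actual packaging):

```shell
# With sync workers, --timeout hard-kills a worker that takes longer
# than this on a single request; with async worker classes such as
# eventlet it only requires the worker process to keep heartbeating,
# so a single slow request does not necessarily trip it.
gunicorn st2api.wsgi:application \
    --worker-class eventlet \
    --workers 1 \
    --timeout 120   # default is 30 seconds
```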