falconry / falcon

The no-magic web data plane API and microservices framework for Python developers, with a focus on reliability, correctness, and performance at scale.
https://falcon.readthedocs.io/en/stable/
Apache License 2.0
9.51k stars 937 forks source link

Add ability to cache the stream data #1746

Open CaselIT opened 4 years ago

CaselIT commented 4 years ago

It is sometimes useful to have the ability to use both the raw stream and the media in a request. Examples of this may be to implement a cache on graphql and similar apis that use POST to implement a querying endpoint. Something like

class CacheMiddleware:
  def process_resource(self, req, resp, resource, params):
    if not should_cache(req):
      return
    key = get_key(req.bounded_stream)
    if has_key(key):
      res.data = get_key(key)
      res.complete = True
    # How do the normal processing uses media here?

In the example above if the cache misses, there is transparent way of inspecting the data stream in without preventing use of Request.media.

To aid this use case we could implement the method get_cached_data(max_size) (or a similar name) in Request.

This method would:

The implementation would be something like

# in Request class
def get_cached_data(self, max_size: int):
  bs = self.bounded_stream
  if bs.stream_length > max_size:
    raise Exception # or return error
  if not bs.is_pristine:
    raise Exception # stream already read (not sure if that has a point, since one could have read from stream)
  stream = BytesIO(bs.read())
  bs.stream = stream
  bs._bytes_remaining = bs.stream_length
  return stream.getvalue()
open-collective-bot[bot] commented 4 years ago

Hi :wave:,

Thanks for using Falcon. The large amount of time and effort needed to maintain the project and develop new features is not sustainable without the generous financial support of community members like you.

Please consider helping us secure the future of the Falcon framework with a one-time or recurring donation.

Thank you for your support!

vytas7 commented 4 years ago

(Pasting my input from Gitter)

Related community requests:

Personally, the cleanest solution IMHO would be

If we choose this path, maybe we could also show examples how to override the default bounded reader with a no-op passthrough function, so that the raw Request.stream would be made available to media instead (to shave off a couple of CPU cycles).

Edit: While at a first glance this looks extremely straightforward to implement, one complication is that this would force us to define the minimum stream interface (for both WSGI and ASGI) required for such replacement readers. Which might be a good thing™ in itself though.

CaselIT commented 4 years ago

Rename Request.bounded_stream to something more fitting (yeah, naming is hard... maybe just Request.reader?)

Not really convinced by reader, since it looses the "stream" from the name. Other random alternatives reader_stream | stream_reader | wrapped_stream.

vytas7 commented 4 years ago

Indeed, stream_reader or body_reader would probably be most fitting.

CaselIT commented 4 years ago

After thinking a bit about this, I think that extending the stream is probably the most expandable option. I think that maybe we could provide a CachedStream alongside the BoundedStream. Regarding where this parameter would go, request_options is an alternative or maybe like the Context https://github.com/falconry/falcon/blob/113eea113844058816ed745d8dc9cb197e75cb95/falcon/request.py#L435 ?

vytas7 commented 4 years ago

I wrote request_options without thinking too much tbh. But it might indeed make more sense to follow the context_type pattern, and define something like body_reader_type.

CaselIT commented 4 years ago

body_reader_type

I still think stream should be in the name tbh.

ruimlopes commented 3 years ago

I stumbled on this limitation while developing an app that required me to store the raw request body it receives (falcon==2.0.0).

The workaround I found was to develop a middleware class that "adds" the property request.body.

However, this approach renders request.media unusable and impacts the long term maintenance of the application, a developer already familiar with the falcon framework would not be able to straightforwardly use part of the framework, and will find a property being used that is not documented globally. Also it forces me to connect to a data store in the middleware which I really do not like.

It is sometimes useful to have the ability to use both the raw stream and the media in a request.

Is it access the raw stream really the use case? I might have tunnel vision on my particular use case but what I felt missing is the access to the request body. If we phrase it like that, feels natural to have a request.body property.

I share my suggestion below (as a proof of concept as I didn't run it).

# Request class
def __handle_request_stream(self):
    handler = self.options.media_handlers.find_by_media_type(
        self.content_type,
        self.options.default_media_type
    )

    try:
        self._body = self.bounded_stream.read()

        self._media = handler.deserialize(
            BytesIO(self._body),
            self.content_type,
            self.content_length
        )
    finally:
        if handler.exhaust_stream:
            self.bounded_stream.exhaust()

def get_media(self):
    if self._media is not None or self.bounded_stream.eof:
        return self._media

    self.__handle_request_stream()

    return self._media

media = property(get_media)

def get_body(self):
    if self._body is not None or self.bounded_stream.eof:
        return self._body

    self.__handle_request_stream()

    return self._body

body = property(get_body)

I can contribute on this topic if help is accepted.

CaselIT commented 3 years ago

It is sometimes useful to have the ability to use both the raw stream and the media in a request.

Is it access the raw stream really the use case? I might have tunnel vision on my particular use case but what I felt missing is the access to the request body.

The goal is similar in allowing multiple accesses to the body. But I think that allowing a way to override how the stream behaves has the least impact on the behaviour of falcon. Meaning that you could retain the current logic as default, ie the stream is not cached, but also allow different behaviours if the user wishes.

Introducing a body would mean that the stream would be cached, and that may not be feasible in some cases.

vytas7 commented 3 years ago

Yes (well OK, I'm partial cause it was my suggestion :smiling_imp: ), the request body may be way too large to be stored in memory in the general case.

@ruimlopes Your approach, although admittedly handy for simple JSON cases, would also impair the multipart form handler (to be released in Falcon 3.0), and other custom handlers that people may be using to consume arbitrary streams without buffering everything at once (or buffer to disk, cloud or w/e but not to RAM).

ruimlopes commented 3 years ago

Introducing a body would mean that the stream would be cached, and that may not be feasible in some cases.

Yes (well OK, I'm partial cause it was my suggestion 😈 ), the request body may be way too large to be stored in memory in the general case.

How is that caching behavior ultimately different than what already happens for request.media? Indeed, there is a memory overhead by storing request.media and request.body together. A possible deviation would be to use request.body and request.media in a mutually exclusive way, solving the memory overhead. Also, I would say it would work for anything that implements BaseHandler, not only simple json cases.

For other users that might crash here with the same/similar use case, we will most likely upgrade our workaround to use a custom handler that simply reads the stream into request.media. We will do json.loads() when we need it for the first time.

From user perspective, one would expect most common use cases to be naturally available. It seems we have opposite understanding regarding the usage of the request body on that matter - we (the team I work in) are very used to naturally access it in other frameworks, so it's a common use case for us, which sadly is not available without custom code and added maintenance burden.

CaselIT commented 3 years ago

How is that caching behavior ultimately different than what already happens for request.media?

This mostly depends on the media handler. Like the multipart form handler may not cache all the stream.

But I agree with you, that it would be nice to have the ability to not lose the body once accessed, but I'm personally not sure if body is the best option here.

we (the team I work in) are very used to naturally access it in other frameworks

Here were you referring the body or the media? or both?

vytas7 commented 3 years ago

@ruimlopes, just purely practically, if we added support for plugging another reader factory type, would req.stream_reader.data or req.stream_reader.get_data() be so much inferior? Particularly if we shipped an optional cached reader class as part of Falcon.

I haven't explored all the ramifications, but my gut feeling is that building this directly into the standard Request object would get even hairier in the ASGI variant (i.e., the async version of Falcon).

vytas7 commented 3 years ago

Related https://github.com/falconry/falcon/pull/493#discussion_r40822962

vytas7 commented 3 years ago

@CaselIT I've probably missed an important detail, but how is this a breaking change?

CaselIT commented 3 years ago

I guess only if we enable this by default. but we could avoid it. I'll place this in 3.x then

vytas7 commented 3 years ago

Ah, I'm not advocating enabling any caching out of the box. But we could ship such an optional component as part of Falcon, alongside documentation how to easily enable that if needed. Probably something not too dissimilar from CORS middleware.