Resource serialization/deserialization to/from representation media types

kgriffs commented 11 years ago

How to do serialization/deserialization is a common question from the community. How can we make this work out of the box?

Things to decide:

Should this functionality be implemented in Talons as middleware for falcon>=0.2 ?
Alternatively, should this be implemented directly in the framework?
What media types should be supported?
In the case of JSON, should the implementation support other UTF encodings besides UTF-8?
How will the solution support versioned / custom media types?

kgriffs commented 10 years ago

Consider doing this in Talons instead...

kgriffs commented 9 years ago

In the duplicate issue #42, @lichray has a good comment. Reproducing here since #42 has been closed:

There is one question before this: do we support non-UTF-8 JSON? According to RFC 4627, UTF-16/32 LE/BE JSON are valid. If we support them, then we need to handle encoding in auto serialization (and deserialization, possibly). I know the requests library does it, FYI.
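As a point of reference for the encoding question, `json.loads()` in Python 3.6+ accepts bytes and detects UTF-8, UTF-16 and UTF-32 input by itself. A minimal sketch, assuming the whole body has already been read into memory (`decode_json_body` is a hypothetical helper name, not part of Falcon):

```python
import json


def decode_json_body(raw: bytes):
    """Decode a JSON request body given as raw bytes.

    json.loads() (Python 3.6+) accepts bytes and detects UTF-8,
    UTF-16 and UTF-32 input on its own, so no charset is passed here.
    """
    return json.loads(raw)


payload = {"name": "falcon"}
encodings = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Encode the same document in each encoding and decode it again.
decoded = [decode_json_body(json.dumps(payload).encode(enc)) for enc in encodings]
```

Whether a framework should rely on this detection or insist on UTF-8 remains a policy decision; the sketch only shows that the decoding side can be cheap.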

juanqui commented 9 years ago

Might also be a good idea to consider integration with a library such as http://marshmallow.readthedocs.org/en/latest/ -- by integration here, I probably mean a 'plugin' sort of like a talon, to facilitate integration. It would help provide functionality similar to that of django-rest-framework and django-tastypie (both of which I have used extensively). Just an idea!

I currently write before/after hooks that validate the "content_type" and do the serialization/deserialization for me. In my particular use case, I actually use both XML and JSON. I set up these hooks at the API level, rather than at the Route/View level.

Great framework btw! I jumped into it yesterday and I've already written three small services, two of which are in pre-production and working great.

kgriffs commented 9 years ago

Adding comment from @sebasmagri from duplicate issue #324:

I'd like to think about a Content Type handlers approach which would allow devs to get and return native objects without worrying about the content type in responders, and implement generic deserializers and serializers for each content type they want to be able to handle.

sebasmagri commented 9 years ago

I still believe the Content Type Handler approach would make sense as in:

Providing a standard interface to implement and register content type handlers, as well as the elements they could access (params, files) and the parameters they could get (just as for middlewares)
To provide at least two standard/reference implementations in the core. They could be JSON and Forms (form data, multipart) if they don't introduce new dependencies. Handlers with third party dependencies could be included in Talons.

Upon these two premises we could add more features like being able to use a content-type handler globally or for specific endpoints (like for image processing or custom chat protocols), probably by inheritance or reuse of some logic from middlewares.
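A minimal sketch of what such a handler interface and registry could look like, using only the standard library. All class and method names here (`ContentTypeHandler`, `HandlerRegistry`, `deserialize`, `serialize`) are assumptions made for illustration; nothing in this thread fixes them.

```python
import json
from abc import ABC, abstractmethod
from urllib.parse import parse_qs, urlencode


class ContentTypeHandler(ABC):
    """Hypothetical interface every content type handler would implement."""

    @abstractmethod
    def deserialize(self, raw: bytes):
        """Turn a raw request body into a native object."""

    @abstractmethod
    def serialize(self, obj) -> bytes:
        """Turn a native object into a raw response body."""


class JSONHandler(ContentTypeHandler):
    def deserialize(self, raw):
        return json.loads(raw.decode("utf-8"))

    def serialize(self, obj):
        return json.dumps(obj).encode("utf-8")


class FormHandler(ContentTypeHandler):
    def deserialize(self, raw):
        # parse_qs returns a dict that maps each key to a list of values.
        return parse_qs(raw.decode("utf-8"))

    def serialize(self, obj):
        return urlencode(obj, doseq=True).encode("utf-8")


class HandlerRegistry:
    """Maps media types to handler instances."""

    def __init__(self):
        self._handlers = {}

    def register(self, media_type, handler):
        self._handlers[media_type.lower()] = handler

    def find(self, content_type):
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return self._handlers.get(media_type)


registry = HandlerRegistry()
registry.register("application/json", JSONHandler())
registry.register("application/x-www-form-urlencoded", FormHandler())
```

A per-endpoint override could then be a second registry consulted before the global one, though that is only one of several possible designs.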

kgriffs commented 9 years ago

I like the idea of being able to register Content-Type handlers (with JSON available out of the box). But I am reluctant to overload existing Request and Response properties since this may lead to some confusion in practice. One suggestion from @BenjamenMeyer on the mailing list was to add additional well-named properties, such as 'req.json' to expose the serialization mechanism. But how would we extend this model to arbitrary Internet media types per @sebasmagri's suggestion? Perhaps we might add a generic attribute to the Request and Response classes, e.g.: req.media and resp.media.

One concern I have with this last approach is that req.media would consume the input stream. We would need to decide if that is OK or if it would lead to violating the Principle of Least Surprise (this was problematic in the past with URL-encoded form POSTs).

BenjamenMeyer commented 9 years ago

@kgriffs per the req.json and resp.json method...perhaps we could have something that would "register" a handler against the Request and Response objects, for instance:

req = falcon.Request(...)

setattr(req, 'json', json_request_reader)

resp = falcon.Response()
setattr(resp, 'json', json_response_writer)

My only concern here would be figuring out how to handle the property get/set aspects that use the same attribute name to do two different things. Though we could probably avoid that by saying you can only read from the request and write to the response and just leave them as functions, like with Python Requests.

The registration could be handled similarly to adding routes:

self.app.add_media_handler('application/json', json_request_reader, json_response_writer)

tudborg commented 9 years ago

One of the things I like about falcon is that it tries to not do stuff I don't explicitly tell it to do. Not guessing how my input stream should be handled is definitely one of the strengths of Falcon (actually the main selling point for my initial use of falcon). However, I totally agree that this is a common use, and providing some nice serialization tools is probably a must.

I'll need to give this way more thought, but here is my initial suggestion. I suggest adding methods (or properties, maybe?) to Request and Response to handle deserialization and serialization (lazy). The methods/properties should be named and documented to explicitly consume and decode the request body, and try to deserialize it according to the Content-Type header. (Being strict is important. Starting to guess at content types that conflict with the passed Content-Type header is a total mess and should definitely be avoided.)

Imo, consuming the input stream is okay as long as it is explicit from the user. If you need the input stream decoded for some use-case, you probably don't need the raw thing anyway. We should probably store the deserialized thing on the request though (which is why a property makes sense to me).

I can't think of a better name than you, @kgriffs, so I'll stick to media.

On Request

media is a read-only property that inspects the Content-Type header and matches it against some global (probably stored on the API object) matcher structure (well, the matching should probably be done on the global registry, acting as a router).
- If no match is found, an error is raised.
- If a match is found, the headers and stream (maybe just the Request?) are handed to the matching decoder function. It either returns some data structure or raises an error (probably one of a few predefined ones so we can handle some common errors, like body too big, decode error, etc.). The returned structure is returned to the media property and is cached there.
Setting media raises an error
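The request-side behaviour described above can be sketched as follows. `SketchRequest` is a stand-in written for this example, not Falcon's Request class, and the decoder table is passed in directly instead of living on the API object.

```python
import io
import json


class UnsupportedMediaType(Exception):
    """No decoder is registered for the request's Content-Type."""


class SketchRequest:
    """Stand-in request object used only for this illustration."""

    def __init__(self, content_type, stream, decoders):
        self.content_type = content_type
        self.stream = stream
        self._decoders = decoders
        self._media = None
        self._media_loaded = False

    @property
    def media(self):
        # Read-only: there is no setter, so assignment raises AttributeError.
        if not self._media_loaded:
            media_type = (self.content_type or "").split(";", 1)[0].strip().lower()
            try:
                decoder = self._decoders[media_type]
            except KeyError:
                raise UnsupportedMediaType(media_type) from None
            # Consumes the stream exactly once; the result is cached.
            self._media = decoder(self.stream.read())
            self._media_loaded = True
        return self._media


decoders = {"application/json": json.loads}
req = SketchRequest("application/json", io.BytesIO(b'{"ping": 1}'), decoders)
first = req.media   # consumes the stream and decodes it
second = req.media  # served from the cache, the stream is not touched again
```

The lookup is strict: an unknown or missing Content-Type raises instead of guessing.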

On Response, same thing, just the other way:

If media is set, try to encode it according to the Accept header, with some sane fallback logic.
Setting media multiple times does nothing. Encode happens at the very end of the Response life-cycle.

Errors in encode/decode should raise and be handled globally with the correct status codes for the content related errors. Most of these errors have specific error codes.

Most of the responsibility here is delegated to the encoder/decoder objects we need to be able to register globally somewhere, which I think is the only way to go. Most users outside of the common "deserialize this JSON, serialize this to JSON" case will probably need to write custom serializers anyway, so I think we should prioritize making adding custom serializers easy.

On a related note: For the input stream, we could add an intermediate body property that just contains the raw body as read from the stream. That way, when you handle errors you can actually get the body that caused the error. The issue here is ofc that body-size could be giant, etc, so we once again would have to discuss how to handle this (and allow that handling to be customized as it differs from use-case to use-case what is really the "best" way). A sane default might be: Keep up to 1MB in memory, if the stream grows beyond this, roll over to temp disk storage (maybe using https://docs.python.org/3.4/library/tempfile.html#tempfile.SpooledTemporaryFile).
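A minimal sketch of that default, assuming the request stream is a file-like object with `read()`. The helper name `buffer_body` and the chunk size are made up for the example.

```python
import io
import tempfile

MAX_IN_MEMORY = 1024 * 1024  # 1 MB, the default suggested above


def buffer_body(stream, chunk_size=64 * 1024, max_in_memory=MAX_IN_MEMORY):
    """Copy a request stream into a spooled buffer and rewind it.

    Data stays in memory until it exceeds max_in_memory, after which
    SpooledTemporaryFile transparently rolls over to a temporary file.
    """
    spooled = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        spooled.write(chunk)
    spooled.seek(0)
    return spooled


# Tiny limit so that the rollover path is exercised by a 10-byte body.
body = buffer_body(io.BytesIO(b"x" * 10), max_in_memory=4)
data = body.read()
body.close()
```

The caller owns the returned buffer and has to close it, which also removes the temporary file if one was created.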

I'm also up for doing some heavy lifting on this feature.

tudborg commented 9 years ago

Alternatively (and maybe more falcony), we could provide all the parts but some assembly required. An example might explain it best:

class MyResource(object):
    def on_get(self, req, resp):
        req.media   # Proxy object for unserializing
        resp.media  # Proxy object for serializing

        req.media.type   # Detected from Content-Type (None if unknown)
        resp.media.type  # Detected from Accept (default type if unknown)

        # Ping / Pong example
        data = req.media.body   # consumes stream, unserializes, result is cached
        resp.media.body = data  # Like setting resp.body, but serializes it

app = falcon.API()
app.add_route('/ping', MyResource())

# serializers should handle their format in both directions
json = JSONSerializer(max_body=1024*1024)
app.add_serializer('application/json', json)
app.add_serializer('application/*+json', json)

# Advanced edition
class MyStreamable(object):
    def on_get(self, req, resp):
        #  Also note that for some formats, the ability to
        #  read one "unit" at the time, and write one unit,
        #  might be desirable.
        #  This would allow us to easily wrap a normal serializer
        #  In a format that can be streamed (like having a json serializer
        #  inside a LineSerializer, or similar)
        #  Would make Server Sent Events a breeze to implement.
        while True:
            message = req.media.read()
            if not message:
                break
            resp.media.write(message)
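One way the wildcard registration above ('application/*+json') might be resolved is shell-style pattern matching from the standard library. This is only a sketch: `SerializerTable` is a hypothetical name, and fnmatch also gives special meaning to `?` and `[`, which would need care in a real implementation.

```python
from fnmatch import fnmatchcase


class SerializerTable:
    """Ordered table of (pattern, serializer) pairs; first match wins."""

    def __init__(self):
        self._entries = []

    def add(self, pattern, serializer):
        self._entries.append((pattern.lower(), serializer))

    def lookup(self, content_type):
        # Ignore parameters such as "; charset=utf-8".
        media_type = content_type.split(";", 1)[0].strip().lower()
        for pattern, serializer in self._entries:
            if fnmatchcase(media_type, pattern):
                return serializer
        return None


table = SerializerTable()
table.add("application/json", "json-serializer")
table.add("application/*+json", "json-serializer")

exact = table.lookup("application/json; charset=utf-8")
suffixed = table.lookup("application/vnd.api+json")
missing = table.lookup("text/plain")
```

Strings stand in for serializer objects here to keep the example short.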

wshayes commented 9 years ago

I definitely agree with the Principle of Least Surprise. I also like the transformability I've seen so far in Falcon. Ben Meyer's idea of registering new attributes on the Request/Response objects seems like the most flexible and powerful approach while still making it easy to work with content types in a semi-automatic function with expected magic. As long as there was a good cookbook with some examples for that, I have no problems assembling it.

Some really good thinking going on here and overall with Falcon - thank you all so much - very nice platform.

richardolsson commented 9 years ago

EDIT: Accidentally posted prematurely and have updated the comment with full message.

In the service I'm developing (JSON only) we're using the extended Request/Response approach along with a middleware. I wasn't entirely happy with doing so though, because inheritance is always limited, but I needed somewhere to assign the object to be serialized. Basically, our Response classes have a "value" field and a middleware then encodes it (along with an envelope):

class SerializationComponent(object):
    def process_response(self, request, response, resource):
        if hasattr(response, 'value'):
            response.body = json.dumps({
                'data': response.value
            })

I'm not sure why we couldn't just create a middleware that checks content type in process_request() and process_response() and encodes the data accordingly. The only problem is where to store the decoded and to-be-encoded values, which would need some sort of new property on Request and Response respectively.
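A sketch of that middleware idea, with the decoded and to-be-encoded values kept in per-object context dicts. The request and response here are plain stand-in objects, and the attribute names `raw_body` and `context["media"]` are assumptions made for the example.

```python
import json
from types import SimpleNamespace


class JSONMediaComponent:
    """Middleware-style component; values are exchanged via context dicts."""

    def process_request(self, req, resp):
        # Decode only when the declared content type is JSON.
        if req.content_type == "application/json" and req.raw_body:
            req.context["media"] = json.loads(req.raw_body)

    def process_response(self, req, resp, resource):
        # Encode whatever the responder left in the response context.
        if "media" in resp.context:
            resp.body = json.dumps(resp.context["media"])


# Stand-in objects with hypothetical attributes, not Falcon's classes.
req = SimpleNamespace(content_type="application/json", raw_body=b'{"n": 1}', context={})
resp = SimpleNamespace(body=None, context={})

component = JSONMediaComponent()
component.process_request(req, resp)
resp.context["media"] = {"echo": req.context["media"]}  # what a responder might do
component.process_response(req, resp, resource=None)
```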

smcclstocks commented 9 years ago

I'll post my implementation as well to see if it can further the conversation along. Admittedly, I have no clue if my implementation will scale well into the future or work for others but it does for me & my customers.

I have a middleware object for auto selecting the serializer & deserializer based on a table lookup of content-type matches (deserializer) & accept header preferences (serializer). I then basically proxy the selected Deserializer & Serializer objects deserialize() & serialize() methods directly on the request & response objects. I sub-class the falcon request/response objects for this reason among others.

The serializer selection looks like this:

class Middleware(object):
    """ Serializer middleware object """

    def process_resource(self, req, resp, resource):
        """ Process the request after routing.

        Serializer selection needs a resource to determine which
        serializers are allowed.
        """

        if resource:
            mimetypes = [s.MIMETYPE for s in serializers.SERIALIZERS]
            preferred = req.client_prefers(mimetypes)
            serializer = serializers.get_serializer(preferred)

            if not serializer:
                abort(exceptions.RequestNotAcceptable)

            elif serializer not in resource.allowed_serializers:
                abort(exceptions.SerializerNotAllowed)

            else:
                resp.serializer = serializer()

I then have a list of allowed_serializers on my base resource which can be overridden by other resources. Imagine a file upload resource vs standard CRUD resource. In this case the resp.serializer property is actually the instantiated Serializer object that I can then do the following with in my resource:

class Resource(BaseResource):
    """ Single item resource & responders """

    allowed_deserializers = [deserializers.JSONAPIDeserializer]

    def on_get(self, req, resp, uuid):
        """ Find the model by id & serialize it back """

        model = find(uuid)

        resp.last_modified = model.updated
        resp.location = url_for_rtype(model.rtype, model.uuid)

        resp.serialize(model)

This is nice because I don't have to know the serializer at all in the resource or model. The resp.serialize method will be called on the already selected serializer without having to know or care. That means each one of my serializers has to understand the incoming object to be serialized (model in this case). What I actually do, but cut out of the example, is resp.serialize(model.to_rest()), which normalizes the model for any serializer to have.

This includes things like filtering certain fields, pagination, etc so each serializer doesn't have to know how to do that. All the serializer needs to know how to do is take a normalized data structure & ensure it is RFC or some other spec compliant payload including headers & whatever else may need to be done. In the case of JSONAPI it needs to reconstruct the data structure a bit to ensure compliance but that's the role of the serializer in my app.
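The split between normalization and serialization can be illustrated with a small sketch. The `Model` class, its fields and both serializer classes are invented for the example; only the `to_rest()` name and the `MIMETYPE` attribute come from the description above.

```python
import csv
import io
import json


class Model:
    """Hypothetical model; to_rest() returns a plain, filtered dict."""

    def __init__(self, uuid, name, secret):
        self.uuid = uuid
        self.name = name
        self.secret = secret

    def to_rest(self):
        # Field filtering happens here, once, for every serializer.
        return {"uuid": self.uuid, "name": self.name}


class JSONSerializer:
    MIMETYPE = "application/json"

    def serialize(self, rows):
        return json.dumps(rows)


class CSVSerializer:
    MIMETYPE = "text/csv"

    def serialize(self, rows):
        out = io.StringIO()
        if rows:
            writer = csv.DictWriter(out, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        return out.getvalue()


rows = [Model("abc", "demo", "hidden").to_rest()]
as_json = JSONSerializer().serialize(rows)
as_csv = CSVSerializer().serialize(rows)
```

Neither serializer knows anything about `Model`; both only see the normalized list of dicts.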

smcclstocks commented 9 years ago

My motivation for this by the way is to support file uploads, csv, json, & jsonapi depending on the resources that are being accessed. All of my crud resources support csv & jsonapi while some more specialized aggregation or utility interfaces that only my web app may use are json because the data model doesn't fit with the jsonapi spec at all.

johnlinp commented 7 years ago

How about using auto_parse_json request option just like auto_parse_form_urlencoded? When Falcon supports more media types, just add more request options for them.

kgriffs commented 7 years ago

FWIW, a few add-ons have cropped up to address this in terms of JSON:

I think it would make sense to have something basic that you get out of the box, and then either make it extensible or replaceable by 3rd-party add-ons for advanced use cases.

swistakm commented 7 years ago

I see that this feature is constantly pushed to next versions without any agreement on how it could be designed. I was postponing the implementation of content negotiation in my graceful project (see swistakm/graceful#13).

Since I am doing some major redesigns in my project I decided to give this problem a try and experiment a bit with different approaches. Here are my thoughts.

Initial assumptions

I made some assumptions about what the ideal solution would look like from the perspective of my project. Still, I tried to keep the general idea simple and generic, and I think that this perspective would be valuable for other higher-level frameworks like hug.

1. Problem scope

It would be best if we could tackle three problems at once:

Deserialisation of input data sent by the user (request body).
Content negotiation using Accept header.
Serialisation of output data (response body).

The reasons for doing all of this within a single solution are pretty obvious. APIs that accept input data and return some data must deal with both serialisation and deserialisation. Most libraries for handling various data formats provide symmetric interfaces e.g. json.loads() vs. json.dumps() and yaml.load() vs. yaml.dump().

Problem 1. is the simplest to deal with: we have to look at Content-Type header and if we support that content type we can try to deserialise data. Problems 2. and 3. are strongly linked. Content negotiation should drive the data serialisation.
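To make problems 2. and 3. concrete, here is a simplified sketch of Accept-based negotiation. It orders media ranges by q-value only; the precedence rules for more specific ranges and media type parameters are deliberately left out, so this is an illustration and not a complete implementation. All function names are hypothetical.

```python
def parse_accept(header):
    """Return (media_range, q) pairs sorted by descending q."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0].lower()
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def matches(media_range, media_type):
    """Check one media type against one range such as 'application/*'."""
    if media_range == "*/*":
        return True
    range_type, _, range_subtype = media_range.partition("/")
    type_, _, subtype = media_type.partition("/")
    return range_type == type_ and range_subtype in ("*", subtype)


def negotiate(accept_header, supported):
    """Pick the first supported type acceptable to the client, or None."""
    for media_range, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for media_type in supported:
            if matches(media_range, media_type):
                return media_type
    return None


supported = ["application/yaml", "application/json"]
chosen = negotiate("application/yaml;q=0.5, application/json", supported)
```

The returned media type would then select the serializer, and a `None` result would map to a 406 Not Acceptable response.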

2. Global but optional mechanism

From perspective of my own project it would be best to have one solution that can be configured once for whole application without the need to specify content negotiation settings for every resource separately. This should make sense for most applications: if your API speaks in some format on one resource endpoint it should speak that format on every served resource.

This of course holds true only for generic formats for structured data like JSON, XML, YAML, MessagePack etc. Sometimes you will have to serve or read something that does not represent structured data (i.e. dictionaries, lists, etc.). Best examples would be serving images or any other binary data: such content still can be negotiated but inputs and outputs are not the general data structures.

It means that even if the developer has defined content type handling mechanisms globally, the actual content negotiation mechanism should still be used explicitly within resource code. Like in @tbug's approach: you can use request.media but don't have to.

3. limit API changes and reuse existing code

Assumption 2. may suggest that the best place for defining content negotiation is the API class. That's probably true, but I believe it is possible to come up with a solution that can be introduced gradually in order to limit changes in the API. I think that if designed wisely, it could even be polyfilled or backported to older falcon versions. Such an approach would be great for higher-level frameworks built on top of falcon. Even if the new feature is not released anytime soon, other projects can vendor part of the future code or prepare their own polyfill. The only thing we need is to agree on some interface that will be used in the future.

4. Leave as much of responsibility as possible to the framework user.

As always, there are some decisions that need to be made and they are never easy. For instance:

Should JSON serialisation accept charsets other than UTF-8?
Should content handlers read/write the whole body or rather support streams? Should either approach be mandatory?
If not using streams should there be any default limits imposed on processed body size (for safety)?
What formats to support?
Should content negotiation also support extra options like for instance indentation?

IMO from the framework's perspective the best approach is to simply avoid making such decisions. If the content negotiation layer is pluggable, the user can very easily decide what to do and how to support extra formats.

Of course we can provide some implementation of chosen content type handler as a reference and to make falcon easier to use by newcomers (JSON seems like the most obvious choice). User can always decide to use his own implementation or to extend/override existing one.

Discussion of few proposed ideas

This thread is already long and others proposed some interesting ideas. I have experimented with few of them to see what are their pros and cons.

Custom serialisation libraries like marshmallow

This is unfortunately completely different type of serialisation. Libraries like marshmallow just translate objects between different domains but do not perform content type serialisation/deserialisation.

Middleware based approach

Middleware was proposed by a few people and it seems like the least invasive approach. Middleware components are optional by nature, can be easily extended, and work globally. Still, I don't like the idea of monkey-patching the request object as proposed by @smcclstocks. Since we have __dict__ in Request's __slots__ it is now possible, but it still seems like a dirty workaround. This may be fine for custom middleware in a user's application code, but I think that the framework should avoid monkey-patching its own core objects as it creates surprises. Another way around is to leverage the context attribute. On the other hand, contexts should be user-defined objects and no one expects to find extra data there.

My main objection to middleware in this situation is that middleware, in my opinion, should be something optional even for application logic. Middleware components are great for caching or authorisation. In most cases resources will work no matter whether they are registered with or without custom middleware. This of course is not the case for providing database connections to resources, and middleware is a standard way in many frameworks of providing such objects.

Also, @smcclstocks's solution assumes that there is some global serializers registry, and that would be very problematic, especially if someone would like to provide their own serializers registry implementation or some imported package registered its own serializers without the user's knowledge.

@BenjamenMeyer's idea of per-type content handlers

As @kgriffs already mentioned, it pollutes Request's and Response's namespaces and also does not help in content negotiation with the Accept header at all. Also, naming would be a bit confusing. For instance: does req.json() really suggest that the function reads a JSON string and returns a dictionary?

Still, the idea of registering content type handlers via the API object looks good because it stays in line with the current interface design.

@tbug's second idea for api with req.media and resp.media attributes

In my opinion it is almost perfect design from the perspective of falcon user:

covers whole scope of the problem: serialization/deserialization and content negotiation (assumption 1.)
it is fully optional because it leaves the decision whether to use req.media and resp.media attributes to the falcon user. Still, the definition of possible content handlers is global for the application (assumption 2.)
leaves a lot of responsibility and control to the user. Content-type handlers can be custom user classes using a well defined interface. Also gives potential to decide on how to handle streams inside of resource code or content handler code. It may even allow for chunked-encoded responses.
It should require a relatively small amount of changes and none of them should be backwards incompatible. It is only a single attribute in Request and Response classes.

The only problem is that it directly ties the initialisation of the Response object to the Request object. The resp.media attribute needs to know the Accept header value or at least the result of req.client_prefers(). Without that data the Response object cannot resolve the content type handler, and all of this belongs to the Request domain. I'm afraid that this cannot be easily changed without breaking backwards compatibility. This is due to the possibility of passing custom request and response classes: we should assume that existing users' classes do not expect any extra initialisation arguments. I have an idea how to bypass this problem and I will discuss this later.

I would also not worry that request stream consumption would be a surprise for the user. Note that:

using req.media would be optional.
proper naming (e.g. req.media.consume_stream()) should be enough.
content type handlers can have dual API, both for consuming streams and strings. This would allow to e.g. store raw request body in variable for later use.
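A sketch of such a dual API, modelled on the load/loads and dump/dumps naming of the standard json module. The class name and method set are assumptions made for illustration.

```python
import io
import json


class DualJSONHandler:
    """Handler usable with either raw bytes or file-like streams."""

    def loads(self, raw: bytes):
        return json.loads(raw.decode("utf-8"))

    def load(self, stream):
        # Consumes the stream; callers that need the raw body should
        # read it themselves and use loads() instead.
        return self.loads(stream.read())

    def dumps(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def dump(self, obj, stream):
        stream.write(self.dumps(obj))


handler = DualJSONHandler()

# Keep the raw body around for later use (e.g. error reporting).
raw_body = io.BytesIO(b'{"id": 7}').read()
payload = handler.loads(raw_body)

# Or let the handler consume a stream directly.
streamed = handler.load(io.BytesIO(b'{"id": 7}'))

out = io.BytesIO()
handler.dump(payload, out)
```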

Possible solutions and proof of concepts.

In my opinion @tbug's idea is a good direction. The only problem is this Response initialisation. I have two ideas for how to resolve this.

Handling content negotiation only through the request.media attribute

Since the Request object is the only one that knows content type preference of the user agent it could make sense to have this new media attribute only in Request object. The good point is that such approach can be implemented without any changes in falcon core only through custom request classes:

import json
import yaml

from falcon import Request
from falcon import errors

class JSONHandler:
    def read(self, request):
        return json.loads(request.stream.read().decode('utf-8'))

    def write(self, data, response):
        response.body = json.dumps(data)

class YAMLHandler:
    def read(self, request):
        # safe_load avoids constructing arbitrary Python objects from input
        return yaml.safe_load(request.stream.read().decode('utf-8'))

    def write(self, data, response):
        response.body = yaml.dump(data)

def request_factory(handlers):
    class Media:
        def __init__(self, request):
            self.content_type = request.content_type
            self.client_prefers = request.client_prefers(handlers.keys())

        def read_from_request(self, request):
            if self.content_type not in handlers:
                raise errors.HTTPUnsupportedMediaType

            return handlers[self.content_type].read(request)

        def write_to_response(self, data, response):
            if self.client_prefers is None:
                raise errors.HTTPNotAcceptable

            handlers[self.client_prefers].write(data, response)

    class MediaRequest(Request):
        @property
        def media(self):
            return Media(self)

    return MediaRequest

Note that JSONHandler and YAMLHandler here are very simple as I assume that their actual implementation is out of scope of this feature.

Configuration even without any dedicated support in the API class is very simple:

api = application = falcon.API(request_type=request_factory(
    {
        'application/yaml': YAMLHandler(),
        'application/json': JSONHandler(),
    }
))

Unfortunately the actual usage is a bit unsatisfying:

class Echo:
    def on_post(self, req, resp):
        payload = req.media.read_from_request(req)
        req.media.write_to_response(payload, resp)

It is very short and simple but does not look intuitive at all. It will work for projects like graceful or hug where users almost never work with raw req & resp objects. Maybe it could be improved by choosing proper names, but it will never be as intuitive and expressive as @tbug's original approach.

Allowing to delegate req and resp initialisation

The backwards incompatibility problem of extended Response object initialization can be reduced in time by providing the additional function in API class that would allow to affect how new req and resp instances are initialised:

class API():
    ...

    def create_req_resp(self, request_type, response_type, env, req_options, resp_options):
        req = request_type(env, options=req_options)
        resp = response_type(options=resp_options)
        return req, resp

    ...

Then this feature could be introduced gradually over next releases:

introduce req/resp initialisation hook
introduce media attribute to the Request and Response objects
change the Response initialisation signature

The additional advantage of this approach is that it improves framework extensibility and may open the way for some other use cases. Falcon users could experiment with req/resp initialisation and decide over time if @tbug's approach really makes sense. From the perspective of higher-level frameworks, the first step is simply enough to provide this style of content negotiation.

Also we can go even further and allow to provide req/resp initialisation function as a new API class keyword argument to avoid the need for subclassing.
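A sketch of how such a keyword argument might look. Everything here is a stand-in: `SketchAPI`, `SketchRequest`, `SketchResponse` and the `create_req_resp` keyword are hypothetical and only mirror the signature proposed above.

```python
class SketchRequest:
    """Stand-in for a request class (hypothetical)."""

    def __init__(self, env, options=None):
        self.options = options
        self.accept = env.get("HTTP_ACCEPT", "*/*")


class SketchResponse:
    """Stand-in for a response class; its signature stays unchanged."""

    def __init__(self, options=None):
        self.options = options
        self.accept = None  # filled in by the hook, if any


def default_create_req_resp(request_type, response_type, env, req_options, resp_options):
    req = request_type(env, options=req_options)
    resp = response_type(options=resp_options)
    return req, resp


def linked_create_req_resp(request_type, response_type, env, req_options, resp_options):
    # Same construction as the default, then hand the client's
    # preference to the response without changing its __init__.
    req, resp = default_create_req_resp(
        request_type, response_type, env, req_options, resp_options)
    resp.accept = req.accept
    return req, resp


class SketchAPI:
    def __init__(self, create_req_resp=default_create_req_resp):
        self._create_req_resp = create_req_resp

    def new_req_resp(self, env):
        return self._create_req_resp(SketchRequest, SketchResponse, env, None, None)


api = SketchAPI(create_req_resp=linked_create_req_resp)
req, resp = api.new_req_resp({"HTTP_ACCEPT": "application/json"})
```

Because the hook runs after both objects exist, existing custom request and response classes would keep working with their current constructors.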

BenjamenMeyer commented 7 years ago

@swistakm read through your proposal...quite interesting and very good write-up. Thanks!

One thing I would caution - let's not try to be too Smart about this. One reason I initially got introduced to Falcon was b/c the Pecan + WebOb framework we were using was being too smart and didn't allow us to do what we wanted to do as it had its own ideas, etc on how things should work which were not necessarily true in all cases - for instance, we had an API that took in JSON and at times generated JSON back, but at times it should just return a 201/204 Status; Pecan made that impossible - if we told it the API took in JSON we had to spit out a JSON response and a 200 Status - the JSON response ended up being []. Let's not make that kind of mistake in Falcon.

So I do favor a more decentralized approach of having the deserialization/serialization in the Request/Response objects but leave it to the actual framework users to decide whether or not to use them - whether or not to honor the ACCEPT Header, and whether or not input of one type necessarily means output of that same type - IOW, Response knows nothing about the Request aside from what the developer tells it.

That said, it should be easy for the Developer to make the interconnection between the Request and Response objects if they wanted to enforce that functionality and be extremely strict about it - but that's a dev's choice, not the framework's choice - and given Falcon's tagline of Falcon is a very fast, very minimal Python web framework doing too much would probably go against that.

$0.02 on it.

jmvrbanac commented 7 years ago

A lot of good ideas on this issue. This is a tough issue to address in a flexible enough way to solve a good percentage of use-cases. Over the next couple weeks, I'm going to be using a lot of the feedback and ideas here to help put together a solution. Keep your eyes out for a PR.

kgriffs commented 7 years ago

Everyone, please take a look at https://github.com/falconry/falcon/pull/1050 and share your thoughts.

BenjamenMeyer commented 7 years ago

@kgriffs overall I like it; I'd move away from the default value and add the abstract class for the handlers, but it looks promising.
