falconry / falcon

The no-magic web data plane API and microservices framework for Python developers, with a focus on reliability, correctness, and performance at scale.
https://falcon.readthedocs.io/en/stable/
Apache License 2.0
9.49k stars 933 forks source link

Request: offer a way to validate `get_media()` as part of decoding the stream data to Python #2202

Open onecrayon opened 7 months ago

onecrayon commented 7 months ago

Currently there is no clean way (using standard internal Falcon methods) to validate and decode incoming data within the Request stream in a single step. This can matter because libraries such as Pydantic v2 and msgspec include native validation at the time of deserialization/decoding which generally outperforms converting from JSON to a Python dict and then reading the dict into an object (with validation). Pydantic v2, in particular, does not offer any way that I know of to convert from a JSON string into a dict in any performant way (it always assumes you are converting into a Pydantic schema).

Arguably, the best way to go about this currently is to use custom middleware that does an end-run around get_media() completely. For instance:

import msgspec

class ValidatedMedia:
    def process_resource(self, req, resp, resource, params):
        # Fetch the expected msgspec struct somehow; here we assume it is a class property named like `PatchSchema`
        schema_class = getattr(resource, f'{req.method.title()}Schema', None)
        req.context.validated_media = msgspec.json.decode(req.stream.read(), type=schema_class)

It would be nice if there were some natively-supported method for doing this, though, without needing to police the project to make sure all views use req.context.validated_media (in this example) instead of the more standard req.get_media(). Unfortunately, it is not currently possible to achieve this with a custom Handler class alone, because the handler only receives the stream contents, content type, and content length from get_media().

I don't have any specific suggestions for how this could best be implemented in Falcon. When I was originally looking through the Falcon source with an eye towards making this work, I considered allowing get_media() to pass arbitrary keyword-arguments through to the handler (which would allow something like req.get_media(type=schema_class) within the view logic itself to pass a msgspec schema straight through to a custom Handler class). However, this introduces indeterminate return types for get_media() based on whatever the custom Handler may be doing, which arguably is even less desirable than needing to rely on a Middleware-defined property.

vytas7 commented 7 months ago

Thanks for the discussion on Gitter, and taking time to write down a very nice summary here!

I don't know what the best way is either, but I agree it would be nice to streamline this scenario. Adding kwargs to req.get_media() might be problematic since the current kwargs do not alter the way media is deserialized as they only control the behaviour wrt missing media, i.e., the caller can implement the same effect by catching exceptions without having any other prior knowledge. As media is cached upon first access, another middleware component that runs earlier might even bypass validation if we added kwargs that control representation per the above suggestion.

CaselIT commented 7 months ago

However, this introduces indeterminate return types for get_media() based on whatever the custom Handler may be doing, which arguably is even less desirable than needing to rely on a Middleware-defined property.

I don't think this is a particular issue, since the multipart handler does returns a custom object https://falcon.readthedocs.io/en/stable/api/multipart.html

As media is cached upon first access, another middleware component that runs earlier might even bypass validation if we added kwargs that control representation per the above suggestion.

this is the biggest problem I see of such a solution.

I considered allowing get_media() to pass arbitrary keyword-arguments through to the handler (which would allow something like req.get_media(type=schema_class) within the view logic itself to pass a msgspec schema straight through to a custom Handler class)

I think if we were to go with this system then we get_media should have a kwarg that's passed as kwarg to the handler. so this example would be something like req.get_media(handler_kwargs=dict(type=schema_class))

vytas7 commented 7 months ago

@CaselIT how would kwargs help in the case of cached media? Or would we buffer the stream somehow, and invalidate upon different kwargs?

CaselIT commented 7 months ago

They would be used only when first calling get_media. I don't think it's a good change to replace caching that we have in place

CaselIT commented 7 months ago

Honestly it would not make for a great api since this would be allowed

a_foo = req.get_media(handler_kwargs=dict(type=Foo))
user_wants_a_bar_but_gets_a_foo = req.get_media(handler_kwargs=dict(type=Bar))

Of course this is an edge case that's unlikely to be realistic, but it's still not great.

I honestly don't know how to make it better

vytas7 commented 7 months ago

Agreed, I don't think it is easy to incorporate this into the current API.

We could make broader changes to the framework to afford this scenario, but that of course requires more thought. Options include:

onecrayon commented 7 months ago

What else?

I don't know if this makes much sense, but one option that comes to mind is that instead of having get_media() pass kwargs through in some way, there could be some officially-sanctioned method for setting the media cache explicitly in the instance that multiple layers might be trying to access the stream/get_media. Not sure of the performance implications, but you might have a signature like:

def set_media_cache(cached: Union[dict, Object, Callable])

If you pass in a dict or other object, it just gets assigned to the media cache, and returned next time anything calls get_media(). If you pass in a callable, it gets executed (and the results stored in the cache) the next time get_media() is called.

So my middleware that wants to validate the stream might do something like:

try:
    validated_media = msgspec.json.decode(req.stream.read(), type=schema)
    req.set_media_cache(partial(msgspec.structs.todict, validated_media)
except WhateverGetsThrownWhenStreamIsAlreadyConsumed:
    validated_media = msgspec.convert(req.get_media(), type=schema)

req.context.validated_media = validated_media