aws-powertools / powertools-lambda-python

A developer toolkit to implement Serverless best practices and increase developer velocity.
https://docs.powertools.aws.dev/lambda/python/latest/
MIT No Attribution
2.88k stars 396 forks source link

Support Python native dataclass APIs in Event Source Data Classes utility #1177

Closed LouisAmon closed 2 years ago

LouisAmon commented 3 years ago

What were you initially searching for in the docs?

I'm using the @event_source decorator and I was looking for a way to print my event in my Cloudwatch logs.

When I simply print(event), I just get a class instance, but I want to be able to see the actual keys and values :)

Being used to working with Python dataclasses, I naturally tried asdict:

print(event.asdict())

Or even...

import dataclasses

dataclasses.asdict(event)

...and I was very surprised by the result:

asdict() should be called on dataclass instances

Is this related to an existing part of the documentation? Please share a link

https://awslabs.github.io/aws-lambda-powertools-python/latest/utilities/data_classes/

Describe how we could make it clearer

Dataclasses are meant to be a Pythonic way to define a data model.

If you define your own class (DictWrapper in your case) which inherits from nothing then all the APIs and principles behind dataclasses are lost.

Dataclasses are used in libraries such as sqlalchemy or strawberry to define a unified way to communicate with a database or build a GraphQL API... dataclasses provides unity between all these, so why not AWS Lambda ? :)

If you have a proposed update, please share it here

I think we should either revamp the code to actually define proper data classes, or stop calling it as such : it's just confusing.

What do you think ?

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! We'll come back to you as soon as we can.

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! We'll come back to you as soon as we can.

heitorlessa commented 3 years ago

Hey @LouisAmon thanks for raising this. We have created awslabs/aws-lambda-powertools-python#346 to rewrite the entire documentation on dataclass utility - naming is hard. This next release is 80% dedicated to that doc.

This Friday, I'll start revamping that doc and lay down ambitions for v2 - that includes changing dataclasses to more prescriptive so we don't break anyone in v1.

If you have ideas for naming I'd love to hear them so we can improve everyone's experience. If we could have a better name by the time the new docs are completely rewritten, then I could easily create code aliases and redirect docs altogether

Thank you!

LouisAmon commented 3 years ago

Well to be honest I think dataclasses are indeed a best practice in Python, especially when it comes to data modeling and typing

May I ask why the project didn't go into actually using the @dataclass decorator ?

I guess now that a huge amount of code is in place, you can't just add @dataclass on top of DictWrapper... but still, have you tried ?

I know some projects have encountered limitations with the way Python's standard dataclasses are meant to work. For example Pydantic ended up defining their BaseModel and their own @dataclass decorator, but overall they tried to stay consistent with the Python API

Maybe the right way for AWS Lambda Powertools would be to piggyback on Pydantic when it comes to validation (using their @dataclass), or using the regular Python decorator for "normal" data modelling

Another solution would be to define your own @dataclass, but then also make sure that the regular APIs (such as asdict()) are available

TBH I think Pydantic adds a great deal of value to Lambda payload processing, although it is a bit heavy as a dependency :)

michaelbrewer commented 3 years ago

@LouisAmon i added this original as a helper for the event classes, and over time it started to include more features like handling decoding and then decorators for specific use cases.

The objective was to work with Python 3.6 and not have any additional dependencies. So @dataclasses is not available in Python 3.6 without a backport. I am sure once we deprecate Python 3.6 support, we can strip down the code alot.

So we could create a lightweight version of @dataclass with asdict(), for now there is raw_event that you can use if you want the dict.

michaelbrewer commented 3 years ago

We (@heitorlessa and @cakepietoast) did discuss the naming of this in the original PR (https://github.com/awslabs/aws-lambda-powertools-python/pull/159#issuecomment-694342213). I am not sure any name would have worked :). But we tried our best.

Maybe a warning note or something docs could help reduce any potential confusion, and then adding asdict() method, could be another alias of raw_event property.

Otherwise i am pretty proud of how many people this has helped. The next features that are related to this is some common Exception types needed for API Gateway Proxy events, and then better integration with the Idempotent decorator.

LouisAmon commented 3 years ago

We (@heitorlessa and @cakepietoast) did discuss the naming of this in the original PR (#159 (comment)). I am not sure any name would have worked :). But we tried our best.

Don't get me wrong: you guys did an outstanding job overall, thanks for that !

Maybe a warning note or something docs could help reduce any potential confusion, and then adding asdict() method, could be another alias of raw_event property.

Maybe, but it feels to me like a half-measure : either you're working with a dataclass with the full API or you're simply working on a data model (which is very fine, many SQLA / Django users are familiar with that term and what it means)

For example instead of:

from aws_lambda_powertools.utilities.data_classes import event_source, APIGatewayProxyEventV2

@event_source(data_class=APIGatewayProxyEventV2)
def lambda_handler(event: APIGatewayProxyEventV2, context):
    if "helloworld" in event.path and event.http_method == "POST":
        do_something_with(event.json_body, event.query_string_parameters)

I'd rather see:

from aws_lambda_powertools.utilities.models import event_source, APIGatewayProxyEventV2

@event_source(model=APIGatewayProxyEventV2)
def lambda_handler(event: APIGatewayProxyEventV2, context):
    if "helloworld" in event.path and event.http_method == "POST":
        do_something_with(event.json_body, event.query_string_parameters)

That's my opinion anyway :)

LouisAmon commented 3 years ago

Overall, if your concern is to stick with concepts that exist in Python 3.6 then maybe it's best to avoid the dataclasses terminology altogether no ?

My personal suggestion would be to be technically compatible with Python 3.6, yet bring them one step into the future (so to speak) by using a dataclasses backport

With the backport in place, you can also use the dataclasses in Pydantic, and stick to an all-dataclass terminology and API throughout the project

It would make even more sense to make that change regarding Envelopes and the parser feature : the base model could be defined as a dataclass and the envelope would be a dataclass as well !

Having a base model defined as a data class allows you to do cool stuff like generating fake data on the fly or defining a default_factory, in a way that's consistent with other open source libraries

My two cent on the matter 😄

michaelbrewer commented 3 years ago

We (@heitorlessa and @cakepietoast) did discuss the naming of this in the original PR (#159 (comment)). I am not sure any name would have worked :). But we tried our best.

Don't get me wrong: you guys did an outstanding job overall, thanks for that !

Maybe a warning note or something docs could help reduce any potential confusion, and then adding asdict() method, could be another alias of raw_event property.

Maybe, but it feels to me like a half-measure : either you're working with a dataclass with the full API or you're simply working on a data model (which is very fine, many SQLA / Django users are familiar with that term and what it means)

For example instead of:

from aws_lambda_powertools.utilities.data_classes import event_source, APIGatewayProxyEventV2

@event_source(data_class=APIGatewayProxyEventV2)
def lambda_handler(event: APIGatewayProxyEventV2, context):
    if "helloworld" in event.path and event.http_method == "POST":
        do_something_with(event.json_body, event.query_string_parameters)

I'd rather see:

from aws_lambda_powertools.utilities.models import event_source, APIGatewayProxyEventV2

@event_source(model=APIGatewayProxyEventV2)
def lambda_handler(event: APIGatewayProxyEventV2, context):
    if "helloworld" in event.path and event.http_method == "POST":
        do_something_with(event.json_body, event.query_string_parameters)

That's my opinion anyway :)

Which part is different? If we make changes now it would have breaking changes for existing people.

michaelbrewer commented 3 years ago

dataclasses

Yep, it was originally event sources, and it could be models, but it is data_classes so not exactly dataclasses ;-)

It is quick too refactor, but it would involve breaking changes for all the library users.

So maybe for V2 when Python 3.6 is deprecated, and whatever this is call moves up a level.

So note that this was my first python contribution coming from a Kotlin/Java background, so forgive the lack of pythonic terms used.

note, as a contributor and user of the library, I do like that this feature does not introduce any additional dependencies or slows down the lambda cold starts.

michaelbrewer commented 3 years ago

@heitorlessa - I would propose we switch back to what I had originally suggested as event_sources and frame it as classes, utilities and helpers for the AWS standard events. And we can target this for V2 (and when Python 3.6 lambdas are deprecated), and then we can potentially minimize the amount of code.

LouisAmon commented 3 years ago

I agree with you both that making a breaking change just for syntactic changes is not worth it.

I would also propose to make the big change in v2, but at that point : why not consider actually using Python dataclasses ? I mean, since 3.6 is gonna be dropped and all...

By the way, you can even keep the 3.6 support with the backport... and keep the old syntax with a DeprecationWarning for a few versions... :)

Shall we close the ticket ? I realise I should've opened a discussion thread instead 🙏

michaelbrewer commented 3 years ago

@LouisAmon this library is supposed to run on Python 3.6 and higher, but you are welcome to make a PR with your proposed changes

heitorlessa commented 3 years ago

@LouisAmon - Thank you for your strong opinions and suggestions, we appreciate your enthusiasm to help us make everyone's experience better.

I'd like to clarify a few things as I turn this issue into a feature request and move forward with other higher priorities tasks.

Dataclasses not being added in day one. Back then, VSCode and PyCharm didn't have intellisense support for them resulting in either false positives or no autocompletion at all. This led us to consider two routes: a) attrs as that's been the de-facto for class boilerplate and handy data methods since 2015 including support for 3.6+ with all dataclass features with a more rapid release cycle, or b) dataclasses backport for 3.6 only, though we had tiny concerns like it was always installing for 3.6+ and not 3.6 only, and whether additional features and types would've been backported to 3.6 on a release cycle.

attrs seemed the perfect candidate considering release cycle but we hit the same problem with autocompletion, and VSCode Pylance was not GA at the time either. We decided not to risk it, keep the original logic @michaelbrewer did using @property as that worked for all IDEs on autocompletion with a helpful text to fields, and deal with the naming per se later as we'd figure a better name given the handy methods it has.

Using pydantic instead. As you mentioned, Pydantic adds at minimum ~35M using the latest CFlag optimizations with pip. When using SAM CLI however, it PythonPipBuilder doesn't support these new optimizations leading to an additional 91M vs 35M, as it pulls wheels and binaries for Linux and doesn't support the env vars in pip discussed in the thread. For this reason, we decided not to use pydantic however easier it'd have been to maintained it'd directly incur additional charges for customers and not aligning with our tenets.

This in itself brings us to the major decision factor in v2 on whether to use implicit namespace packages (PEP 420) so you could install each utility separately along with their own dependencies, or make all dependencies optional since this whole package is ~1.2M. The latter is another discussion I need to write into a new issue to hear from the community before we cut v2, since Boto alone went from 35M to 64M in an year (~10M increase in the past 2 months alone).

Moving forward from here

I'm turning this issue into a feature request and adjusting the title so the language is more accommodating to what this community represents. These are the changes we'd be comfortable to make or welcome PRs at this stage to not having to wait until v2 as there are bigger priorities like the boto package size above:

I hope this helps and I'll concentrate on recreating the documentation for this utility as initially planned to start today.

Thank you all

michaelbrewer commented 3 years ago

Good suggestions @heitorlessa.

Just as a note, raw_event is just returning the wrapped dictionary object (self._data), so when changing this to asdict() method people may think we are generating a new dictionary object? Also having both could be confusing too.

Another note about the “event_sources” implementation, if it was implemented as just a wrapper to the original event dict, making it as light weight as possible. So a very large dict that only needs to have a single key look up, does not need to perform a full parse, i would like to keep this behavior.

michaelbrewer commented 2 years ago

@heitorlessa

No much will happen on this issue until Python 3.6 and whether dataclasses actually delivers the savings we expect.

I would just propose a rename

heitorlessa commented 2 years ago

Yeah we discussed of renaming it earlier either way

On Fri, 12 Nov 2021 at 19:17, Michael Brewer @.***> wrote:

@heitorlessa https://github.com/heitorlessa

No much will happen on this issue until Python 3.6 and whether dataclasses actually delivers the savings we expect.

I would just propose a rename

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/awslabs/aws-lambda-powertools-python/issues/481#issuecomment-967320433, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZPQBF3E7FO3ZH6IUJGNUDULVK4NANCNFSM46VLHF2A .

heitorlessa commented 2 years ago

Closing as we will only be able to rename it completely in v3. Data classes confirmed to be too slow. We're following Pydantic v2 closely for a faster and closer API model. As there isn't a date, raw_event property does the job.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.