JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

dtypes don't serialize properly when using nested models with alias generators #86

Closed: timoguin closed this issue 1 month ago

timoguin commented 4 months ago

Problem

When a Pydantic alias generator is configured, dtypes does not apply the serialization alias to the fields of nested models.

My use case is that I have data coming from an API that is in camelCase. I validate against that format using the to_camel() alias generator. Serialized data should always be in snake case.

The bug can be reproduced with the following code:

import patito as pt
from pydantic import AliasGenerator, ConfigDict
from pydantic.alias_generators import to_camel, to_snake

class BaseModel(pt.Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(
            validation_alias=to_camel,
            serialization_alias=to_snake,
        ),
        populate_by_name=True,
    )

class NestedModel(BaseModel):
    nested_field: int

class ParentModel(BaseModel):
    parent_field: int
    nested_model: NestedModel

When calling dtypes on NestedModel, things are serialized properly:

In [3]: NestedModel.dtypes
Out[3]: {'nested_field': Int64}

However, when calling dtypes on ParentModel, the columns for NestedModel are back to camelCase:

In [14]: ParentModel.dtypes
Out[14]: {'parent_field': Int64, 'nested_model': Struct({'nestedField': Int64})}
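To make the mismatch concrete, here is a minimal standard-library sketch of the output I would expect. It treats the dtypes as a plain nested dict of strings (a stand-in for the real Polars types) and renames every key recursively. The helpers camel_to_snake and snake_case_keys are hypothetical names invented for this illustration; they are not part of patito or pydantic, and this is not how the library fixes the problem.

```python
import re


def camel_to_snake(name: str) -> str:
    """Simplified camelCase -> snake_case conversion (illustrative only)."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def snake_case_keys(schema: dict) -> dict:
    """Recursively rename the keys of a nested mapping to snake_case."""
    return {
        camel_to_snake(key): snake_case_keys(value) if isinstance(value, dict) else value
        for key, value in schema.items()
    }


# Plain-dict stand-in for the dtypes reported above (strings instead of Polars types).
reported = {"parent_field": "Int64", "nested_model": {"nestedField": "Int64"}}
expected = snake_case_keys(reported)
print(expected)
```

The point is only that the nested struct field should come out as nested_field, matching what NestedModel.dtypes reports on its own.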

Serialization works as expected (can be initialized with camelCase):

In [16]: foo = ParentModel(parent_field=1, nested_model={'nestedField': 2})

In [17]: foo.model_dump()
Out[17]: {'parent_field': 1, 'nested_model': {'nested_field': 2}}

Solution

I've recently updated all my dependencies and am not sure if this is a new issue or one that already existed. I have a branch where I've added the above code as an initial test and have played with the mode="serialization" flag for model_dump_json(), but so far I haven't figured out the issue.

That branch is linked below.

It's worth noting that, without populate_by_name=True set on the model config, camelCase fields will fail validation. I think this is a newer flag, as well as the mode option for model dumping.
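For reference, here is a minimal sketch of how the validation and serialization aliases show up in plain pydantic, independent of patito. It uses pydantic's own BaseModel (not pt.Model) and the model_json_schema() method with its mode argument. The class names and the nested_property_names helper are made up for this example, and whether patito derives its dtypes from this schema is an assumption on my part, not something confirmed here.

```python
from pydantic import AliasGenerator, BaseModel, ConfigDict
from pydantic.alias_generators import to_camel, to_snake


class CamelInSnakeOut(BaseModel):
    # Same alias configuration as in the reproduction above.
    model_config = ConfigDict(
        alias_generator=AliasGenerator(
            validation_alias=to_camel,
            serialization_alias=to_snake,
        ),
        populate_by_name=True,
    )


class Nested(CamelInSnakeOut):
    nested_field: int


class Parent(CamelInSnakeOut):
    parent_field: int
    nested_model: Nested


def nested_property_names(mode: str) -> set[str]:
    """Collect the property names of all nested definitions in the JSON schema."""
    schema = Parent.model_json_schema(by_alias=True, mode=mode)
    return {
        name
        for definition in schema["$defs"].values()
        for name in definition["properties"]
    }


validation_names = nested_property_names("validation")
serialization_names = nested_property_names("serialization")
print(validation_names, serialization_names)
```

With this configuration the validation-mode schema should list the nested field under its camelCase alias and the serialization-mode schema under its snake_case alias, which is the same split seen in the dtypes output.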

References

thomasaarholt commented 4 months ago

Hi @timoguin! Great catch on this! Thank you for the comprehensive issue report! Can you try this branch/PR and see if it resolves your issues? https://github.com/JakobGM/patito/pull/87

I still need to add a few tests, preferably covering both the validation_alias and serialization_alias behaviors. If you'd like to help with that, I'd really appreciate it!

timoguin commented 3 months ago

I'd love to if I can find the time, but I doubt I'll be able to any time soon.

But I can confirm that this does indeed fix the issue I was having. It's much appreciated! Thanks for providing the library. It's been nice to work with. 😄

To supply some context around why I ran into this, I had a nested field that was ending up in Polars with one of the struct fields specified as a null type. This is fine in Polars apparently but was causing an exception when attempting to write the dataframe to a Delta Lake table. I said "no prob, I got this Patito..." and tried to use it to pass the dtypes explicitly. But the dtypes were a lie!