JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
252 stars · 23 forks

Add support for pydantic 2.0, polars 0.20.10 and remove duckdb support #32

Closed brendancooley closed 5 months ago

brendancooley commented 9 months ago

Builds upon #4. All existing tests (excepting `tests/test_duckdb/`) passing after upgrade to `pydantic==2.4.2`, `polars==0.19.11`.

Closes #11 Closes #28 Closes #26 ...possibly closes others

Summary

TODO

And I'm sure there are other issues introduced by these changes, and bugs that the existing test suite does not yet catch. But hopefully this serves as a template for discussion and new test collection and can help move the ball forward on getting this package onto pydantic2. Feedback very welcome.

ion-elgreco commented 9 months ago

I'll build this one locally and see if it will fix the other bugs Patito 0.50 still has

ion-elgreco commented 9 months ago

They were also working on v2 here: https://github.com/JakobGM/patito/tree/jakobgm%2Fpatito-v2

brendancooley commented 9 months ago

> I'll build this one locally and see if it will fix the other bugs Patito 0.50 still has

Anxious to hear how it goes

ion-elgreco commented 9 months ago

@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method; I solved it a bit differently with v1 in https://github.com/JakobGM/patito/pull/20. Perhaps you can grab my test from there and include it in your PR. I already checked that the test also passes with your changes.

One thing that will still be broken is the examples creation: you need to include this fix https://github.com/JakobGM/patito/pull/20 and then build on top of it, since a type which is Optional[str] will now have the parameter minimum nested behind a list in the anyOf key.

I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

brendancooley commented 9 months ago

> @brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.
>
> One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.
>
> I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)

happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.

ion-elgreco commented 9 months ago

> @brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes.
>
> One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key.
>
> I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)
>
> happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.

#20 is a fix regarding creating proper examples with numerical values. I meant that just taking that fix directly into pydantic v2 won't work; I saw that grabbing the minimum field is now nested if you have a field that is optional.
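
To make the nesting concrete: in pydantic v2's JSON schema output, an Optional field's constraints end up inside an `anyOf` list rather than at the top level, so any lookup has to descend into it. A minimal sketch; the helper name is made up and the schema dicts below only approximate what pydantic emits:

```python
from typing import Any, Optional

def get_constraint(field_schema: dict, key: str) -> Optional[Any]:
    """Fetch a constraint (e.g. "minimum") from a field's JSON schema,
    descending into ``anyOf`` when the field was declared Optional."""
    if key in field_schema:
        return field_schema[key]
    for sub in field_schema.get("anyOf", []):
        if key in sub:
            return sub[key]
    return None

# Shapes approximating what pydantic v2 emits for
# `foo: int = Field(ge=0)` vs `foo: Optional[int] = Field(ge=0)`:
plain_int = {"type": "integer", "minimum": 0}
optional_int = {"anyOf": [{"type": "integer", "minimum": 0}, {"type": "null"}]}

assert get_constraint(plain_int, "minimum") == 0
assert get_constraint(optional_int, "minimum") == 0
```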

ion-elgreco commented 9 months ago

@brendancooley tried it on our repo and tests, I am now also getting this static type hint error:

Expression of type "FieldInfo" cannot be assigned to declared type "int"
  "FieldInfo" is incompatible with "int"

on this code:

class Test(pt.Model):
    test: int = pt.Field(constraints=pl.struct("a", "b").is_unique())

ion-elgreco commented 9 months ago

Also nullable columns are not working, MRE (minimal reproducible example):

class Test(pt.Model):
    foo: str|None = pt.Field(dtype=pl.Utf8)

print(Test.nullable_columns)
set()

Result should be: {'foo'}
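
The expected behavior can be sketched with plain typing introspection: a column is nullable when its annotation is a union that includes None. This is an illustrative stand-in for what `Model.nullable_columns` should compute, not patito's actual implementation:

```python
import types
from typing import Optional, Union, get_args, get_origin

# Both typing.Union (Optional[str]) and the 3.10+ ``str | None`` syntax
# must be recognised; the latter has origin types.UnionType.
UNION_ORIGINS = {Union, getattr(types, "UnionType", Union)}

def nullable_columns(annotations: dict) -> set:
    """Columns whose annotation admits None, i.e. what the model
    above should report as {'foo'}."""
    return {
        name
        for name, ann in annotations.items()
        if get_origin(ann) in UNION_ORIGINS and type(None) in get_args(ann)
    }

assert nullable_columns({"foo": Optional[str], "bar": int}) == {"foo"}
```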

brendancooley commented 9 months ago

Thanks @ion-elgreco, I'll add these tests to the PR and start hacking at them tomorrow. Have not looked much into structs but it would be nice if we had more tests for those.

ion-elgreco commented 9 months ago

> Thanks @ion-elgreco, I'll add these tests to the PR and start hacking at them tomorrow. Have not looked much into structs but it would be nice if we had more tests for those.

This example where I am using a struct is just a simple trick to do multiple column uniqueness as a constraint.

One thing that is missing though in Patito but probably out of scope for the migration is proper struct support as a dtype

brendancooley commented 9 months ago

> Also nullable columns are not working, mre:
>
> class Test(pt.Model):
>     foo: str|None = pt.Field(dtype=pl.Utf8)
>
> print(Test.nullable_columns)
> set()
>
> Result should be: {'foo'}

@ion-elgreco should be addressed by https://github.com/JakobGM/patito/pull/32/commits/e2bf0d7351ca8750affcbae8f65ce0e48357dcd5

brendancooley commented 9 months ago

https://github.com/JakobGM/patito/pull/32/commits/017c59b2373705059b951b6a22f8f6dd787c851a also adds some sanity checking that the annotated types match the allowable polars types specified in Field.dtype
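
A rough sketch of what such a sanity check might look like. The dtype names are plain strings and the mapping is illustrative, not patito's actual tables or polars types:

```python
# Map each Python annotation to the polars dtype names it may legally use.
# The table and the error wording below approximate the behavior described
# in the commit; they are not copied from it.
ALLOWED_DTYPES: dict = {
    int: {"Int8", "Int16", "Int32", "Int64", "UInt8", "UInt16", "UInt32", "UInt64"},
    float: {"Float32", "Float64"},
    str: {"Utf8"},
    bool: {"Boolean"},
}

def check_dtype(column: str, annotation: type, dtype_name: str) -> None:
    """Raise if the Field(dtype=...) choice is not allowable for the annotation."""
    if dtype_name not in ALLOWED_DTYPES.get(annotation, set()):
        raise ValueError(
            f"Invalid dtype {dtype_name} for column {column!r}. Check that "
            "specified dtype is allowable for the given type annotations."
        )

check_dtype("foo", int, "UInt32")  # passes: UInt32 is a valid integer dtype
try:
    check_dtype("foo", str, "UInt32")  # str column with an integer dtype
except ValueError as exc:
    assert "Invalid dtype UInt32" in str(exc)
```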

brendancooley commented 9 months ago

> @brendancooley tried it on our repo and tests, I am also getting this now static type hint error:
>
> Expression of type "FieldInfo" cannot be assigned to declared type "int"
>   "FieldInfo" is incompatible with "int"
>
> on this code:
>
> class Test(pt.Model):
>     test: int = pt.Field(constraints=pl.struct("a", "b").is_unique())

yes I can replicate these in vscode if I enable

{
  "python.analysis.typeCheckingMode": "basic"
}

in settings.json. Looking into it.

brendancooley commented 9 months ago

> @brendancooley tried it on our repo and tests, I am also getting this now static type hint error:
>
> Expression of type "FieldInfo" cannot be assigned to declared type "int"
>   "FieldInfo" is incompatible with "int"
>
> on this code:
>
> class Test(pt.Model):
>     test: int = pt.Field(constraints=pl.struct("a", "b").is_unique())

This should be handled by https://github.com/JakobGM/patito/pull/32/commits/161300b80c1f898d7246b46bd0a918e6ab818426
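
For context, the usual fix for this class of error is to declare the Field factory's return type as `Any`, which is what pydantic's own `Field` does. A toy sketch of the idea; `FieldInfo` here is a stand-in class, and the constraint is a plain string to keep the example self-contained:

```python
from typing import Any

class FieldInfo:
    """Stand-in for the patito/pydantic FieldInfo (illustrative only)."""
    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs

def Field(**kwargs: Any) -> Any:
    # Declaring the return type as ``Any`` (pydantic v2's own Field factory
    # does the same) is what lets `test: int = Field(...)` satisfy pyright
    # in basic/strict mode, even though a FieldInfo comes back at runtime.
    return FieldInfo(**kwargs)

class Test:
    # In patito this would be a polars expression; a string keeps it runnable.
    test: int = Field(constraints="pl.struct('a', 'b').is_unique()")

assert isinstance(Test.test, FieldInfo)
```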

brendancooley commented 9 months ago

> @brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method, I solved a bit a different with v1 #20. Perhaps you can grab my test from there and include it in your pr. I already checked the test also passes with your changes. One thing that still will be broken is the examples creation, you need to include this fix #20 and then build on top of it since now a type which is Optional[str] will have the parameter minimum behind a list in the anyOf key. I will run it tomorrow on our production repo to see if anything breaks : ) (possibly in the examples creation it will)
>
> happy to include the test. did you mean to reference a different issue/pr in your second comment? not clear on how #20 helps with example string generation but perhaps I am missing something.
>
> 20 is a fix regarding, creating proper examples with numerical values. I meant just taking that fix directly into pydantic v2 won't work. I saw that grabbing the minimum field is now nested if you have a field that is optional

Yes this makes sense. Let me know if https://github.com/JakobGM/patito/pull/32/commits/9e132bc49baebaf3cf874c3469e61c811df88b74 fixes the issue @ion-elgreco.

ion-elgreco commented 9 months ago

I see more bugs at the moment :) with the types it's checking for and the examples it's creating:

examples ignores the dtype when it's an optional field, see test below:

class Test(pt.Model):
    foo: int | None = pt.Field(dtype=pl.UInt32)

assert pl.DataFrame(schema={'foo':pl.UInt32}).schema == Test.examples().schema

Also, this executes while it shouldn't:

class Test(pt.Model):
    foo: str | None = pt.Field(dtype=pl.UInt32)

Lastly with datetimes the mapping is not correct:

class Test(pt.Model):
    date: datetime = pt.Field(dtype=pl.Datetime(time_unit="us"))

`ValueError: Invalid dtype Date for column 'date'. Check that specified dtype is allowable for the given type annotations.`

brendancooley commented 9 months ago

🐞👟 @ion-elgreco keep em coming, hopefully https://github.com/JakobGM/patito/pull/32/commits/93e171e3424f39269ccabbc6b31e00219ed77bd0 addresses these with better logic for column nullability

ion-elgreco commented 9 months ago

@brendancooley nice, all my original unit-tests are running now without any issue from what I can see :)

ion-elgreco commented 9 months ago

Also very nice to see this issue will also get resolved now: https://github.com/JakobGM/patito/issues/11. I think you can add all these issues I've mentioned to the - closes in the pr description 🕺. The new version will be huge 😄

One suggestion on the dtype mismatches: the message could be rewritten to clarify what the current data dtype is and what the expected model dtype is, e.g. `The data field dtype Float32 does not match the model schema field dtype: int | None (type=type_error.columndtype)`.

The old one does not immediately give away what the expected data dtype should be.

foo
  Polars dtype Float32 does not match model field type. (type=type_error.columndtype)

ion-elgreco commented 9 months ago

Actually found a couple bugs:

.examples() does not work when you have this class:

class Test(pt.Model):
    foo: list[str] | None  = pt.Field(dtype=pl.List(pl.Utf8))

Test.examples({"foo":[['test']]})

# results in
TypeError: argument of type 'NoneType' is not iterable

This is also possible, but shouldn't be, since the type annotation doesn't match the pt.Field dtype. It seems for any list type annotation there is no corresponding check against the patito dtype; I think you need to do something with recursion here to check all of these nested complex types. Ideally you should be able to provide complex nested dtypes like list[list[list[datetime]]]. Then this would also work later with pl.Struct https://github.com/JakobGM/patito/issues/30 and then you can create really nice validation models.

class Test(pt.Model):
    foo: list[str] | None  = pt.Field(dtype=pl.Utf8)

I also noticed the pt.Field docs are missing at the moment: Field.__doc__ = FieldDoc.__doc__ does not seem to work properly. Also, I think it would make more sense if we could see the parameters in the function definition; otherwise there is no functioning auto-completion and it's unclear (outside of reading the docs) which parameters are valid or not.

If you need any help, just let me know ;)
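
The recursive check suggested above can be sketched with toy stand-ins for the polars dtypes (the `Utf8`/`Datetime`/`List` classes below are illustrative, not the real `pl.*` types):

```python
from datetime import datetime
from typing import get_args, get_origin

# Toy stand-ins for polars dtypes; a real check would use pl.Utf8,
# pl.Datetime, pl.List, etc.
class Utf8: ...
class Datetime: ...
class List:
    def __init__(self, inner):
        self.inner = inner

SCALARS = {str: Utf8, datetime: Datetime}

def matches(annotation, dtype) -> bool:
    """Recursively compare a nested list annotation against a nested
    List dtype, e.g. list[list[datetime]] vs List(List(Datetime))."""
    if get_origin(annotation) is list:
        return isinstance(dtype, List) and matches(get_args(annotation)[0], dtype.inner)
    return SCALARS.get(annotation) is dtype

assert matches(list[str], List(Utf8))
assert not matches(list[str], Utf8)  # the reported bug: list annotation, scalar dtype
assert matches(list[list[datetime]], List(List(Datetime)))
```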

brendancooley commented 9 months ago

> Also very nice to see this issue will also get resolved now: #11. I think you can add all these issues I've mentioned to the - closes in the pr description 🕺. The new version will be huge 😄
>
> One suggestion on the dtype mismatches, it could be rewritten to this to clarify what the current data schema dtypes is and what the expected model dtype is. The data field dtype Float32 does not match the model schema field dtype: int | None (type=type_error.columndtype)
>
> The old one does not immediately give away what the expected data dtype should be.
>
> foo
>   Polars dtype Float32 does not match model field type. (type=type_error.columndtype)

Like this idea, cleaned up the messaging a bit on https://github.com/JakobGM/patito/pull/32/commits/e9dbca6b0d6986fc3b76bda074e26d099c291b1e

Should see errors messages like e.g.

"Invalid dtype UInt32 for column 'foo'. Allowable polars dtypes for Union[str, NoneType] are: Utf8."

brendancooley commented 9 months ago

@thomasaarholt also dropped the classproperties set via the ModelMetaclass. Definitely cleaner than the metaclass+classvar formulation. Went back and forth a bit but decided to adopt polars' classproperty decorator to try to avoid getting caught up in python debates about whether to preserve this functionality: https://stackoverflow.com/questions/76249636/class-properties-in-python-3-11

Test suite does fine with this but @ion-elgreco lmk if you run into issues.

ion-elgreco commented 9 months ago

@brendancooley I got lost a bit, what is the different approach to class properties solving here?

brendancooley commented 9 months ago

> @brendancooley I got lost a bit, what is the different approach to class properties solving here?

The original implementation used a combination of ClassVar and patito.pydantic.ModelMetaclass to set class properties such as Model.columns. Here's the original note:

    # -- Class properties set by model metaclass --
    # This weird combination of a MetaClass + type annotation
    # in order to make the following work simultaneously:
    #     1. Make these dynamically constructed properties of the class.
    #     2. Have the correct type information for type checkers.
    #     3. Allow sphinx-autodoc to construct correct documentation.
    #     4. Be compatible with python 3.7.
    # Once we drop support for python 3.7, we can replace all of this with just a simple
    # combination of @property and @classmethod.

Since we're on python^3.8 now I think this pattern can be safely deprecated. The natural way to do this is to use

class Model(BaseModel):

    @classmethod
    @property
    def columns(cls):
        ...

however, it appears python does not plan on supporting this pattern in the future.

In our case, it's extremely useful to be able to mark classproperties this way, so I adopted polars' custom @classproperty implementation in order for us to be able to replace the dual @classmethod, @property syntax with

class Model(BaseModel):

    @classproperty
    def columns(cls):
        ...

and not worry about what python does with @classmethod and @property in future versions.
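
For reference, such a descriptor can be written in a few lines. This is a generic sketch in the spirit of polars' `classproperty`, not a copy of its implementation:

```python
class classproperty:
    """Minimal descriptor: a property whose getter receives the class,
    not an instance, so it works on both Model.columns and Model().columns."""

    def __init__(self, getter):
        self.getter = getter

    def __get__(self, instance, owner):
        # ``owner`` is the class through which the attribute was accessed.
        return self.getter(owner)

class Model:
    _schema = {"a": int, "b": str}

    @classproperty
    def columns(cls):
        return list(cls._schema)

assert Model.columns == ["a", "b"]    # accessible on the class...
assert Model().columns == ["a", "b"]  # ...and on instances
```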

brendancooley commented 9 months ago

> Actually found a couple bugs:
>
> .examples() does not work when you have this class:
>
> class Test(pt.Model):
>     foo: list[str] | None  = pt.Field(dtype=pl.List(pl.Utf8))
>
> Test.examples({"foo":[['test']]})
>
> # results in
> TypeError: argument of type 'NoneType' is not iterable
>
> This is also possible, but shouldn't be, since the type annotation doesn't match the pt.Field dtype. It seems for any list type annotation there is no corresponding check against the patito dtype; I think you need to do something with recursion here to check all of these nested complex types. Ideally you should be able to provide complex nested dtypes like list[list[list[datetime]]]. Then this would also work later with pl.Struct #30 and then you can create really nice validation models.
>
> class Test(pt.Model):
>     foo: list[str] | None  = pt.Field(dtype=pl.Utf8)
>
> I also noticed, the pt.Field docs are missing at the moment: Field.__doc__ = FieldDoc.__doc__ this does not seem to work properly. Also, I think it would make more sense if we can see the parameters in the function definition, otherwise there is no functioning auto completion and it's unclear (outside of reading the docs) which parameters are valid or not.
>
> If you need any help, just let me know ;)

I think the examples generation issue is fixed by https://github.com/JakobGM/patito/pull/32/commits/db44aa344edfa51fec9a9bc114a187b5e780f081

Field.__doc__ works ok for me? Do you have an example of where it's not appearing? I went ahead and put the patito-custom fields in the Field signature, but left off the pydantic ones. Idea being that we want this signature to be robust to possible changes in pydantic's Field signature. Do we think version 2 is stable enough for us to copy their args over directly? I haven't closely followed the roadmap over there.

ion-elgreco commented 9 months ago

> @brendancooley I got lost a bit, what is the different approach to class properties solving here?
>
> The original implementation used a combination of ClassVar and patito.pydantic.ModelMetaclass to set class properties such as Model.columns. Here's the original note:
>
>     # -- Class properties set by model metaclass --
>     # This weird combination of a MetaClass + type annotation
>     # in order to make the following work simultaneously:
>     #     1. Make these dynamically constructed properties of the class.
>     #     2. Have the correct type information for type checkers.
>     #     3. Allow sphinx-autodoc to construct correct documentation.
>     #     4. Be compatible with python 3.7.
>     # Once we drop support for python 3.7, we can replace all of this with just a simple
>     # combination of @property and @classmethod.
>
> Since we're on python^3.8 now I think this pattern can be safely deprecated. The natural way to do this is to use
>
> class Model(BaseModel):
>
>     @classmethod
>     @property
>     def columns(cls):
>         ...
>
> however, it appears python does not plan on supporting this pattern in the future.
>
> In our case, it's extremely useful to be able to mark classproperties this way, so I adopted polars' custom @classproperty implementation in order for us to be able to replace the dual @classmethod, @property syntax with
>
> class Model(BaseModel):
>
>     @classproperty
>     def columns(cls):
>         ...
>
> and not worry about what python does with @classmethod and @property in future versions.

That makes sense! Thanks for explaining, wasn't aware of these differences.

> Actually found a couple bugs: .examples() does not work when you have this class:
>
> class Test(pt.Model):
>     foo: list[str] | None  = pt.Field(dtype=pl.List(pl.Utf8))
>
> Test.examples({"foo":[['test']]})
>
> # results in
> TypeError: argument of type 'NoneType' is not iterable
>
> This is also possible, but shouldn't be, since the type annotation doesn't match the pt.Field dtype. It seems for any list type annotation there is no corresponding check against the patito dtype; I think you need to do something with recursion here to check all of these nested complex types. Ideally you should be able to provide complex nested dtypes like list[list[list[datetime]]]. Then this would also work later with pl.Struct #30 and then you can create really nice validation models.
>
> class Test(pt.Model):
>     foo: list[str] | None  = pt.Field(dtype=pl.Utf8)
>
> I also noticed, the pt.Field docs are missing at the moment: Field.__doc__ = FieldDoc.__doc__ this does not seem to work properly. Also, I think it would make more sense if we can see the parameters in the function definition, otherwise there is no functioning auto completion and it's unclear (outside of reading the docs) which parameters are valid or not. If you need any help, just let me know ;)

> I think the examples generation issue is fixed by db44aa3
>
> Field.__doc__ works ok for me? Do you have an example of where it's not appearing? I went ahead and put the patito-custom fields in the Field signature, but left off the pydantic ones. Idea being that we want this signature to be robust to possible changes in pydantic's Field signature. Do we think version 2 is stable enough for us to copy their args over directly? I haven't closely followed the roadmap over there.

I don't have a reproducible example for that, but it was when using Pyright and Pylance in VS Code; the Field signature was empty. I checked your commit, but I don't think you will see the pydantic parameters now, such as ge, le, lt, unique, pattern etc. Ideally these are also visible in the signature.

Also, note to myself: I need to stop quoting because the comments become immensely long haha

brendancooley commented 9 months ago

> I don't have a reproducible example for that, but it was when using Pyright and Pylance in VS Code; the Field signature was empty. I checked your commit, but I don't think you will see the pydantic parameters now, such as ge, le, lt, unique, pattern etc. Ideally these are also visible in the signature.

Thinking on the best way to accomplish this. Would be nice if we could develop a nice framework for extending FieldInfo (and updating the signature for Field) for arbitrary sets of new arguments. This would make it easy for users to customize Field as necessary for their specific use cases.

Would be a lot easier if pydantic's Field was actually a class and not a function that spits back out a FieldInfo.


Tried combining pydantic args and patito args like this

field_sig = inspect.signature(Field)
pyd_field_params = inspect.signature(fields.Field).parameters
Field.__signature__ = inspect.Signature(
    parameters=[pyd_field_params["default"]]
    + [v for v in field_sig.parameters.values() if v.name != "kwargs"]
    + [v for v in pyd_field_params.values() if v.name not in ["default", "extra"]]
    + [field_sig.parameters["kwargs"]],
    return_annotation=field_sig.return_annotation
)

but vscode does not seem to pick up on the signature update. I'm loath to just copy over the pydantic arguments because I suspect they will not be stable...many are already scheduled for deprecation.


https://github.com/JakobGM/patito/pull/32/commits/438974c807c3917da93f2b7264b5f9e4172af7f0 experiments with making the FieldInfo class more extensible, by allowing users to define extension classes (as pydantic Models) and generate the Field function dynamically. I use this functionality to generate an extended FieldInfo with patito custom fields, but demonstrate how it can be used to construct arbitrary FieldInfo extensions in test_model.test_custom_field_info. This makes it very easy for users to customize their own Field objects, which is certainly useful for my use cases and has been requested by some pydantic users (see https://github.com/pydantic/pydantic/issues/6454). Since these classes are dynamically constructed, however, it is very difficult to get the typing right and to generate a valid signature for the Field function. I've resorted to manually typing PatitoFieldInfo, which helps resolve the patito-custom attributes, but leaves the dynamically constructed Field objects without good typing.

ion-elgreco commented 9 months ago

@brendancooley @thomasaarholt What is still required to push this over the edge?

I foresee I have more time next couple weeks to help here so let me know.

brendancooley commented 9 months ago

> @brendancooley @thomasaarholt What is still required to push this over the edge?
>
> I foresee I have more time next couple weeks to help here so let me know.

  1. Do we want to continue explicit duckdb support? @ion-elgreco maybe take a pass at this and see if it's easy to get the tests to pass again? If not, delete or deprecate existing code. fwiw I've been using duckdb to work with patito-modeled tables and haven't felt that explicit support is necessary.
  2. There are still a few typing pain points, mainly pydantic.field. I'm inclined to leave this as is for now but open to suggestions for better usability.
  3. Refactor directory structure as described above (separate Model/Field) to better mirror pydantic internal structure
  4. Update readme and documentation where sparse. @thomasaarholt does anything need to be done explicitly to keep the readthedocs page up to date? https://patito.readthedocs.io/en/latest/index.html

ion-elgreco commented 9 months ago

@brendancooley I think patito should only focus on Polars for now, I doubt many people use it for the DuckDB functionality.

thomasaarholt commented 8 months ago

I think we should drop duckdb support too :)

@brendancooley could you clarify for me how users can extend their requirements using the new Field/FieldInfo? I'm a bit hesitant to introduce something that doesn't get a static docstring like Field does currently. Doesn't the Field(constraints=[<some polars expressions>]) already let one add custom column requirements?

brendancooley commented 8 months ago

> I think we should drop duckdb support too :)
>
> @brendancooley could you clarify for me how users can extend their requirements using the new Field/FieldInfo? I'm a bit hesitant to introduce something that doesn't get a static docstring like Field does currently. Doesn't the Field(constraints=[<some polars expressions>]) already let one add custom column requirements?

Currently, arguments not explicitly enumerated in pydantic's Field factory get thrown into json_schema_extra. This is an issue first and foremost because the feature is scheduled for deprecation. It is also a problem because pydantic expects json-serializable arguments. Polars Expr are not currently serializable, so this raises an error when we call model_json_schema internally with constraints or derived_from expression args. See #28 for discussion on this.

A secondary issue is that pydantic does not make it easy to subclass FieldInfo. They make use of __slots__ to explicitly enumerate its attributes, and track values passed by users in _attributes_set in order to create minimal representations of the field info object. All of the FieldInfo attributes (except annotation and required) not passed by the user default to _Unset and are ignored by the __repr__ constructor, which makes for nice pretty-printing of the field. For example, passing

from patito.pydantic import fields
f = fields.Field(ge=0)
f

returns FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0)]) without ancillary attribute_x=None.

These constraints have admittedly made it difficult to cleanly add patito-specific attributes to the FieldInfo objects. The solution here attempts to allow arbitrary extensions to the FieldInfo object. Extension attributes are typed as pydantic BaseModel instances. For patito, this looks like

class PatitoFieldExtension(BaseModel):
    """Stores type hints and default values for patito-custom field attributes"""

    constraints: pl.Expr | Sequence[pl.Expr] | None = _Unset
    derived_from: str | pl.Expr | None = _Unset
    dtype: PolarsDataType | None = _Unset
    unique: bool | None = _Unset

    model_config = ConfigDict(arbitrary_types_allowed=True)

The patito-side field_info function takes a collection of extension classes and constructs a subclass of pydantic's FieldInfo, with __slots__ updated with the new arguments from the extension classes. Extension attributes are set in the __init__, where we also update the _attributes_set attribute to reflect the addition of the new arguments.

Then, the field function builds a factory for our custom FieldInfo, splitting arguments into those that are pydantic-native and those that belong to extensions. pydantic-base arguments are passed via pydantic.Field to ensure that pydantic-side argument processing occurs. Then we extract attributes that have been set on the pydantic FieldInfo and pass them, along with our custom arguments, to the patito-side FieldInfo.
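
A stripped-down sketch of this mechanism, using a toy slotted base class in place of pydantic's FieldInfo. All names below are illustrative, not the PR's actual code:

```python
class BaseFieldInfo:
    """Toy stand-in for pydantic's slotted FieldInfo (illustrative only)."""
    __slots__ = ("annotation", "default", "_attributes_set")

    def __init__(self, **kwargs):
        self.annotation = kwargs.pop("annotation", None)
        self.default = kwargs.pop("default", None)
        # pydantic records only the arguments the user actually passed:
        self._attributes_set = dict(kwargs)

def make_field_info(extension_attrs):
    """Build a FieldInfo subclass whose __slots__ include the extension
    attributes, mirroring the approach described above."""
    def __init__(self, **kwargs):
        # Split kwargs into extension args and base (pydantic-side) args.
        ext = {k: kwargs.pop(k) for k in extension_attrs if k in kwargs}
        BaseFieldInfo.__init__(self, **kwargs)
        for k, v in ext.items():
            setattr(self, k, v)
        self._attributes_set.update(ext)  # keep minimal-repr bookkeeping honest
    return type(
        "ExtendedFieldInfo",
        (BaseFieldInfo,),
        {"__slots__": tuple(extension_attrs), "__init__": __init__},
    )

PatitoFieldInfo = make_field_info(("constraints", "derived_from", "dtype", "unique"))
f = PatitoFieldInfo(default=0, dtype="UInt32")
assert f.default == 0 and f.dtype == "UInt32"
assert f._attributes_set == {"dtype": "UInt32"}
```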

The resulting API is equivalent to the previous version, albeit without nice typing

Field = field(exts=[PatitoFieldExtension])

Users can call this Field function to generate patito-specific FieldInfos as they did before. tests.test_model.test_custom_field_info enumerates some expectations for these custom FieldInfo objects.

A nice property of this implementation is that users can now extend patito's FieldInfo class by passing their own extensions to the patito.pydantic.field factory. This should also help facilitate easy maintenance of the library as we add new features.

The big downside of all of this is that the typing is very difficult. I am very hesitant to copy over pydantic's current Field signature because many of the arguments are already scheduled for deprecation. Users running different versions of pydantic might have different effective signatures for patito.Field. See, for example, min_items, but there are many others. But it should be possible to collect the arguments passed via the extension classes and those that are in the pydantic.Field signature for the user's current pydantic version and spit out a dynamic docstring. I experimented with this a bit but it was difficult to implement cleanly.

I'm happy to flesh out the existing documentation on these functions a bit and perhaps add examples. But having some discussion here to align on a vision for how this should work seems valuable. Very open to other ideas or feedback on my representation of how the pydantic-side stuff works. The pydantic folks do seem a little unwilling to support custom FieldInfo objects so I worry our solution may need to be a little hacky if we intend on keeping the existing API, see this issue for the feature request.

thomasaarholt commented 8 months ago

> Currently, arguments not explicitly enumerated in pydantic's Field factory get thrown into json_schema_extra. This is an issue first and foremost because the feature is scheduled for deprecation.

Actually, it's the extra argument that is being deprecated, in favour of json_schema_extra :)

brendancooley commented 8 months ago

> Currently, arguments not explicitly enumerated in pydantic's Field factory get thrown into json_schema_extra. This is an issue first and foremost because the feature is scheduled for deprecation.
>
> Actually, it's the extra argument that is being deprecated, in favour of json_schema_extra :)

Yes, but extra aliases unenumerated keyword args, right? So in the future they (pydantic) plan on supporting extra keyword args only via json_schema_extra. So if we were to adhere to these constraints, we would need to ensure that our extra arguments are json-serializable.
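
The serializability constraint can be demonstrated with plain json and a stand-in for a polars Expr. `FakeExpr` and the fallback encoding are hypothetical, purely to illustrate the failure mode and one possible workaround:

```python
import json

class FakeExpr:
    """Stand-in for a polars Expr, which json.dumps cannot handle."""
    def __init__(self, name):
        self.name = name

extra = {"derived_from": FakeExpr("foo")}

# The failure mode: walking json_schema_extra hits a non-JSON value.
try:
    json.dumps(extra)
except TypeError:
    pass  # "Object of type FakeExpr is not JSON serializable"

# A fallback serializer keeps the dump working:
def fallback(obj):
    if isinstance(obj, FakeExpr):
        return {"__expr__": obj.name}
    raise TypeError(f"not serializable: {obj!r}")

assert json.loads(json.dumps(extra, default=fallback)) == {"derived_from": {"__expr__": "foo"}}
```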

thomasaarholt commented 8 months ago

My suggestion is going to be to use this approach of passing them from the patito Field into the pydantic Field, but I'm thinking through your comments now. I'm currently writing a longer post with a suggestion. Wanted to keep conversation going though, so wrote the above first.

thomasaarholt commented 8 months ago

This is the reason polars serialization suddenly started throwing errors in pydantic btw. I think their changes made sense, but were unfortunate for us. https://github.com/pydantic/pydantic/pull/7625

thomasaarholt commented 8 months ago

So, both your and Jakob's/my implementations currently go via the pydantic json interface (model_json_schema) for a lot of the functionality enabled by patito. I propose that instead we use the BaseModel.model_fields interface, which exposes the field information as FieldInfo objects.

This means that we don't try to serialize any types, and can use pt.Field in the approach I commented above.

We would still under the hood be passing non-jsonable content to json_schema_extra, but while those arguments are intended to be json-serializable, I don't think they would try to serialize them unless explicitly called for.

I'm working in a notebook right now, going to try to at least share my approach tonight.

ion-elgreco commented 8 months ago

@brendancooley I remember @thomasaarholt created an upstream issue at Polars to make expressions serializable. I don't think this has been picked up yet.

I could take a look and see if I can work on that. I've been dabbling with Rust myself in delta-rs and some plugins, so perhaps a fun challenge to build some Rust code in Polars ^^

ion-elgreco commented 8 months ago

Actually nvm, it's already possible it seems :): https://github.com/pola-rs/polars/issues/12172

thomasaarholt commented 8 months ago

Hehe, since creating that issue, I've learnt two additional things:

  1. Polars datatypes aren't (yet) serializable
  2. We still need some way to provide a serializing mechanism in pydantic to deserialize into a serializable dict and then re-serialize using pydantic. The issues are a) pydantic doesn't know about .write_json(), and b) we can't just concatenate two json strings directly.

To clarify on the second point, if we want to keep the JSON-approach, there are two places we might want the fallback mechanism to support serializing polars objects:

import polars as pl
from pydantic import BaseModel, Field

class Foo(BaseModel):
    type_example: pl.Int64 # here as a type hint
    extra_example: int = Field(json_schema_extra={"expression_example": pl.col("foo")}) # here as an extra

Foo.model_json_schema()  # raises an error if either of the above is present

For the type hint, we would either have to:

  1. add a field_serializer (which I actually can't manage to get to work), but then we would have to add one for every column of a given name.
  2. subclass all types we want to work with, and add a __get_pydantic_core_schema__ (example here).
  3. add support in polars for __get_pydantic_core_schema__ on the classes we want pydantic to serialize.

For the extras example, I think we can only achieve it with points 2 or 3.

thomasaarholt commented 8 months ago

I just want to add - I find the pydantic serialization documentation really complicated, and I'm struggling quite a bit with implementing it on a subclass of pl.Float64. My thoughts in the previous post might be wrong, and I'd be very happy to be corrected.

brendancooley commented 8 months ago

@thomasaarholt Catching up and checking my understanding...if we want to pass and store patito-specific field attributes via pydantic's json_schema_extra, we need to either:

  1. remove all references to model_json_schema on the patito side and warn users that this is not a supported attribute for patito models (or possibly implement our own schema representation) OR
  2. support serialization of polars expressions and dtypes, either by implementing pydantic's field_serializer or by getting polars to support it natively

On the question of deserialization...suppose we can serialize dtypes and expressions, do we need to be able to go backward from JSON -> Python object in order for this approach to work as well? Is this supported by polars?

Could we use the fallback kwarg in pydantic's to_jsonable_python to pass serialization instructions for these? We'd need to allow this keyword to be passed around during schema construction on the pydantic side, but maybe that's more flexible than working with the field_serializer approach.

brendancooley commented 8 months ago

Also @thomasaarholt if your notebook scratch work is able to be shared I'd be happy to take a look and try to get up to speed on your approach.

thomasaarholt commented 8 months ago

Yes! I've just gone over it again, here it is: https://gist.github.com/thomasaarholt/c04ae8ad503a1476e5282e7098eeb1f7

I've called the result for each column a ColumnInfo class. I just needed a name that didn't have Field in it, cause I was going a little crazy 😅 .

thomasaarholt commented 8 months ago

Catching up and checking my understanding...if we want to pass and store patito-specific field attributes via pydantic's json_schema_extra, we need to either:

  1. remove all references to model_json_schema on the patito side and warn users that this is not a supported attribute for patito models (or possibly implement our own schema representation) OR
  2. support serialization of polars expressions and dtypes, either by implementing pydantic's field_serializer or by getting polars to support it natively

Yep! But on 2, field_serializer is not the way to go, it was just me trying to make pydantic somehow serialize a polars type. (I didn't manage it)

I don't think we need patito models to be json serializable. At least, I haven't been doing that, though I guess it might be nice to be able to save the patito schema as a json? In that case we could just add our own to_json and from_json methods. We can just convert a pl.Float64 to a "PolarsFloat64" string via a simple mapping, and then convert to and from like that.

I see that polars has added Expr.from_json, so that can read back in the result of Expr.write_json. We can probably use exactly these methods.

brendancooley commented 8 months ago

Came across an issue with write_json worth discussing here. polars expression serialization (perhaps unsurprisingly) is not supported for maps over python callables. For example, the following code:

from typing import Dict, Optional
from datetime import datetime

import pytz
import polars as pl

def convert_time_zone(struct: Dict) -> Optional[datetime]:
    if struct["timezone_id"] is None:
        return None
    return (
        struct["datetime_utc"]
        .astimezone(pytz.timezone(struct["timezone_id"]))
        .replace(tzinfo=None)
    )

expr = pl.struct(["datetime_utc", "timezone_id"]).map_elements(convert_time_zone)
expr.meta.write_json(None)

raises

ValueError: Error("error raised in python: AttributeError: Can't pickle local object 'Expr.map_elements.<locals>.wrap_f'", line: 0, column: 0)

I suspect this is a fairly common pattern for patito users. Minimal patito model using the above expression looks like:

from datetime import datetime

import polars as pl
import patito as pt
from pydantic import AwareDatetime

class DatetimeModel(pt.Model):
    datetime_utc: AwareDatetime = pt.Field(dtype=pl.Datetime(time_zone="UTC"))
    timezone_id: str | None
    datetime_local: datetime | None = pt.Field(
        default=None,
        derived_from=pl.struct(["datetime_utc", "timezone_id"]).map_elements(
            convert_time_zone
        ),
        dtype=pl.Datetime,
    )

@thomasaarholt any thoughts on how we ought to handle this case?

ion-elgreco commented 8 months ago

@brendancooley I honestly don't think this is a common pattern. You can do nearly anything with the Polars API or you use a plugin.

brendancooley commented 8 months ago

@brendancooley I honestly don't think this is a common pattern. You can do nearly anything with the Polars API or you use a plugin.

@ion-elgreco yeah I may just not be polars savvy enough to do this the "right" way. do you have a suggestion for how to implement the above expression in pure polars?

ion-elgreco commented 8 months ago

@brendancooley I can take a look if you have some sample data

brendancooley commented 8 months ago

@brendancooley I can take a look if you have some sample data

@ion-elgreco sure, how about this

from typing import Dict
from datetime import datetime
import pytz

import polars as pl

def convert_time_zone(struct: Dict) -> datetime | None:
    if struct["timezone_id"] is None:
        return None
    return (
        struct["datetime_utc"]
        .astimezone(pytz.timezone(struct["timezone_id"]))
        .replace(tzinfo=None)
    )

df = pl.DataFrame({"datetime_utc": [datetime.now(), datetime.now()], "timezone_id": ["America/New_York", None]})
df = df.with_columns(pl.col("datetime_utc").cast(pl.Datetime(time_zone="UTC")), pl.col("timezone_id"))
df.with_columns(datetime_local=pl.struct(["datetime_utc", "timezone_id"]).map_elements(convert_time_zone))  # current solution