JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

Add support for pydantic 2.0, polars 0.20.10 and remove duckdb support #32

Closed brendancooley closed 6 months ago

brendancooley commented 10 months ago

builds upon #4, **all existing tests (excepting `tests/test_duckdb/`) passing after upgrade to `pydantic==2.4.2`, `polars==0.19.11`**

Closes #11 Closes #28 Closes #26 ...possibly closes others

Summary

TODO

And I'm sure there are other issues introduced by these changes, and bugs that the existing test suite does not yet catch. But hopefully this serves as a template for discussion and new test collection and can help move the ball forward on getting this package onto pydantic2. Feedback very welcome.

ion-elgreco commented 9 months ago

@brendancooley how about this, just pl.when.then:

import polars as pl
from datetime import datetime
import pytz

df = pl.DataFrame({
    'id': [1,2],
    "date_col": [datetime(2020,10,10, tzinfo=pytz.UTC), datetime(2020,10,10, tzinfo=pytz.UTC)],
    'timezone': ['Europe/London', 'Africa/Kigali']
})

exprs = [
    pl.when(pl.col('timezone') == tz)
    .then(pl.col('date_col').dt.convert_time_zone(tz).dt.replace_time_zone(None))
    for tz in ['Europe/London', 'Africa/Kigali']
]

df.with_columns(
    pl.coalesce(exprs)
)
brendancooley commented 9 months ago

@brendancooley how about this, just pl.when.then:

import polars as pl
from datetime import datetime
import pytz

df = pl.DataFrame({
    'id': [1,2],
    "date_col": [datetime(2020,10,10, tzinfo=pytz.UTC), datetime(2020,10,10, tzinfo=pytz.UTC)],
    'timezone': ['Europe/London', 'Africa/Kigali']
})

exprs = [
    pl.when(pl.col('timezone') == tz)
    .then(pl.col('date_col').dt.convert_time_zone(tz).dt.replace_time_zone(None))
    for tz in ['Europe/London', 'Africa/Kigali']
]

df.with_columns(
    pl.coalesce(exprs)
)

very nice, and yes, it does serialize:

pl.coalesce(exprs).meta.write_json(None)
'{"Function":{"input":[{"Ternary":{"predicate":{"BinaryExpr":{"left":{"Column":"timezone"},"op":"Eq","right":{"Literal":{"Utf8":"Europe/London"}}}},"truthy":{"Function":{"input":[{"Function":{"input":[{"Column":"date_col"}],"function":{"TemporalExpr":{"ConvertTimeZone":"Europe/London"}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},{"Literal":{"Utf8":"raise"}}],"function":{"TemporalExpr":{"ReplaceTimeZone":null}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},"falsy":{"Literal":"Null"}}},{"Ternary":{"predicate":{"BinaryExpr":{"left":{"Column":"timezone"},"op":"Eq","right":{"Literal":{"Utf8":"Africa/Kigali"}}}},"truthy":{"Function":{"input":[{"Function":{"input":[{"Column":"date_col"}],"function":{"TemporalExpr":{"ConvertTimeZone":"Africa/Kigali"}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},{"Literal":{"Utf8":"raise"}}],"function":{"TemporalExpr":{"ReplaceTimeZone":null}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},"falsy":{"Literal":"Null"}}}],"function":"Coalesce","options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":true,"returns_scalar":false,"cast_to_supertypes":true,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}'
thomasaarholt commented 9 months ago

We can just document that we don't support Python callables like the ones used in map_elements. I agree with @ion-elgreco that it's not really a problem. In general, avoid using map_elements in your code :) Only use it if you can't find a polars expression that does the same thing, since native expressions can easily be ~100x faster.
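
For illustration, a toy contrast (my example, not from this PR):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

# slow: runs a Python callable once per element, and cannot be serialized
df.with_columns(pl.col("x").map_elements(lambda v: v * 2, return_dtype=pl.Int64))

# fast: a native expression, vectorized and serializable
df.with_columns(pl.col("x") * 2)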

thomasaarholt commented 9 months ago

Just to demonstrate, here is a gist of how we can serialize polars expressions using the write_json interface within pydantic models. In this case I'm reading the JSON-ified expression back in as a dict, since that is a nice type in Python that you could do stuff with, but really we might just want to re-serialize the write_json string. It will end up double-escaped in the JSON file though, which I think is undesirable.

import json
import polars as pl

expr = pl.col("foo") == 2

json.dumps(expr.meta.write_json())
# '"{\\"BinaryExpr\\":{\\"left\\":{\\"Column\\":\\"foo\\"},\\"op\\":\\"Eq\\",\\"right\\":{\\"Literal\\":{\\"Int32\\":2}}}}"'
thomasaarholt commented 9 months ago

Yes! I've just gone over it again, here it is: gist.github.com/thomasaarholt/c04ae8ad503a1476e5282e7098eeb1f7

I've called the result for each column a ColumnInfo class. I just needed a name that didn't have Field in it, cause I was going a little crazy πŸ˜… .

I've just updated this with a bit more structure. This is my suggestion for how we avoid having to serialize anything to JSON.

Bed now.

brendancooley commented 9 months ago

@thomasaarholt a couple questions on the gist.

Take your simple model:

import polars as pl
from pydantic import BaseModel, Field

class SimpleExample(BaseModel):
    id: str                                                                     # pl.Utf8, required, not nullable, not unique, str, no constraints
    name: str                                                                   # pl.Utf8, required, not nullable, not unique, str, no constraints
    int_with_dtype_value: int = Field(json_schema_extra={"dtype": pl.Int16()})  # pl.Int16, required, not nullable, not unique, int, no constraints
    not_required_bc_has_default: bool = True                                    # pl.Boolean, not required, nullable, not unique, bool, no constraints

  1. Is the intended API that users pass patito-specific field attributes via json_schema_extra?
  2. How will pydantic-side field attributes be handled? e.g. ge, gt, multiple_of, min_length, ...?
  3. Is the intention for the expression/dtype serialization helpers to be implemented on the ColumnInfo class? Then when we override BaseModel.model_fields we know that pydantic will be able to successfully call model_json_schema?
  4. What is the purpose of the type hint processors in the gist? Do we not already have some of this functionality implemented in Model._valid_dtypes?
thomasaarholt commented 9 months ago

Great questions (really, it helps give me clarity too)

  1. No. The intention will be to use a pydantic-like Field (link is to the branch that Jakob and I worked on) that will take the arguments you listed directly and pass them into pydantic's Field.
  2. See 1. :)
  3. My intention would be that the Patito Model overrides the model_json_schema method with something we construct that exports a working json string using the methods discussed with the gist above.
  4. Model._valid_dtypes gets its info from Model._schema_properties which itself calls Model.model_json_schema. That means that it will try to serialize the polars types and expressions, which will fail. So my idea is that we create the valid_dtypes without going through the json route, which means parsing the type hints ourselves.
thomasaarholt commented 9 months ago

Note that in my SimpleExample, I am inheriting from BaseModel and not Patito's Model just because I needed an example that would run without installing Patito. When this all works we would be inheriting from a Patito Model that has e.g. an overwritten model_json_schema and model_dump_json.

brendancooley commented 9 months ago

Great questions (really, it helps give me clarity too)

  1. No. The intention will be to use a pydantic-like Field (link is to the branch that Jakob and I worked on) that will take the arguments you listed directly and pass them into pydantic's Field.
  2. See 1. :)
  3. My intention would be that the Patito Model overrides the model_json_schema method with something we construct that exports a working json string using the methods discussed with the gist above.
  4. Model._valid_dtypes gets its info from Model._schema_properties which itself calls Model.model_json_schema. That means that it will try to serialize the polars types and expressions, which will fail. So my idea is that we create the valid_dtypes without going through the json route, which means parsing the type hints ourselves.

I like the idea of trying to override model_json_schema and handling the serialization on our end. But if we do that then we should be able to use Model._valid_dtypes as originally intended, right?

thomasaarholt commented 9 months ago

Huh. Yes, I guess πŸ˜… That might be much more sensible! Would you like to try giving it an attempt? I will be a bit busy until Friday, but might be able to squeeze some of this in.

brendancooley commented 9 months ago

Huh. Yes, I guess πŸ˜… That might be much more sensible! Would you like to try giving it an attempt? I will be a bit busy until Friday, but might be able to squeeze some of this in.

I'll give it a go!

brendancooley commented 9 months ago

Think we have a nice solution available here. There is a little work to do on the API and how to store and access the column info but wanted to share the serialization progress before proceeding. As you suggested, let's store patito-specific field attributes on a pydantic model:


from typing import Sequence, Any
import json

from pydantic import BaseModel, field_serializer
import polars as pl
from polars.datatypes import convert, DataTypeClass

class ColumnInfo(BaseModel, arbitrary_types_allowed=True):
    dtype: DataTypeClass | None = None  # TODO polars migrating onto using instances?  https://github.com/pola-rs/polars/issues/6163
    constraints: pl.Expr | Sequence[pl.Expr] | None = None
    derived_from: str | pl.Expr | None = None
    unique: bool | None = None

    @field_serializer('constraints', "derived_from")
    def serialize_exprs(self, exprs: str | pl.Expr | Sequence[pl.Expr] | None) -> Any:
        if exprs is None:
            return None
        elif isinstance(exprs, str):
            return exprs
        elif isinstance(exprs, pl.Expr):
            return self._serialize_expr(exprs)
        elif isinstance(exprs, Sequence):
            return [self._serialize_expr(c) for c in exprs]
        else:
            raise ValueError(f"Invalid type for exprs: {type(exprs)}")

    def _serialize_expr(self, expr: pl.Expr) -> dict:
        if isinstance(expr, pl.Expr):
            return json.loads(expr.meta.write_json(None))  # can we access the dictionary directly?
        else:
            raise ValueError(f"Invalid type for expr: {type(expr)}")

    @field_serializer('dtype')
    def serialize_dtype(self, dtype: DataTypeClass | None) -> Any:
        """

        References
        ----------
            [1] https://stackoverflow.com/questions/76572310/how-to-serialize-deserialize-polars-datatypes
        """
        if dtype is None:
            return None
        elif isinstance(dtype, DataTypeClass):
            return parse_composite_dtype(dtype)
        else:
            raise ValueError(f"Invalid type for dtype: {type(dtype)}")

We use the field_serializer decorators to ensure that polars expressions and dtypes are serialized properly. The dtype serialization below is a very clean implementation from @radugrosu on Stack Overflow:

def parse_composite_dtype(dtype: DataTypeClass) -> str:
    if dtype.is_nested:  # TODO deprecated, move onto lookup
        return f"{convert.DataTypeMappings.DTYPE_TO_FFINAME[dtype.base_type()]}[{parse_composite_dtype(dtype.inner)}]"
    else:
        return convert.DataTypeMappings.DTYPE_TO_FFINAME[dtype]

We can go backward (from string to dtype) with the polars helper dtype_short_repr_to_dtype:

def dtype_from_string(v: str):
    """for deserialization"""
    return convert.dtype_short_repr_to_dtype(v)
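
A quick round-trip sanity check with the two helpers above (assuming the FFI short names, e.g. pl.Int16 <-> "i16"):

s = parse_composite_dtype(pl.Int16)  # "i16"
assert dtype_from_string(s) == pl.Int16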

Now we can create an example pydantic model with ColumnInfo objects shoved into the json_schema_extra (we will later expose a nice API for users to pass patito-specific attributes directly to the Field constructor):

from pydantic import fields

class Model(BaseModel):
    a: int
    b: int = fields.Field(json_schema_extra={"column_info": ColumnInfo(constraints=[(pl.col("b") < 10)])})
    c: int = fields.Field(json_schema_extra={"column_info": ColumnInfo(derived_from=pl.col("a") + pl.col("b"))})
    d: int = fields.Field(json_schema_extra={"column_info": ColumnInfo(dtype=pl.UInt8)})
    e: int = fields.Field(json_schema_extra={"column_info": ColumnInfo(unique=True)})

With our custom field_serializers this serializes no problem πŸŽ‰ :

schema = Model.model_json_schema()
schema['properties']['c']['column_info']['derived_from']
>>> {'BinaryExpr': {'left': {...}, 'op': 'Plus', 'right': {...}}}
schema['properties']['d']['column_info']['dtype']
>>> 'u8'

@thomasaarholt let me know if this is consistent with how you were thinking about this problem. I know you mentioned having issues with the field_serializers... does this solve some of the problems you were confronting? If you like this general approach I'll work on redesigning the patito Model internals to assume that the schema can be fully serialized, and try to provide a nicer frontend for patito Field construction.

thomasaarholt commented 9 months ago

I think this looks really great! Please go ahead!

Polars merged the dtype instance thing this weekend, with a release scheduled imminently in 0.18.0!

brendancooley commented 9 months ago

Edits with a first cut at the ColumnInfo implementation here

Serialization via model_json_schema now works nicely thanks to the pydantic field_serializers, but we expose a Model.column_infos property to let the model access the unserialized versions of patito-specific fields. Got all of the existing tests to pass and wanted to get this up to share progress. I'm going to write some more tests and some documentation tomorrow morning and will update with a more detailed usage guide then. As far as I can tell this does not alter the pt.Field API at all; users should have the same experience as in pydantic v1.

Still some work to do to get the signature to populate nicely; I assume right now it will only show the patito-specific fields in autocompletions.
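
For reference, a minimal sketch of the shape such an accessor could take, building on the ColumnInfo model from the earlier comment (hypothetical: modelled as a classmethod here and assuming dict-style json_schema_extra; not the actual PR implementation):

from pydantic import BaseModel

class Model(BaseModel):
    @classmethod
    def column_infos(cls) -> dict[str, ColumnInfo]:
        # recover the unserialized ColumnInfo stored under json_schema_extra
        return {
            name: (field.json_schema_extra or {}).get("column_info", ColumnInfo())
            for name, field in cls.model_fields.items()
        }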

brendancooley commented 8 months ago

https://github.com/JakobGM/patito/pull/32/commits/5f2b89e21d2301cfe3d7ec3a19b694c59ae88f21 creates a standalone dtype inference and validation module in the spirit of @thomasaarholt's wip gist. The main public functions are:

  1. default_polars_dtype_for_annotation: takes a Python annotation and returns a single polars dtype, or None if no polars dtype is considered valid for the annotation.
  2. valid_polars_dtypes_for_annotation: takes a Python annotation and returns the set of polars dtypes that are valid for it, or an empty set if none are (usage sketch below).
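
A hedged usage sketch of those two helpers (names taken from the list above; the exact import path for the new module isn't shown in this thread, so assume both functions are in scope):

import polars as pl

default_polars_dtype_for_annotation(int)        # e.g. pl.Int64
valid_polars_dtypes_for_annotation(int)         # e.g. {pl.Int8, ..., pl.Int64, pl.Float32, pl.Float64}
valid_polars_dtypes_for_annotation(dict)        # set(): no valid dtype for a bare dict annotation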

We now call validate_polars_dtype on any column for which dtype is specified, and validate_annotation on any column for which it is not, when a ModelMetaclass is instantiated. This allows us to proactively validate users' models and ensure that their annotations are consistent with their dtypes and vice versa. If the user tries to create the following model:

import patito as pt
import polars as pl

class InvalidModel(pt.Model):
    text: str = pt.Field(dtype=pl.Float32)

a ValueError will be raised informing them that pl.Float32 is not valid for str-annotated fields. Conversely, if the user attempts to create a model with an unsupported type:

from typing import Dict

class InvalidModel2(pt.Model):
    dict: Dict

a ValueError will be raised informing them that we could not find a polars dtype that supports the Dict annotation.

These make use of pydantic's TypeAdapter and I think make the dtype component of patito much more modular and extensible. I believe they are more robust to nested type annotations than the older code. And it should be much easier to add (for example) support for pl.Struct dtypes in the future (@ion-elgreco).
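
As a small illustration of the TypeAdapter mechanism (my example, not patito code): it gives a JSON-schema view of an arbitrary annotation, which can then be mapped onto candidate polars dtypes.

from typing import Optional
from pydantic import TypeAdapter

TypeAdapter(Optional[int]).json_schema()
# {'anyOf': [{'type': 'integer'}, {'type': 'null'}]}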

Note that pl.Float32 and pl.Float64 are now considered valid dtypes for integer annotations. I don't see any reason why this shouldn't be permitted, and it flowed very naturally from the way I set up the dtypes module, but I'm open to arguments that we should be more restrictive.


Now that we have the ColumnInfo objects I think we can cleanup the example generation and validation components of patito a bit as well, relying a little bit less on the parsing of the model_json_schema and exposing standalone components for easier testing. But this work could be tackled in the future.

A few more short-term to-dos:

ion-elgreco commented 8 months ago

I think pinning the new version to polars>0.20.0 is fine; dropping DuckDB also makes sense imho.

Btw @brendancooley, I have a PR open in polars for your specific use case of creating a local datetime ;)

thomasaarholt commented 8 months ago

Awesome!

Just a few thoughts based on your bullet points! (yes to all!)

  • [ ] are we committed to dropping duckdb support? I can delete associated code there if so

Yes. We can always add it back later if we find it requested.

  • [ ] take a pass at the readme to make sure it is up to date with changes, are there other places where documentation should be added?

There is the docs directory which compiles into these docs. It would be great to go over that and ensure that it is up-to-date.

  • [ ] (optional) pydantic is now annotating json_schema_extra with JsonDict which is a fairly restrictive dictionary type. We may want to serialize our ColumnInfo prior to passing to pydantic.fields.Field in order to comply, but we'd need to work out the deserialization there first.

I need to take a deeper look here, but I think I get what you mean.

I asked on Stack Overflow, and learned a way to serialize based on the field type using model_serializer rather than field_serializer. I played around and wrote a gist that shows how to do this AND deserialize with dataframes and expressions, and I think it should work with your solution for dtypes as well. That might reduce the field_serializer code in this PR by a little bit, and hopefully also let us deserialize the polars types too.
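
A rough sketch of that model_serializer idea (hypothetical ExprHolder model; assumes a polars version where Expr.meta.write_json() returns the JSON string, as used earlier in this thread):

import json

import polars as pl
from pydantic import BaseModel, ConfigDict, model_serializer

class ExprHolder(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    expr: pl.Expr

    @model_serializer
    def _serialize(self) -> dict:
        # emit the expression as a plain dict so it nests in JSON without escaping
        return {"expr": json.loads(self.expr.meta.write_json())}

ExprHolder(expr=pl.col("foo") == 2).model_dump_json()
# '{"expr":{"BinaryExpr":{...}}}'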

  • [ ] with the release of polars 0.20.0 we now have access to pl.Enum dtypes, which make a lot more sense than categorical to use with Literal annotations. Can migrate this logic over if we're in agreement on that (and willing to require polars>=0.20)

That seems very sensible to me!
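
For illustration, a toy sketch of that pairing (my example, not patito code; pl.Enum needs polars>=0.20):

from typing import Literal

import polars as pl

Color = Literal["red", "green", "blue"]

# a closed Literal annotation maps naturally onto a closed pl.Enum dtype
dtype = pl.Enum(["red", "green", "blue"])
s = pl.Series("color", ["red", "blue"], dtype=dtype)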

  • [ ] I've jumped ahead a bit and used the python 3.10 syntax (int | None) in a bunch of places in the tests. If we want to keep python 3.9 support I can go back and turn these into Optionals and ensure that we pass the tests with python 3.9

As long as you aren't "instantiating" type aliases using the | operator (e.g. Foo = Bar | Baz; note the equals sign), but only using them as type hints (e.g. foo: Bar | Baz), you can add from __future__ import annotations and use the new syntax in Python 3.9.
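
Concretely, a minimal sketch:

from __future__ import annotations

def f(x: int | None) -> str | None:  # fine on 3.9: hints stay as unevaluated strings
    return None if x is None else str(x)

# IntOrNone = int | None  # NOT fine on 3.9: evaluated at runtime, raises TypeError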

ion-elgreco commented 8 months ago

@brendancooley I've added the to_local_datetime functionality in polars-xdt: https://marcogorelli.github.io/polars-xdt-docs/api/polars_xdt.ExprXDTNamespace.to_local_datetime.html, that would solve your use case :D There is also a from_local_datetime if you want to round-trip.

Also, I was wondering how the migration is progressing? πŸ‘€ I honestly can't wait to start using this hehe; it would allow me to make more flexible patito models than currently possible.

brendancooley commented 8 months ago

@brendancooley I've added the to_local_datetime functionality in polars-xdt: https://marcogorelli.github.io/polars-xdt-docs/api/polars_xdt.ExprXDTNamespace.to_local_datetime.html, that would solve your use case :D There is also a from_local_datetime if you want to round-trip.

Also, I was wondering how the migration is progressing? πŸ‘€ I honestly can't wait to start using this hehe; it would allow me to make more flexible patito models than currently possible.

Hoping to carve out time later this week to finish this up! Just need to polish and follow up on @thomasaarholt's notes from above and I think we'll have something ready to build upon.

ion-elgreco commented 8 months ago

@brendancooley alright, let me know if you need help with testing

brendancooley commented 7 months ago

Docstrings and a new readme to come, but lean, mean, db-free patito is pushed. I'll work on getting summaries of the functionality onto the readme so that the changes explain themselves. But in the interim @ion-elgreco would love to have you take it for a spin and let me know how it looks.

ion-elgreco commented 7 months ago

@brendancooley πŸš€πŸš€, I'm going to try it out now

ion-elgreco commented 7 months ago

@brendancooley datetime doesn't work with .examples() if the dtype doesn't contain a time zone:

This will fix it:

from datetime import datetime
from zoneinfo import ZoneInfo

# inside the example generator, when constructing the placeholder datetime:
if dtype.time_zone is not None:
    tzinfo = ZoneInfo(dtype.time_zone)
else:
    tzinfo = None
return datetime(year=1970, month=1, day=1, tzinfo=tzinfo)
ion-elgreco commented 7 months ago

Adding dtype in pt.Field(dtype=) breaks validation for datetimes:

from datetime import datetime
import patito as pt
import polars as pl

class Test(pt.Model):
    date_value: datetime = pt.Field(dtype=pl.Datetime)

Test.validate(pl.DataFrame({"date_value": [datetime(2020, 1, 1)]}))

DataFrameValidationError: 1 validation error for Test
date_value
  Polars dtype Datetime(time_unit='us', time_zone=None) does not match model field type. (type=type_error.columndtype)

While this works:

class Test(pt.Model):
    date_value: datetime

Test.validate(pl.DataFrame({"date_value": [datetime(2020,1,1)]}))
ion-elgreco commented 7 months ago

GE constraint not respected while creating examples:

import patito as pt

class Test(pt.Model):
    value: int = pt.Field(ge=0)

print(Test.examples())
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚ value β”‚
β”‚ ---   β”‚
β”‚ i64   β”‚
β•žβ•β•β•β•β•β•β•β•‘
β”‚ -1    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜

This should return 0 or larger :)

Good to mention this only happens when ge=0; otherwise it works fine.

LOL, Python evaluates 0.0 or None as None, while None or 0.0 is 0.0
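
In other words:

print(0.0 or None)  # None: 0.0 is falsy, so `or` falls through
print(None or 0.0)  # 0.0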

@brendancooley this will fix it:

minimum = properties.get("minimum")
exclusive_minimum = properties.get("exclusiveMinimum")
maximum = properties.get("maximum")
exclusive_maximum = properties.get("exclusiveMaximum")

lower = minimum if minimum is not None else exclusive_minimum
upper = maximum if maximum is not None else exclusive_maximum
ion-elgreco commented 7 months ago

So, besides these small bugs everything works as expected!

brendancooley commented 7 months ago

awesome @ion-elgreco, should be good to go on all of these now

ion-elgreco commented 7 months ago

@brendancooley awesome! Then I think it's in a state to release now ☺️

brendancooley commented 7 months ago

Left a couple small comments, really nice work @brendancooley!

Hope we can get this merged soon @thomasaarholt :)

Useful for knowing where to flesh out the docs -- thanks!

ion-elgreco commented 6 months ago

Awesome! @thomasaarholt can't wait for the release :D

thomasaarholt commented 6 months ago

Doing it right now

thomasaarholt commented 6 months ago

Released now on GitHub, should be on PyPI shortly!

thomasaarholt commented 6 months ago

I see that the pypi release action failed, because everything has to pass. 😬 I'll see if I can tone down the requirements this evening.

thomasaarholt commented 6 months ago

We are live now with 0.6.1! https://pypi.org/project/patito/

ion-elgreco commented 6 months ago

Woohoo πŸŽ‰πŸŽ‰

brendancooley commented 6 months ago

Awesome, thanks @thomasaarholt. Excited to do more with this, glad that we have a stable foundation to build on top of.