JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
270 stars 23 forks source link

Very rough draft for pydantic v2 #4

Closed thomasaarholt closed 10 months ago

thomasaarholt commented 1 year ago
ion-elgreco commented 1 year ago

@thomasaarholt What else needs to be done to push this over the edge?

ion-elgreco commented 1 year ago

@JakobGM @thomasaarholt Polars just bumped the minimum version for pydantic to v2.0 https://github.com/pola-rs/polars/pull/10923. It would be nice if we can also get pydantic 2.0 support in patito, otherwise users are stuck on Polars 0.19.1.

thomasaarholt commented 1 year ago

I completely agree. Jakob and I are both on very full schedules at the moment and are trying to find time to finish off the work that we started in this branch: https://github.com/JakobGM/patito/tree/jakobgm/patito-v2

ion-elgreco commented 1 year ago

I see! If there are some small things I could help out with; I can try to familiarize myself more with the codebase and make some additions, but I probably would need some pointers.

stinodego commented 1 year ago

@JakobGM @thomasaarholt Polars just bumped the minimum version for pydantic to v2.0 pola-rs/polars#10923. It would be nice if we can also get pydantic 2.0 support in patito, otherwise users are stuck on Polars 0.19.1.

We did not realize this dependency clash with patito. We'll revert this change for now, so you'll be able to use 0.19.3 and later.

thomasaarholt commented 1 year ago

Thanks @stinodego. Hopefully won't be much longer before we bump it ourselves.

brendancooley commented 10 months ago

If my understanding is correct, a major outstanding issue is the forced json conversion of pydantic v2 BaseModel schemas via model_json_schema. This new interface to the Field metadata, combined with removing support for arbitrary extensions to the Field attributes, breaks patito's interface for specifying column dtypes.

pt.Field(dtype=pl.UInt64)

now raises

warnings.PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'dtype').

Attempting to modernize to

pt.Field(json_schema_extra={"dtype":pl.UInt64})

raises

pydantic_core._pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'polars.datatypes.classes.DataTypeClass'>

when the user calls BaseModel.model_json_schema(), which is the v2 accessor for the field properties.

One possibility would be to specify the dtypes with string equivalents, but polars does not seem to support this style. str(pl.UInt64) -> "Int64" but they do not seem to have implemented an inverse map.

Another possibility would be to subclass pydantic.Field, add dtype as a supported attribute, and add handlers for converting back and forth between polars DataTypes and string representations of those. This would have the added advantage of allowing patito to specify other attributes (unique seems like another one that might be good to add as a typed field), instead of hacking the functionality via the json_schema_extra dict.


Using the example in test_dataframe_model_dtype_casting (test_polars):

class DTypeModel(pt.Model):
    implicit_int_1: int
    implicit_int_2: int
    explicit_uint: int = pt.Field(json_schema_extra={"dtype":pl.UInt64})
    implicit_date: date
    implicit_datetime: datetime

Calling DTypeModel.model_fields['explicit_uint'] returns FieldInfo(annotation=int, required=True, json_schema_extra={'dtype': UInt64}) no problem. So perhaps in the short term we could avoid calls to model_json_schema and use the model_fields attribute instead?

ion-elgreco commented 10 months ago

@brendancooley with regards to the datatype serialisation not working is something which needs to be done upstream. I think it's just a matter of using Serde in Polars.

API wise it would be less ideal to have to pass all these custom fields to json_schema_extra. The Patito API should handle that.

brendancooley commented 10 months ago

@brendancooley with regards to the datatype serialisation not working is something which needs to be done upstream. I think it's just a matter of using Serde in Polars.

API wise it would be less ideal to have to pass all these custom fields to json_schema_extra. The Patito API should handle that.

Very much agree that patito custom fields ought to be exposed directly. Is there an open issue for polars dtype serialization?

ion-elgreco commented 10 months ago

@brendancooley I don't think so, can you create one in Polars?

A while ago I created one for pl.struct and it was added within 2 days.

If @thomasaarholt @JakobGM could share what perhaps is still outstanding then I think the community can contribute to those things to speed up the migration :)

brendancooley commented 10 months ago

@brendancooley I don't think so, can you create one in Polars?

A while ago I created one for pl.struct and it was added within 2 days.

If @thomasaarholt @JakobGM could share what perhaps is still outstanding then I think the community can contribute to those things to speed up the migration :)

Looks like we would also need them to support the serialization of Expressions, to make the derived_from field work properly. The test derive model currently fails when you call model_json_schema on it

class DerivedModel(pt.Model):
    underived: int
    const_derived: int = pt.Field(json_schema_extra={"derived_from": pl.lit(3)})
    column_derived: int = pt.Field(json_schema_extra={"derived_from": "underived"})
    expr_derived: int = pt.Field(json_schema_extra={"derived_from": 2 * pl.col("underived")})
    second_order_derived: int = pt.Field(json_schema_extra={"derived_from": 2 * pl.col("expr_derived")})

with

pydantic_core._pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'polars.expr.expr.Expr'>
brendancooley commented 10 months ago

Working on collecting a list of sub-issues for this on #28

brendancooley commented 10 months ago

Took a hack at finishing this up on #32. Feedback from existing users would be very helpful. In particular if folks could add tests that isolate issues introduced by upgrading onto this prototype would be very helpful.

thomasaarholt commented 10 months ago

Superseded by either #32 or #33, or at least a combination 😊