Closed — brendancooley closed this 5 months ago
I'll build this one locally and see if it will fix the other bugs Patito 0.50 still has
They were also working on v2 here: https://github.com/JakobGM/patito/tree/jakobgm%2Fpatito-v2
Anxious to hear how it goes
@brendancooley nice work! The FieldInfo class nicely fixes the issue with properties being lost after any class method; I solved it a bit differently with v1 in https://github.com/JakobGM/patito/pull/20. Perhaps you can grab my test from there and include it in your PR. I already checked that the test also passes with your changes.
One thing that will still be broken is the examples creation. You need to include this fix https://github.com/JakobGM/patito/pull/20 and then build on top of it, since a type which is `Optional[str]` will now have the parameter `minimum` behind a list in the `anyOf` key.
I will run it tomorrow on our production repo to see if anything breaks : ) (possibly the examples creation will).
Happy to include the test. Did you mean to reference a different issue/PR in your second comment? Not clear on how #20 helps with example string generation, but perhaps I am missing something.
@brendancooley tried it on our repo and tests; I am also getting this static type hint error now:

```
Expression of type "FieldInfo" cannot be assigned to declared type "int"
  "FieldInfo" is incompatible with "int"
```

on this code:

```python
class Test(pt.Model):
    test: int = pt.Field(constraints=pl.struct("a", "b").is_unique())
```
Also, nullable columns are not working. MRE:

```python
class Test(pt.Model):
    foo: str | None = pt.Field(dtype=pl.Utf8)

print(Test.nullable_columns)
# set()
```

The result should be `{'foo'}`.
Thanks @ion-elgreco, I'll add these tests to the PR and start hacking at them tomorrow. Have not looked much into structs but it would be nice if we had more tests for those.
This example where I am using a struct is just a simple trick to do multi-column uniqueness as a constraint.
One thing that is missing in Patito, though probably out of scope for the migration, is proper struct support as a dtype.
@ion-elgreco should be addressed by https://github.com/JakobGM/patito/pull/32/commits/e2bf0d7351ca8750affcbae8f65ce0e48357dcd5
https://github.com/JakobGM/patito/pull/32/commits/017c59b2373705059b951b6a22f8f6dd787c851a also adds some sanity checking that the annotated types match the allowable polars types specified in Field.dtype
Yes, I can replicate these in VS Code if I enable

```json
{
    "python.analysis.typeCheckingMode": "basic"
}
```

in settings.json. Looking into it.
> Expression of type "FieldInfo" cannot be assigned to declared type "int"

This should be handled by https://github.com/JakobGM/patito/pull/32/commits/161300b80c1f898d7246b46bd0a918e6ab818426
#20 is a fix regarding creating proper examples with numerical values. I meant that just taking that fix directly into pydantic v2 won't work: I saw that grabbing the minimum field is now nested if you have a field that is optional.
Yes this makes sense. Let me know if https://github.com/JakobGM/patito/pull/32/commits/9e132bc49baebaf3cf874c3469e61c811df88b74 fixes the issue @ion-elgreco.
I see more bugs at the moment :) with the types it's checking for and the examples it's creating:

`examples` ignores the dtype when it's an optional field; see the test below:

```python
class Test(pt.Model):
    foo: int | None = pt.Field(dtype=pl.UInt32)

assert pl.DataFrame(schema={'foo': pl.UInt32}).schema == Test.examples().schema
```

Also, this executes while it shouldn't:

```python
class Test(pt.Model):
    foo: str | None = pt.Field(dtype=pl.UInt32)
```

Lastly, with datetimes the mapping is not correct:

```python
class Test(pt.Model):
    date: datetime = pt.Field(dtype=pl.Datetime(time_unit="us"))
```

```
ValueError: Invalid dtype Date for column 'date'. Check that specified dtype is allowable for the given type annotations.
```
🐞👟 @ion-elgreco keep 'em coming; hopefully https://github.com/JakobGM/patito/pull/32/commits/93e171e3424f39269ccabbc6b31e00219ed77bd0 addresses these with better logic for column nullability.
@brendancooley nice, all my original unit tests are running now without any issue from what I can see : )
Also very nice to see that this issue will now get resolved: https://github.com/JakobGM/patito/issues/11. I think you can add all the issues I've mentioned to the "closes" list in the PR description 🕺. The new version will be huge 😄
One suggestion on the dtype mismatches: the message could be rewritten like this, to clarify what the current data schema dtype is and what the expected model dtype is:

```
The data field dtype Float32 does not match the model schema field dtype: int | None (type=type_error.columndtype)
```

The old one does not immediately give away what the expected data dtype should be:

```
foo
Polars dtype Float32 does not match model field type. (type=type_error.columndtype)
```
Actually found a couple of bugs:

`.examples()` does not work when you have this class:

```python
class Test(pt.Model):
    foo: list[str] | None = pt.Field(dtype=pl.List(pl.Utf8))

Test.examples({"foo": [['test']]})
# results in
# TypeError: argument of type 'NoneType' is not iterable
```

This is also possible, but shouldn't be, since the type annotation doesn't match the pt.Field dtype. It seems that for any list type annotation there is no corresponding check against the patito dtype; I think you need to do something with recursion here to check all of these nested complex types. Ideally you should be able to provide complex nested dtypes like `list[list[list[datetime]]]`. This would then also work later with `pl.Struct` (https://github.com/JakobGM/patito/issues/30), and then you can create really nice validation models.

```python
class Test(pt.Model):
    foo: list[str] | None = pt.Field(dtype=pl.Utf8)
```

I also noticed that the pt.Field docs are missing at the moment: `Field.__doc__ = FieldDoc.__doc__` does not seem to work properly. Also, I think it would make more sense if we could see the parameters in the function definition; otherwise there is no functioning auto-completion and it's unclear (outside of reading the docs) which parameters are valid.
If you need any help, just let me know ;)
Like this idea; cleaned up the messaging a bit in https://github.com/JakobGM/patito/pull/32/commits/e9dbca6b0d6986fc3b76bda074e26d099c291b1e. You should now see error messages like:

```
Invalid dtype UInt32 for column 'foo'. Allowable polars dtypes for Union[str, NoneType] are: Utf8.
```
@thomasaarholt also dropped the classproperties set via the `ModelMetaclass`. Definitely cleaner than the metaclass + classvar formulation. Went back and forth a bit, but decided to adopt polars' `classproperty` decorator to try to avoid getting caught up in python debates about whether to preserve this functionality: https://stackoverflow.com/questions/76249636/class-properties-in-python-3-11
The test suite does fine with this, but @ion-elgreco lmk if you run into issues.
@brendancooley I got a bit lost; what problem is the different approach to class properties solving here?
The original implementation used a combination of `ClassVar` and `patito.pydantic.ModelMetaclass` to set class properties such as `Model.columns`. Here's the original note:

```python
# -- Class properties set by model metaclass --
# This weird combination of a MetaClass + type annotation
# in order to make the following work simultaneously:
# 1. Make these dynamically constructed properties of the class.
# 2. Have the correct type information for type checkers.
# 3. Allow sphinx-autodoc to construct correct documentation.
# 4. Be compatible with python 3.7.
# Once we drop support for python 3.7, we can replace all of this with just a simple
# combination of @property and @classmethod.
```

Since we're on python^3.8 now, I think this pattern can be safely deprecated. The natural way to do this is

```python
class Model(BaseModel):
    @classmethod
    @property
    def columns(cls):
        ...
```

However, it appears python does not plan on supporting this pattern in the future. In our case, it's extremely useful to be able to mark classproperties this way, so I adopted polars' custom `@classproperty` implementation, which lets us replace the dual `@classmethod`, `@property` syntax with

```python
class Model(BaseModel):
    @classproperty
    def columns(cls):
        ...
```

and not worry about what python does with `@classmethod` and `@property` in future versions.
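For reference, the pattern above can be sketched as a small descriptor. This is a hypothetical minimal version, not polars' exact code:

```python
# Minimal sketch of a `classproperty` descriptor (polars' actual
# implementation may differ in details such as setter handling).
class classproperty:
    """Expose a method as a read-only property on the class itself."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # Invoked for both `Model.columns` and `instance.columns`;
        # we always bind to the owning class, never the instance.
        return self.func(owner)


class Model:
    _fields = {"a": int, "b": str}

    @classproperty
    def columns(cls):
        return list(cls._fields)


print(Model.columns)  # no instance required
```

Unlike the stacked `@classmethod` + `@property` approach, this keeps working regardless of how future python versions treat chained decorators.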
I think the examples generation issue is fixed by https://github.com/JakobGM/patito/pull/32/commits/db44aa344edfa51fec9a9bc114a187b5e780f081
`Field.__doc__` works ok for me? Do you have an example of where it's not appearing? I went ahead and put the patito-custom fields in the `Field` signature, but left off the pydantic ones, the idea being that we want this signature to be robust to possible changes in pydantic's `Field` signature. Do we think version 2 is stable enough for us to copy their args over directly? I haven't closely followed the roadmap over there.
That makes sense! Thanks for explaining, wasn't aware of these differences.
I don't have a reproducible example for that, but when using Pyright and Pylance in VS Code the Field signature was empty. I checked your commit, but I don't think you will see the pydantic parameters now, such as `ge`, `le`, `lt`, `unique`, `pattern`, etc. Ideally these are also visible in the signature.
Also, note to myself: I need to stop quoting, because the comments become immensely long haha
Thinking on the best way to accomplish this. It would be nice if we could develop a framework for extending `FieldInfo` (and updating the signature for `Field`) with arbitrary sets of new arguments. This would make it easy for users to customize `Field` as necessary for their specific use cases.
Would be a lot easier if pydantic's `Field` were actually a class and not a function that spits back out a `FieldInfo`.
Tried combining pydantic args and patito args like this:

```python
field_sig = inspect.signature(Field)
pyd_field_params = inspect.signature(fields.Field).parameters
Field.__signature__ = inspect.Signature(
    parameters=[pyd_field_params["default"]]
    + [v for v in field_sig.parameters.values() if v.name != "kwargs"]
    + [v for v in pyd_field_params.values() if v.name not in ["default", "extra"]]
    + [field_sig.parameters["kwargs"]],
    return_annotation=field_sig.return_annotation,
)
```

but VS Code does not seem to pick up on the signature update. I'm loath to just copy over the pydantic arguments because I suspect they will not be stable...many are already scheduled for deprecation.
https://github.com/JakobGM/patito/pull/32/commits/438974c807c3917da93f2b7264b5f9e4172af7f0 experiments with making the `FieldInfo` class more extensible, by allowing users to define extension classes (as pydantic `Model`s) and generating the `Field` function dynamically. I use this functionality to generate an extended `FieldInfo` with patito custom fields, and demonstrate how it can be used to construct arbitrary `FieldInfo` extensions in `test_model.test_custom_field_info`. This makes it very easy for users to customize their own `Field` objects, which is certainly useful for my use cases and has been requested by some pydantic users (see https://github.com/pydantic/pydantic/issues/6454). Since these classes are dynamically constructed, however, it is very difficult to get the typing right and to generate a valid signature for the `Field` function. I've resorted to manually typing `PatitoFieldInfo`, which helps resolve the patito-custom attributes, but leaves the dynamically constructed `Field` objects without good typing.
@brendancooley @thomasaarholt What is still required to push this over the edge?
I foresee I have more time next couple weeks to help here so let me know.
- `duckdb` support? @ion-elgreco maybe take a pass at this and see if it's easy to get the tests to pass again? If not, delete or deprecate the existing code. FWIW I've been using duckdb to work with patito-modeled tables and haven't felt that explicit support is necessary.
- `pydantic.field`. I'm inclined to leave this as is for now, but open to suggestions for better usability.
- Restructuring (`Model`/`Field`) to better mirror pydantic's internal structure.

@brendancooley I think patito should only focus on Polars for now; I doubt many people use it for the DuckDB functionality.
I think we should drop duckdb support too :)
@brendancooley could you clarify for me how users can extend their requirements using the new `Field`/`FieldInfo`? I'm a bit hesitant to introduce something that doesn't get a static docstring like `Field` does currently. Doesn't `Field(constraints=[<some polars expressions>])` already let one add custom column requirements?
Currently, arguments not explicitly enumerated in pydantic's `Field` factory get thrown into `json_schema_extra`. This is an issue first and foremost because the feature is scheduled for deprecation. It is also a problem because pydantic expects json-serializable arguments; polars `Expr`s are not currently serializable, so this raises an error when we call `model_json_schema` internally with `constraints` or `derived_from` expression args. See #28 for discussion on this.
A secondary issue is that pydantic does not make it easy to subclass `FieldInfo`. They make use of `__slots__` to explicitly enumerate its attributes, and track values passed by users in `_attributes_set` in order to create minimal representations of the field info object. All of the `FieldInfo` attributes (except `annotation` and `required`) not passed by the user default to `_Unset` and are ignored by the `__repr__` constructor, which makes for nice pretty-printing of the field. For example,

```python
from patito.pydantic import fields

f = fields.Field(ge=0)
f
```

returns `FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0)])` without ancillary `attribute_x=None`.
These constraints have admittedly made it difficult to cleanly add patito-specific attributes to the `FieldInfo` objects. The solution here attempts to allow arbitrary extensions to the `FieldInfo` object. Extension attributes are typed as pydantic `BaseModel` instances. For `patito`, this looks like:

```python
class PatitoFieldExtension(BaseModel):
    """Stores type hints and default values for patito-custom field attributes."""

    constraints: pl.Expr | Sequence[pl.Expr] | None = _Unset
    derived_from: str | pl.Expr | None = _Unset
    dtype: PolarsDataType | None = _Unset
    unique: bool | None = _Unset

    model_config = ConfigDict(arbitrary_types_allowed=True)
```
The patito-side `field_info` function takes a collection of extension classes and constructs a subclass of pydantic's `FieldInfo`, with `__slots__` updated with the new arguments from the extension classes. Extension attributes are set in the `__init__`, where we also update the `_attributes_set` attribute to reflect the addition of the new arguments.
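The mechanics of that step can be sketched generically. This is an illustrative toy, not patito's actual code: `BaseFieldInfo` stands in for pydantic's `FieldInfo`, and `make_field_info_class` for the described `field_info` function.

```python
# Sketch: dynamically subclass a slotted base class, extending __slots__
# with extension attributes and tracking user-passed values.
class BaseFieldInfo:  # stand-in for pydantic.fields.FieldInfo
    __slots__ = ("annotation", "_attributes_set")

    def __init__(self, **kwargs):
        self.annotation = kwargs.pop("annotation", None)
        self._attributes_set = dict(kwargs)


def make_field_info_class(extra_attrs):
    """Build a FieldInfo subclass whose __slots__ include the extension attrs."""

    def __init__(self, **kwargs):
        # split extension kwargs from base kwargs
        ext = {k: kwargs.pop(k) for k in extra_attrs if k in kwargs}
        BaseFieldInfo.__init__(self, **kwargs)
        for name in extra_attrs:
            setattr(self, name, ext.get(name))
        # record which extension values the user actually passed
        self._attributes_set.update(ext)

    return type(
        "ExtendedFieldInfo",
        (BaseFieldInfo,),
        {"__slots__": tuple(extra_attrs), "__init__": __init__},
    )


PatitoFieldInfo = make_field_info_class(["dtype", "unique"])
fi = PatitoFieldInfo(annotation=int, unique=True)
```

Because `__slots__` on the subclass only lists the new names, instances stay dict-free and the base class's attribute bookkeeping keeps working.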
Then, the `field` function builds a factory for our custom `FieldInfo`, splitting arguments into those that are pydantic-native and those that belong to extensions. pydantic-native arguments are passed via `pydantic.Field` to ensure that pydantic-side argument processing occurs. Then we extract the attributes that have been set on the pydantic `FieldInfo` and pass them, along with our custom arguments, to the patito-side `FieldInfo`.
The resulting API is equivalent to the previous version, albeit without nice typing:

```python
Field = field(exts=[PatitoFieldExtension])
```

Users can call this `Field` function to generate patito-specific `FieldInfo`s as they did before. `tests.test_model.test_custom_field_info` enumerates some expectations for these custom `FieldInfo` objects.
A nice property of this implementation is that users can now extend patito's `FieldInfo` class by passing their own extensions to the `patito.pydantic.field` factory. This should also help facilitate easy maintenance of the library as we add new features.
The big downside of all of this is that the typing is very difficult. I am very hesitant to copy over pydantic's current `Field` signature because many of the arguments are already scheduled for deprecation; users running different versions of pydantic might have different effective signatures for `patito.Field`. See, for example, `min_items`, but there are many others. It should be possible, though, to collect the arguments passed via the extension classes and those in the `pydantic.Field` signature for the user's current pydantic version and spit out a dynamic docstring. I experimented with this a bit, but it was difficult to implement cleanly.
I'm happy to flesh out the existing documentation on these functions a bit and perhaps add examples. But having some discussion here to align on a vision for how this should work seems valuable. Very open to other ideas or feedback on my representation of how the pydantic-side stuff works. The pydantic folks do seem a little unwilling to support custom `FieldInfo` objects, so I worry our solution may need to be a little hacky if we intend on keeping the existing API; see this issue for the feature request.
> Currently, arguments not explicitly enumerated in pydantic's Field factory get thrown into json_schema_extra. This is an issue first and foremost because the feature is scheduled for deprecation.

Actually, it's the `extra` argument that is being deprecated, in favour of `json_schema_extra` :)
Yes, but `extra` aliases unenumerated keyword args, right? So in the future they (pydantic) plan on supporting extra keyword args only via `json_schema_extra`. So if we were to adhere to these constraints, we would need to ensure that our extra arguments are json-serializable.
My suggestion is going to be to use this approach of passing them from the patito `Field` into the pydantic `Field`, but I'm thinking through your comments now. I'm currently writing a longer post with a suggestion; wanted to keep the conversation going though, so wrote the above first.
This is the reason polars serialization suddenly started throwing errors in pydantic, btw. I think their changes made sense, but were unfortunate for us: https://github.com/pydantic/pydantic/pull/7625
So, both your and jakob/my implementations currently go via the pydantic json interface (`model_json_schema`) to do a lot of the functionality enabled by patito. I propose that instead, we use the `BaseModel.model_fields` interface, which exposes the field information as `FieldInfo` objects.
This means that we don't try to serialize any types, and can use `pt.Field` in the approach I commented above.
We would still, under the hood, be passing non-jsonable content to `json_schema_extra`, but while those arguments are intended to be json-serializable, I don't think pydantic would try to serialize them unless explicitly asked to.
I'm working in a notebook right now, going to try to at least share my approach tonight.
@brendancooley I remember @thomasaarholt created an upstream issue at Polars to make expressions serializable. I don't think this has been picked up yet.
I could take a look and see if I can work on that. I've been dabbling with Rust myself in delta-rs and some plugins, so it would perhaps be a fun challenge to build some Rust code in Polars ^^
Actually nvm, it seems it's already possible :): https://github.com/pola-rs/polars/issues/12172
Hehe, since creating that issue, I've learnt two additional things: a) expressions can already be serialized with `.write_json()`, and b) you can't just concatenate two json strings directly.
To clarify on the second point: if we want to keep the JSON approach, there are two places we might want the fallback mechanism to support serializing polars objects:
```python
import polars as pl
from pydantic import BaseModel, Field


class Foo(BaseModel):
    type_example: pl.Int64  # here as a type hint
    extra_example: int = Field(json_schema_extra={"expression_example": pl.col("foo")})  # here as an extra


Foo.model_json_schema()  # raises an error with either of the above commented-out
```
For the type hint, we would either have to:

1. use a `field_serializer` (which I actually can't manage to get to work), but then we would have to add one for every column of a given name,
2. implement `__get_pydantic_core_schema__` ourselves (example here), or
3. get polars to implement `__get_pydantic_core_schema__` on the classes we want pydantic to serialize.

For the extras example, I think we can only achieve it with points 2 or 3.
I just want to add - I find the pydantic serialization documentation really complicated, and I'm struggling quite a bit with implementing it on a subclass of pl.Float64. My thoughts in the previous post might be wrong, and I'd be very happy to be corrected.
@thomasaarholt Catching up and checking my understanding...if we want to pass and store patito-specific field attributes via pydantic's `json_schema_extra`, we need to either:

- remove all references to `model_json_schema` on the patito side and warn users that this is not a supported attribute for patito models (or possibly implement our own schema representation), OR
- support serialization of polars expressions and dtypes, either by implementing pydantic's `field_serializer` or by getting polars to support it natively

On the question of deserialization...suppose we can serialize dtypes and expressions; do we need to be able to go backward from json -> python object for this approach to work as well? Is this supported by polars?
Could we use the `fallback` kwarg in pydantic's `to_jsonable_python` to pass serialization instructions for these? We'd need to allow this keyword to be passed around during schema construction on the pydantic side, but maybe that's more flexible than working with the `field_serializer` approach.
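The fallback idea can be sketched with the stdlib `json` module's analogous `default` hook: objects the encoder doesn't understand get routed through a user-supplied serializer instead of raising. (`Expr` here is a hypothetical stand-in for a polars expression, not the real class.)

```python
# Sketch: route unknown objects through a fallback serializer.
import json


class Expr:  # stand-in for a non-serializable polars expression
    def __init__(self, source):
        self.source = source


def fallback(obj):
    if isinstance(obj, Expr):
        return {"expr": obj.source}  # a known serializable representation
    raise TypeError(f"not serializable: {obj!r}")


schema = {"column": "foo", "constraint": Expr('col("foo") > 0')}
encoded = json.dumps(schema, default=fallback)
```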
Also, @thomasaarholt, if your notebook scratch work can be shared, I'd be happy to take a look and try to get up to speed on your approach.
Yes! I've just gone over it again, here it is: https://gist.github.com/thomasaarholt/c04ae8ad503a1476e5282e7098eeb1f7
I've called the result for each column a `ColumnInfo` class. I just needed a name that didn't have "Field" in it, cause I was going a little crazy 😅 .
> Catching up and checking my understanding...if we want to pass and store patito-specific field attributes via pydantic's json_schema_extra, we need to either:
>
> - remove all references to model_json_schema on the patito side and warn users that this is not a supported attribute for patito models (or possibly implement our own schema representation) OR
> - support serialization of polars expressions and dtypes, either by implementing pydantic's field_serializer or by getting polars to support it natively

Yep! But on 2, `field_serializer` is not the way to go; it was just me trying to make pydantic somehow serialize a polars type. (I didn't manage it.)
I don't think we need patito models to be json-serializable. At least, I haven't been doing that, though I guess it might be nice to be able to save the patito schema as json? In that case we could just add our own `to_json` and `from_json` methods. We can just convert a `pl.Float64` to a `"PolarsFloat64"` string via a simple mapping, and convert to and from like that.
I see that polars has added `Expr.from_json`, so that can read back in the result of `Expr.write_json`. We can probably use exactly these methods.
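The round-trip these methods would give us can be sketched generically. `ToyExpr` below is illustrative only (the real polars methods live on `Expr.meta` / `Expr` and use polars' own serialization format):

```python
# Sketch: a write_json/from_json round-trip on a toy expression class.
import json


class ToyExpr:
    def __init__(self, op, column):
        self.op = op
        self.column = column

    def write_json(self):
        # serialize the expression to a json string
        return json.dumps({"op": self.op, "column": self.column})

    @classmethod
    def from_json(cls, payload):
        # rebuild an equivalent expression from its json form
        data = json.loads(payload)
        return cls(data["op"], data["column"])


restored = ToyExpr.from_json(ToyExpr("is_unique", "foo").write_json())
```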
Came across an issue with `write_json` worth discussing here. polars expression serialization (perhaps unsurprisingly) is not supported for maps over python callables. For example, the following code:

```python
from datetime import datetime
from typing import Dict

import polars as pl
import pytz


def convert_time_zone(struct: Dict) -> datetime:
    if struct["timezone_id"] is None:
        return None
    return (
        struct["datetime_utc"]
        .astimezone(pytz.timezone(struct["timezone_id"]))
        .replace(tzinfo=None)
    )


expr = pl.struct(["datetime_utc", "timezone_id"]).map_elements(convert_time_zone)
expr.meta.write_json(None)
```

raises

```
ValueError: Error("error raised in python: AttributeError: Can't pickle local object 'Expr.map_elements.<locals>.wrap_f'", line: 0, column: 0)
```
I suspect this is a fairly common pattern for patito users. A minimal patito model using the above expression looks like:

```python
import patito as pt
from pydantic import AwareDatetime


class DatetimeModel(pt.Model):
    datetime_utc: AwareDatetime = pt.Field(dtype=pl.Datetime(time_zone="UTC"))
    timezone_id: str | None
    datetime_local: datetime | None = pt.Field(
        default=None,
        derived_from=pl.struct(["datetime_utc", "timezone_id"]).map_elements(
            convert_time_zone
        ),
        dtype=pl.Datetime,
    )
```
@thomasaarholt any thoughts on how we ought to handle this case?
@brendancooley I honestly don't think this is a common pattern. You can do nearly anything with the Polars API or you use a plugin.
@ion-elgreco yeah, I may just not be polars-savvy enough to do this the "right" way. Do you have a suggestion for how to implement the above expression in pure polars?
@brendancooley I can take a look if you have some sample data
@ion-elgreco sure, how about this:

```python
from datetime import datetime
from typing import Dict

import polars as pl
import pytz


def convert_time_zone(struct: Dict) -> datetime:
    if struct["timezone_id"] is None:
        return None
    return (
        struct["datetime_utc"]
        .astimezone(pytz.timezone(struct["timezone_id"]))
        .replace(tzinfo=None)
    )


df = pl.DataFrame({"datetime_utc": [datetime.now(), datetime.now()], "timezone_id": ["America/New_York", None]})
df = df.with_columns(pl.col("datetime_utc").cast(pl.Datetime(time_zone="UTC")), pl.col("timezone_id"))
df.with_columns(datetime_local=pl.struct(["datetime_utc", "timezone_id"]).map_elements(convert_time_zone))  # current solution
```
Builds upon #4. **All existing tests (excepting `tests/test_duckdb/`) passing after upgrade to `pydantic==2.4.2`, `polars==0.19.11`.**

Closes #11 Closes #28 Closes #26 ...possibly closes others

Summary

- Rework `ValidationError` from pydantic v1 to allow for multiple error reporting; collect data frame validation errors in this object; rename as `DataFrameValidationError` to distinguish from errors raised by pydantic directly.
- Subclass `pydantic.fields.FieldInfo` to accommodate patito-specific field attributes (`constraints`, `derived_from`, `dtype`, `unique`).
- Rework `patito.Model._schema_properties` to append patito-specific field attributes to the field property dictionaries returned by `pydantic.BaseModel.model_json_schema` (`json_schema_extra`).
- Distinguish between `required` and `nullable` columns: `nullable` now refers to columns that allow `pl.Null` as a valid entry; `required` (pydantic-side) refers to columns that do not have a default value, and therefore must be passed to the `BaseModel` constructor.
- Work around a `polars.collect` bug (see #24).

TODO

- `patito.pydantic`

And I'm sure there are other issues introduced by these changes, and bugs that the existing test suite does not yet catch. But hopefully this serves as a template for discussion and new test collection, and can help move the ball forward on getting this package onto pydantic 2. Feedback very welcome.