JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
252 stars, 23 forks

Refactor of Field / FieldCI / ColumnInfo #63

Open thomasaarholt opened 4 months ago

thomasaarholt commented 4 months ago

@brendancooley (and others), I want to refactor the Field function in order to:

The suggestion

I'm considering encoding all the relevant Field parameters inside ColumnInfo, and ensuring that ColumnInfo can serialize (using field_serializer) and deserialize (through validators) all parameters.

This means that we can type-safely pass all these arguments to pydantic's Field using pydantic.fields.Field(json_schema_extra=column_info.model_dump()).

Then, at validation time, we would reconstruct the ColumnInfo object for each column using ColumnInfo.model_validate(some_patito_model.model_fields["some_field"].json_schema_extra), and could then use these objects for validation relatively easily.
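To make the round trip described above concrete, here is a minimal sketch assuming pydantic v2. The ColumnInfo fields shown (unique, derived_from, constraints) and the helper column_info_for are hypothetical placeholders chosen for illustration, not patito's actual definitions. Real polars objects such as dtypes or expressions would additionally need custom serializers and validators, which this sketch leaves out.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class ColumnInfo(BaseModel):
    # Hypothetical subset of patito-specific parameters, kept JSON-friendly.
    unique: bool = False
    derived_from: Optional[str] = None
    constraints: List[str] = []


class Product(BaseModel):
    # The patito-side metadata travels inside pydantic's json_schema_extra.
    product_id: int = Field(json_schema_extra=ColumnInfo(unique=True).model_dump())
    name: str = Field(json_schema_extra=ColumnInfo().model_dump())


def column_info_for(model: type, field_name: str) -> ColumnInfo:
    """Rebuild the ColumnInfo stored on a field (hypothetical helper)."""
    extra = model.model_fields[field_name].json_schema_extra
    # json_schema_extra may be None or a callable; only a dict is handled here.
    return ColumnInfo.model_validate(extra if isinstance(extra, dict) else {})


info = column_info_for(Product, "product_id")
```

Note that the value is read from the field's json_schema_extra attribute rather than from the FieldInfo object itself, since that is where the dumped ColumnInfo was stored.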

However

The only things that I'm a bit unsure about:

Let me know if this seems unclear! Writing this while alternating entertaining a 2.5 year old and a 3 month old 😅

brendancooley commented 4 months ago

Certainly agree that this interface could use a refactor. A few concerns about replicating pydantic args onto ColumnInfo (or another patito-side interface):

  1. keeping up with pydantic's signature will likely require some maintenance work, and it's possible that the signature might change on a minor version. Some pydantic field args (e.g. max_items) are scheduled for deprecation. How do we intend to handle these?
  2. Which pydantic metadata does patito intend to support? Should we compile some of the constraints to polars expressions and append them to constraints (e.g. Ge(0) -> pl.col("foo") >= 0)? How should patito handle strict?

Overall, being able to take an existing pydantic model and quickly convert it into a patito model while retaining object-level validation from pydantic is a very nice feature. But not having our own Field definition makes us reactive to changes on the pydantic side.

Maybe we should start by defining more concretely what constitutes a patito Field (i.e. which elements are required to perform tabular validation and schema specification), and then we can work on the conversion/serialization of a pydantic Field to a patito Field.

pydantic's FieldInfo, for reference: https://github.com/pydantic/pydantic/blob/8aeac1a4c61b084ebecf61b38bb8d3e80884dc33/pydantic/fields.py#L89

My one year old is napping and giving me a chance to catch up on all of this great work and thinking! :)