JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
252 stars 23 forks source link

Bug: Field constraints not evaluated on structs #43

Closed dsgibbons closed 3 months ago

dsgibbons commented 5 months ago

This project looks really interesting, and I look forward to seeing how it develops. The struct support seems quite good, but it doesn't seem to support field constraints at the moment. Here's an example:

import patito as pt
import polars as pl

class Struct(pt.Model):
    x: int
    y: int
    z: int = pt.Field(lt=2)

class MyModel(pt.Model):
    struct_col: Struct
    list_struct_col: list[Struct]

df = pl.DataFrame(
    {
        "struct_col": [{"x": 1, "y": 2, "z": 3}],
        "list_struct_col": [[{"x": 1, "y": 2, "z": 3}, {"x": 1, "y": 2, "z": 3}]],
    }
)

MyModel.validate(df)

This passes validation, despite all of the z fields being greater than 2. I expect to receive errors for both of the columns. Can pt.Field be improved to do more validation inside of structs?

dsgibbons commented 5 months ago

Here's another example based on the constraints kwarg:

import patito as pt
import polars as pl

class Points(pt.Model):
    x_min: int
    x_max: int = pt.Field(constraints=pt.col("x_min") < pt.col("x_max"))

class MyModel(pt.Model):
    points: Points

df = pl.DataFrame({"points": [{"x_min": 3, "x_max": 2}]})

MyModel.validate(df)
dsgibbons commented 4 months ago

I'm not quite sure what has changed since I first wrote this issue, but it's even worse now. If a model contains a struct with a field constraint, for example:

class Struct(pt.Model):
    x: int
    y: int
    z: int = pt.Field(lt=2)

class MyModel(pt.Model):
    struct_col: Struct
    list_struct_col: list[Struct]

The following TypeError is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/.../src/patito/pydantic.py", line 96, in __init__
    cls.DataFrame = DataFrame._construct_dataframe_model_class(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../src/patito/polars.py", line 361, in _construct_dataframe_model_class
    f"{model.model_json_schema()['title']}DataFrame",
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/main.py", line 385, in model_json_schema
    return model_json_schema(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 2153, in model_json_schema
    return schema_generator_instance.generate(cls.__pydantic_core_schema__, mode=mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 427, in generate
    definitions_remapping = self._build_definitions_remapping()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 2105, in _build_definitions_remapping
    return _DefinitionsRemapping.from_prioritized_choices(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 174, in from_prioritized_choices
    schemas_for_alternatives[defs_ref] = _deduplicate_schemas(schemas_for_alternatives[defs_ref])
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 2207, in _deduplicate_schemas
    return list({_make_json_hashable(schema): schema for schema in schemas}.values())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../venv/lib/python3.11/site-packages/pydantic/json_schema.py", line 2207, in <dictcomp>
    return list({_make_json_hashable(schema): schema for schema in schemas}.values())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'ColumnInfo'

Before, the field constraint would simply be ignored. Now, the field constraint prevents the model from even being defined.

dsgibbons commented 4 months ago

Ok, so it appears the problem from my last comment was due to a Pydantic versioning issue. When I first raised this issue, I was using pydantic>=2.5. However, this caused the unit tests to break. For example, when running tests against test_validators.py with pydantic>=2.5:

FAILED tests/test_validators.py::test_custom_constraint_validation - pydantic_core._pydantic_core.PydanticSerializationError: Error calling function `serialize_exprs`...
FAILED tests/test_validators.py::test_anonymous_column_constraints - pydantic_core._pydantic_core.PydanticSerializationError: Error calling function `serialize_exprs`...
FAILED tests/test_validators.py::test_nested_field_attrs - pydantic_core._pydantic_core.PydanticSerializationError: Error calling function `serialize_exprs`...

To get around this issue, I downgraded Pydantic to 2.4.0 in my local dev environment. This fixes the unit tests, but leads to the TypeError from my previous comment. I'll revert to using pydantic>=2.5, and will raise a separate issue to fix the above unit tests. Upon resolving this issue, we should ensure that pydantic>=2.5 is mandated in the dependencies.

thomasaarholt commented 4 months ago

I've just merged #51 which fixes the PydanticSerializationErrors that you commented :)

Your struct example does not error out now, but I haven't yet done anything to enforce validation.

dsgibbons commented 4 months ago

Fantastic! Thanks for that.