Add field validation on structs

dsgibbons commented 4 months ago

Closes #43.

This fix takes advantage of the existing validation logic by recursively calling _find_errors whenever pl.Struct or pl.List(pl.Struct) columns are detected.

This fix exposes a (possible) bug with one of the pre-existing test cases test_model.test_missing_date_struct. I've currently added a @pytest.mark.skip to the failing test. Is this actually a bug, and if so, can I continue to skip this test case and raise a new issue to fix it? Do you have any insights into why Test.model_fields['c'].annotation.columns does not exist?

thomasaarholt commented 4 months ago

I think this is a seriously elegant solution, and I really hope we can make it work for the general case.

I think the skipped test error is related to the following example:

Consider a nullable struct column:

import patito as pt

class XY(pt.Model):
    x: int
    y: int

class Coord(pt.Model):
    id: int
    xy: Optional[XY] # <-- nullable

What we are saying here is that we allow the following values for xy:

{"x":1, y:1}
null

But not

{"x":1, y:null}

Now let's take a look at how polars materializes the above three column values:

import polars as pl

schema = {"id": pl.Int64, "xy": pl.Struct({"x": pl.Int64, "y": pl.Int64})}
# note the difference between the two last elements in the struct:
data = {"id": [0, 1, 2], "xy": [{"x": 1, "y": 1}, None, {"x": None, "y": None}]}

df = pl.DataFrame(data, schema=schema)
print(df)
---
shape: (3, 2)
┌─────┬─────────────┐
│ id  ┆ xy          │
│ --- ┆ ---         │
│ i64 ┆ struct[2]   │
╞═════╪═════════════╡
│ 0   ┆ {1,1}       │
│ 1   ┆ {null,null} │
│ 2   ┆ {null,null} │
└─────┴─────────────┘

I think I was expecting null for the second row, that is, looking only at the last two rows:

shape: (3, 2)
┌─────┬─────────────┐
│ id  ┆ xy          │
│ --- ┆ ---         │
│ i64 ┆ struct[2]   │
╞═════╪═════════════╡
│ 1   ┆ null        │
│ 2   ┆ {null,null} │
└─────┴─────────────┘

I'm going to have to have a little think about how we do this one 😊

dsgibbons commented 4 months ago

@thomasaarholt I've updated the PR to correctly select Type[Model] when dealing with Optional and have applied filters on all-null subrows to encode the nullable requirement, as discussed above.

Since polars coerces None -> {null, null, ...} when dealing with structs, I think this fix is valid. The only way I can see a more "legitimate" way of implementing this would be if polars itself either:

Provided a pl.Optional datatype for marking structs as optional.
Changed its default coercion logic for handling None in the context of structs. Like @thomasaarholt, I was surprised to discover how polars currently coerces None for structs, as every other datatype I have come across (including pl.List) is simply encoded as null... there must be some good reason for doing this that is unlikely to change.

I'll likely update this PR to include a few more test cases for testing optional lists of structs and optional deeply nested structs, but I don't think the current fix will need to change. What do you think?

thomasaarholt commented 4 months ago

Super! I overall agree with you - I've asked here on the discord just to make sure the implementation is correct. This is kinda an edge case, but someone might have an opinion.

chainyo commented 3 months ago

Hi any news on this PR?

thomasaarholt commented 3 months ago

Let's go ahead and merge this, as well make a new release. @dsgibbons, if you want to make a few more test cases, especially demonstrating null behaviour, in another PR, then I'd appreciate it.

dsgibbons commented 3 months ago

Yes, I'm still looking at adding additional test cases @thomasaarholt. I've been a bit busy lately but hopefully I'll be able to add more test cases over the next few weeks. Thanks for accepting the PR!

JakobGM / patito

Add field validation on structs #52