JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
252 stars 23 forks source link

Add field validation on structs #52

Closed dsgibbons closed 3 months ago

dsgibbons commented 4 months ago

Closes #43.

This fix takes advantage of the existing validation logic by recursively calling _find_errors whenever pl.Struct or pl.List(pl.Struct) columns are detected.

This fix exposes a (possible) bug with one of the pre-existing test cases test_model.test_missing_date_struct. I've currently added a @pytest.mark.skip to the failing test. Is this actually a bug, and if so, can I continue to skip this test case and raise a new issue to fix it? Do you have any insights into why Test.model_fields['c'].annotation.columns does not exist?

thomasaarholt commented 4 months ago

I think this is a seriously elegant solution, and I really hope we can make it work for the general case.

I think the skipped test error is related to the following example:

Consider a nullable struct column:

import patito as pt

class XY(pt.Model):
    x: int
    y: int

class Coord(pt.Model):
    id: int
    xy: Optional[XY] # <-- nullable

What we are saying here is that we allow the following values for xy:

But not

Now let's take a look at how polars materializes the above three column values:

import polars as pl

schema = {"id": pl.Int64, "xy": pl.Struct({"x": pl.Int64, "y": pl.Int64})}
# note the difference between the two last elements in the struct:
data = {"id": [0, 1, 2], "xy": [{"x": 1, "y": 1}, None, {"x": None, "y": None}]}

df = pl.DataFrame(data, schema=schema)
print(df)
---
shape: (3, 2)
┌─────┬─────────────┐
│ id  ┆ xy          │
│ --- ┆ ---         │
│ i64 ┆ struct[2]   │
╞═════╪═════════════╡
│ 0   ┆ {1,1}       │
│ 1   ┆ {null,null} │
│ 2   ┆ {null,null} │
└─────┴─────────────┘

I think I was expecting null for the second row, that is, looking only at the last two rows:

shape: (3, 2)
┌─────┬─────────────┐
│ id  ┆ xy          │
│ --- ┆ ---         │
│ i64 ┆ struct[2]   │
╞═════╪═════════════╡
│ 1   ┆ null        │
│ 2   ┆ {null,null} │
└─────┴─────────────┘ 

I'm going to have to have a little think about how we do this one 😊

dsgibbons commented 4 months ago

@thomasaarholt I've updated the PR to correctly select Type[Model] when dealing with Optional and have applied filters on all-null subrows to encode the nullable requirement, as discussed above.

Since polars coerces None -> {null, null, ...} when dealing with structs, I think this fix is valid. The only way I can see a more "legitimate" way of implementing this would be if polars itself either:

I'll likely update this PR to include a few more test cases for testing optional lists of structs and optional deeply nested structs, but I don't think the current fix will need to change. What do you think?

thomasaarholt commented 4 months ago

Super! I overall agree with you - I've asked here on the discord just to make sure the implementation is correct. This is kinda an edge case, but someone might have an opinion.

chainyo commented 3 months ago

Hi any news on this PR?

thomasaarholt commented 3 months ago

Let's go ahead and merge this, as well make a new release. @dsgibbons, if you want to make a few more test cases, especially demonstrating null behaviour, in another PR, then I'd appreciate it.

dsgibbons commented 3 months ago

Yes, I'm still looking at adding additional test cases @thomasaarholt. I've been a bit busy lately but hopefully I'll be able to add more test cases over the next few weeks. Thanks for accepting the PR!