Closed dsgibbons closed 3 months ago
I think this is a seriously elegant solution, and I really hope we can make it work for the general case.
I think the skipped test error is related to the following example:
Consider a nullable struct column:
import patito as pt
class XY(pt.Model):
x: int
y: int
class Coord(pt.Model):
id: int
xy: Optional[XY] # <-- nullable
What we are saying here is that we allow the following values for xy
:
{"x":1, y:1}
null
But not
{"x":1, y:null}
Now let's take a look at how polars materializes the above three column values:
import polars as pl
schema = {"id": pl.Int64, "xy": pl.Struct({"x": pl.Int64, "y": pl.Int64})}
# note the difference between the two last elements in the struct:
data = {"id": [0, 1, 2], "xy": [{"x": 1, "y": 1}, None, {"x": None, "y": None}]}
df = pl.DataFrame(data, schema=schema)
print(df)
---
shape: (3, 2)
┌─────┬─────────────┐
│ id ┆ xy │
│ --- ┆ --- │
│ i64 ┆ struct[2] │
╞═════╪═════════════╡
│ 0 ┆ {1,1} │
│ 1 ┆ {null,null} │
│ 2 ┆ {null,null} │
└─────┴─────────────┘
I think I was expecting null
for the second row, that is, looking only at the last two rows:
shape: (3, 2)
┌─────┬─────────────┐
│ id ┆ xy │
│ --- ┆ --- │
│ i64 ┆ struct[2] │
╞═════╪═════════════╡
│ 1 ┆ null │
│ 2 ┆ {null,null} │
└─────┴─────────────┘
I'm going to have to have a little think about how we do this one 😊
@thomasaarholt I've updated the PR to correctly select Type[Model]
when dealing with Optional
and have applied filters on all-null subrows to encode the nullable requirement, as discussed above.
Since polars coerces None -> {null, null, ...}
when dealing with structs, I think this fix is valid. The only way I can see a more "legitimate" way of implementing this would be if polars itself either:
pl.Optional
datatype for marking structs as optional.None
in the context of structs. Like @thomasaarholt, I was surprised to discover how polars currently coerces None
for structs, as every other datatype I have come across (including pl.List
) is simply encoded as null
... there must be some good reason for doing this that is unlikely to change.I'll likely update this PR to include a few more test cases for testing optional lists of structs and optional deeply nested structs, but I don't think the current fix will need to change. What do you think?
Super! I overall agree with you - I've asked here on the discord just to make sure the implementation is correct. This is kinda an edge case, but someone might have an opinion.
Hi any news on this PR?
Let's go ahead and merge this, as well make a new release. @dsgibbons, if you want to make a few more test cases, especially demonstrating null behaviour, in another PR, then I'd appreciate it.
Yes, I'm still looking at adding additional test cases @thomasaarholt. I've been a bit busy lately but hopefully I'll be able to add more test cases over the next few weeks. Thanks for accepting the PR!
Closes #43.
This fix takes advantage of the existing validation logic by recursively calling
_find_errors
wheneverpl.Struct
orpl.List(pl.Struct)
columns are detected.This fix exposes a (possible) bug with one of the pre-existing test cases
test_model.test_missing_date_struct
. I've currently added a@pytest.mark.skip
to the failing test. Is this actually a bug, and if so, can I continue to skip this test case and raise a new issue to fix it? Do you have any insights into whyTest.model_fields['c'].annotation.columns
does not exist?