JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

Syntax for specifying missing columns #59

Open thomasaarholt opened 4 months ago

thomasaarholt commented 4 months ago

Currently, a type specification of Optional[int] means that a column must be of integer type but may contain nulls.

There is currently no syntax for declaring that a column is allowed to be missing altogether.

One current workaround is to call Foo.validate(df, allow_missing_columns=True). The allow_missing_columns argument is passed on to _find_errors as a kwarg; we should make it an explicit parameter.

The following example contains a suggestion for how we could allow missing columns (see c). It is one that @JakobGM came up with last year.

import patito as pt
from typing import Optional

class Foo(pt.Model):
    a: int # only ints
    b: Optional[int] # mix ints and nulls
    c: int = None # column may be missing, but if it's there it must be an int - but this fails a type check
    d: Optional[int] = None # column may be missing, but if it's there it must be an int

An alternative would be to use pt.Field / ColumnInfo, as in the following. I might like this better, simply because it passes type checks.

class Foo(pt.Model):
    c: int = pt.Field(allow_missing=True)
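
To make the intended semantics concrete, here is a minimal plain-Python sketch of how nullability and the proposed allow_missing flag could be treated as two independent settings. It does not use patito or polars; ColumnSpec and check_columns are hypothetical names invented for illustration, and columns are modelled as plain lists rather than polars Series.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for per-column settings (not patito's API)."""

    dtype: type
    nullable: bool = False       # roughly what Optional[...] expresses today
    allow_missing: bool = False  # the flag proposed in this issue


def check_columns(
    columns: Dict[str, List[Any]], schema: Dict[str, ColumnSpec]
) -> List[str]:
    """Return one error message per offending column (empty list if valid)."""
    errors = []
    for name, spec in schema.items():
        if name not in columns:
            # A missing column is only an error when it is not allowed to be missing.
            if not spec.allow_missing:
                errors.append(f"{name}: missing column")
            continue
        for value in columns[name]:
            if value is None:
                if not spec.nullable:
                    errors.append(f"{name}: null value in non-nullable column")
                    break
            elif not isinstance(value, spec.dtype):
                errors.append(f"{name}: expected {spec.dtype.__name__}")
                break
    return errors


# Mirrors fields a, b and c from the example model above.
schema = {
    "a": ColumnSpec(int),
    "b": ColumnSpec(int, nullable=True),
    "c": ColumnSpec(int, allow_missing=True),
}
```

With this split, a frame without c validates, while a frame where c is present but contains nulls does not, since allow_missing says nothing about nullability.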

I am very open to ideas here. Does anyone have a suggestion? Tagging a few possibly-interested parties, @brendancooley, @dsgibbons, @ion-elgreco

ion-elgreco commented 4 months ago

Hey @thomasaarholt, I would prefer pt.Field(allow_missing=True). With c: int = None it's not entirely clear what is happening, and it's also unclear how one could still attach a pt.Field with specific settings to that field.

A kwarg in pt.Field seems the clearest and most flexible option to me.

And also I like the idea of allowing specific columns to be missing! :)

dsgibbons commented 4 months ago

+1 for pt.Field(allow_missing=True)

brendancooley commented 4 months ago

+1 for allow_missing. A related feature to consider is validating the column dependencies of derived_from and constraints expressions. We can use expr.meta.root_names() to find the columns required to compute a derivation or constraint, and then check

  1. that the column is present on the model and
  2. that allow_missing is False

Perhaps we could insert these checks into the Model.validate_schema method.
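
The two checks above could look roughly like the following plain-Python sketch. It assumes the root column names have already been extracted for each derived column or constraint (in practice, presumably via expr.meta.root_names()); check_dependencies and its parameters are hypothetical names, not part of patito.

```python
from typing import Dict, List, Set


def check_dependencies(
    model_columns: Set[str],
    allow_missing: Set[str],
    dependencies: Dict[str, List[str]],
) -> List[str]:
    """Validate that every root column of a derivation or constraint is usable.

    dependencies maps a derived column or constraint name to the root column
    names its expression needs (assumed to be extracted beforehand).
    """
    errors = []
    for target, roots in dependencies.items():
        for root in roots:
            if root not in model_columns:
                # Check 1: the column must be present on the model.
                errors.append(f"{target}: depends on unknown column '{root}'")
            elif root in allow_missing:
                # Check 2: the column must not be allowed to be missing.
                errors.append(f"{target}: depends on '{root}', which may be missing")
    return errors
```

Running a check like this at schema-validation time would surface a broken dependency when the model is defined, rather than when a derivation fails on a frame that lacks the column.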

dsgibbons commented 2 days ago

I may be interested in giving this a go - are we happy to pursue pt.Field(allow_missing=True)?

ion-elgreco commented 2 days ago

@dsgibbons yeah I think there is a consensus on allow_missing! Go ahead : )