JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

How to validate an ordered categorical column? #71

Open teddygroves opened 3 months ago

teddygroves commented 3 months ago

This topic probably belongs in a discussion forum, but I couldn't find one for patito. Please let me know if there is a better place to ask.

I would like to use patito to validate a dataframe that has a categorical column with known categories, where the order of the categories matters. Here is what I have so far:

from typing import Literal, get_args

import patito as pt
import polars as pl

class MyModel(pt.Model):
    my_col: Literal["a", "b"]

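# Build the Enum dtype from the Literal annotation so that the category
# order matches the order declared on the model.
my_dtype = pl.Enum([*get_args(MyModel.model_fields["my_col"].annotation)])

good_df = pl.DataFrame({"my_col": pl.Series(["b", "a"], dtype=my_dtype)})
# Same categories, but declared in the opposite order.
bad_df = pl.DataFrame(
    {"my_col": pl.Series(["b", "a"], dtype=pl.Enum(["b", "a"]))}
)

MyModel.validate(good_df)  # passes
MyModel.validate(bad_df)  # fails validation: category order differs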

This passes for good_df and fails for bad_df, as expected. However, I'm not 100% sure that this is the intended use of Literal in a patito model, and it was a little awkward to get the correctly ordered categories into my custom dtype, so I thought I'd ask whether there's a better (or just different) way to do this.
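
For reference, the category-extraction step can be wrapped in a small helper. This is only a sketch using the standard library: `literal_categories` and the plain class `MyColumns` are hypothetical names, and `MyColumns` stands in for the `pt.Model` above on the assumption that only the `Literal` annotation matters here. It relies on `typing.get_args` returning the `Literal` values in the order they were declared.

```python
from typing import Literal, get_args, get_origin, get_type_hints


class MyColumns:
    # Plain-class stand-in for the pt.Model in the question (assumption:
    # only the annotation is needed to recover the category order).
    my_col: Literal["a", "b"]


def literal_categories(cls: type, field: str) -> list[str]:
    """Return the values of a Literal-annotated field in declaration order."""
    annotation = get_type_hints(cls)[field]
    if get_origin(annotation) is not Literal:
        raise TypeError(f"{field!r} is not annotated with Literal")
    # get_args preserves the order in which the Literal values were written.
    return list(get_args(annotation))


categories = literal_categories(MyColumns, "my_col")
print(categories)
```

The resulting list could then be passed to `pl.Enum(...)` in place of the inline `get_args(...)` expression, which keeps the dtype construction in one place if several columns need it.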