JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License
252 stars 23 forks

[Feature] Add pre/post validators #49

Closed chainyo closed 4 weeks ago

chainyo commented 4 months ago

Hi, I discovered your project one week ago, thanks a lot for the work. I was doing something similar for data validation at my company and saw the speed of polars so decided to switch the backend to polars + patito instead of pure python + pydantic 🤗

Btw, I need the pre/post validation from pydantic to be able to manipulate data before they even get validated (e.g. transform a|b|c into a list of str [a, b, c] before evaluation).

Is it something you had in mind for this package, and/or could I contribute to it by adding this feature? @JakobGM @thomasaarholt @brendancooley

(Another issue talking about it: https://github.com/JakobGM/patito/issues/42)

thomasaarholt commented 4 months ago

Hiya! Thanks for the interest! We more than welcome PRs! Could you begin by commenting with a few examples of how you would be using it, with the desired behaviour? Then we can use those as tests.

chainyo commented 4 months ago

Hi @thomasaarholt, sure, here are some examples:

  1. Convert strings to booleans so they are directly usable in a "pythonic" way
    @field_validator(
        "the_name_of_the_field",
        "the_name_of_the_other_field",
        "man_we_got_another_one",
        mode="before",
    )
    @classmethod
    def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> Union[bool, None]:
        """Normalize the boolean values from the string representation."""
        if isinstance(v, bool):
            return v
        elif v is None:
            return False
        else:
            try:
                _v = v.lower()
                if _v == "true" or _v == "t":
                    return True
                elif _v == "false" or _v == "f":
                    return False
                else:
                    raise ValueError

            except Exception as e:
                raise ValueError(f"Invalid boolean value: {v}") from e
  2. Normalize poorly formatted age units

This one is a hack to fix the fact that the age units have been capitalized in the dataset and some units have been pluralized while others are singular.

    @field_validator("minimum_age_units", "maximum_age_units", mode="before")
    @classmethod
    def normalize_age_units(cls, v: Union[str, None]) -> Union[str, None]:
        """Normalize the age units to lowercase."""
        if v is None:
            return v

        if isinstance(v, str) and v != "NA":
            if v[-1].lower() != "s":
                _v = f"{v.lower()}s"
            else:
                _v = v.lower()
        else:
            raise ValueError(f"Invalid value for age units: {v}")

        return _v
  3. Convert a pipe-delimited string into a python list of strings
    @field_validator("the_field_to_modify", mode="before")
    @classmethod
    def normalize_databases_from_str(cls, v: str) -> List[str]:
        """Normalize the values from the string representation."""
        if isinstance(v, str):
            return v.split("|")
        else:
            raise TypeError(f"Invalid field value, expected a string, got {type(v)}")
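
For reference, the three "before" validators above can be exercised with plain pydantic before any polars is involved. This is a minimal sketch; the `Record` model and its field names are hypothetical, chosen just to wire the three validators together:

```python
from typing import List, Optional, Union

from pydantic import BaseModel, field_validator


class Record(BaseModel):
    """Hypothetical model combining the three 'before' validators above."""

    active: Optional[bool] = None
    age_units: Optional[str] = None
    databases: List[str] = []

    @field_validator("active", mode="before")
    @classmethod
    def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> bool:
        if isinstance(v, bool):
            return v
        if v is None:
            return False
        if v.lower() in ("true", "t"):
            return True
        if v.lower() in ("false", "f"):
            return False
        raise ValueError(f"Invalid boolean value: {v}")

    @field_validator("age_units", mode="before")
    @classmethod
    def normalize_age_units(cls, v: Optional[str]) -> Optional[str]:
        if v is None:
            return v
        if isinstance(v, str) and v != "NA":
            return v.lower() if v.lower().endswith("s") else f"{v.lower()}s"
        raise ValueError(f"Invalid value for age units: {v}")

    @field_validator("databases", mode="before")
    @classmethod
    def normalize_databases_from_str(cls, v: str) -> List[str]:
        if isinstance(v, str):
            return v.split("|")
        raise TypeError(f"Invalid field value, expected a string, got {type(v)}")


record = Record(active="T", age_units="Week", databases="pubmed|omim|mgd")
# record.active is True, record.age_units == "weeks",
# record.databases == ["pubmed", "omim", "mgd"]
```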

thomasaarholt commented 4 months ago

I think I understand what you are desiring, but wouldn't you want these field validators to operate on polars dataframes / with polars expressions instead of on python strings / ints?

chainyo commented 4 months ago

> I think I understand what you are desiring, but wouldn't you want these field validators to operate on polars dataframes / with polars expressions instead of on python strings / ints?

Sure, it's even better if it's faster than doing this in python. I was using pydantic pre/post validators because I was using pure python + pydantic, but if there is a "polars" way I'm in.

Do you have an example? I still need to validate the defined schemas, and if the operations are made after loading then the schema would be invalid (I'd get str instead of bool, and str instead of list of str).

brendancooley commented 4 months ago

@chainyo I don't think there's anything stopping you from using these validated pydantic models with patito, e.g. would something like this work in your application?

First, define the validated model

from typing import Union

from patito import Model
from pydantic import field_validator

class MyModel(Model):
    bool_like: bool

    @field_validator(
        "bool_like",
        mode="before",
    )
    @classmethod
    def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> Union[bool, None]:
        """Normalize the boolean values from the string representation."""
        if isinstance(v, bool):
            return v
        elif v is None:
            return False
        else:
            try:
                _v = v.lower()
                if _v == "true" or _v == "t":
                    return True
                elif _v == "false" or _v == "f":
                    return False
                else:
                    raise ValueError

            except Exception as e:
                raise ValueError(f"Invalid boolean value: {v}") from e 

Then, collect the data from your source, validate rows using MyModel, and pass to a model-aware data frame:

my_data = [MyModel(**d) for d in data]
MyModel.DataFrame(my_data)
chainyo commented 4 months ago

    my_data = [MyModel(**d) for d in data]
    MyModel.DataFrame(my_data)

Hi @brendancooley, this is exactly how I define the field_validator stuff for now.

For the loading part I'm mostly reading data from files; at the moment I'm using read_csv plus custom read_json and read_parquet methods:

# CSV or TXT
Model.DataFrame.read_csv("data.csv", has_header=True)

# JSON
@classmethod
def read_json(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
    """Temporary method to read a JSON file until it's implemented in patito."""
    df = cls.DataFrame._from_pydf(polars.read_json(*args, **kwargs)._df)
    return cast(SchemaType, df.derive())

# PARQUET
@classmethod
def read_parquet(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
    """Temporary method to read a Parquet file until it's implemented in patito."""
    df = cls.DataFrame._from_pydf(polars.read_parquet(*args, **kwargs)._df)
    return cast(SchemaType, df.derive())

I'm not sure this could be done the way you propose, and I'd like to keep it fast, so I need to check whether your approach takes more time or is just as fast.

chainyo commented 4 months ago

Hey guys, quick update on my side: I figured out how to use the pydantic field_validators with polars and patito. It took me some time, but the blocker was a poor understanding of how the code behaves under the hood, especially the fact that patito doesn't handle loading (for now; maybe that's not the goal at all). Anyway, I had to find a way to use them to transition from python + pydantic to patito as smoothly as I could for my use case.

Here is a small code snippet:

from pathlib import Path
from typing import List, Optional, Type, TypeVar, Union, cast

import patito
import polars
from pydantic import field_validator

SchemaType = TypeVar("SchemaType", bound="Schema")

class Schema(patito.Model):

    pubmed_id: Optional[List[str]]
    mgd_id: Optional[List[str]]
    omim_id: Optional[List[int]]

    @field_validator("pubmed_id", "mgd_id", "omim_id", mode="before")
    @classmethod
    def normalize_list_ids(cls, s: polars.Series) -> polars.Series:
        """Normalize the ids to a list of strings."""
        return s.str.split("|")

    @classmethod
    def load(cls: Type[SchemaType], file_path: Union[str, Path]) -> SchemaType:
        return cast(SchemaType, cls.read_csv(file_path, has_header=True, separator="\t"))

    @classmethod
    def read_csv(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
        if not kwargs.get("has_header", True) and "columns" not in kwargs:
            kwargs.setdefault("new_columns", cls.columns)

        naive_dtypes = {k: polars.String for k in cls.columns}
        data = polars.read_csv(*args, **kwargs, dtypes=naive_dtypes)

        for _, validator in cls.__pydantic_decorators__.field_validators.items():
            if validator.info.mode == "before":
                fields_to_update = validator.info.fields
                for field in fields_to_update:
                    data = data.with_columns(validator.func(polars.col(field)))

        return cls.DataFrame(data).cast().derive()

# So now I can do
data = Schema.load(file_path="path.csv")

Obviously I simplified the code snippet to a minimal example; in my case I have multiple abstractions, so load and read_csv aren't defined on each Schema but only on common base schemas. But IT WORKS!

I also plan to update read_csv to handle both the "before" and "after" modes for field_validators, and to skip the naive loading when there are no field_validators.

@thomasaarholt @brendancooley