Closed · chainyo closed this 4 weeks ago
Hiya! Thanks for the interest! We more than welcome PRs! Could you begin by commenting with a few examples of how you would be using it, with the desired behaviour? Then we can use those as tests.
Hi @thomasaarholt, sure, here are some examples:
```python
@field_validator(
    "the_name_of_the_field",
    "the_name_of_the_other_field",
    "man_we_got_another_one",
    mode="before",
)
@classmethod
def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> Union[bool, None]:
    """Normalize the boolean values from the string representation."""
    if isinstance(v, bool):
        return v
    elif v is None:
        return False
    else:
        try:
            _v = v.lower()
            if _v == "true" or _v == "t":
                return True
            elif _v == "false" or _v == "f":
                return False
            else:
                raise ValueError
        except Exception as e:
            raise ValueError(f"Invalid boolean value: {v}") from e
```
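For context, a minimal sketch of how such a "before" validator plugs into a model (the `Flags` class and `is_active` field are hypothetical, just for illustration):

```python
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class Flags(BaseModel):
    is_active: Optional[bool]

    @field_validator("is_active", mode="before")
    @classmethod
    def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> Union[bool, None]:
        """Coerce 'true'/'t'/'false'/'f' strings (any case) to booleans; None becomes False."""
        if isinstance(v, bool):
            return v
        if v is None:
            return False
        value = v.lower()
        if value in ("true", "t"):
            return True
        if value in ("false", "f"):
            return False
        raise ValueError(f"Invalid boolean value: {v}")


print(Flags(is_active="T").is_active)   # True
print(Flags(is_active=None).is_active)  # False
```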
This one is a hack to fix the fact that the age units were capitalized in the dataset, and some units were pluralized while others were singular.
```python
@field_validator("minimum_age_units", "maximum_age_units", mode="before")
@classmethod
def normalize_age_units(cls, v: Union[str, None]) -> Union[str, None]:
    """Normalize the age units to lowercase."""
    if v is None:
        return v
    if isinstance(v, str) and v != "NA":
        if v[-1].lower() != "s":
            _v = f"{v.lower()}s"
        else:
            _v = v.lower()
    else:
        raise ValueError(f"Invalid value for age units: {v}")
    return _v
```
```python
@field_validator("the_field_to_modify", mode="before")
@classmethod
def normalize_databases_from_str(cls, v: str) -> List[str]:
    """Normalize the values from the string representation."""
    if isinstance(v, str):
        return v.split("|")
    else:
        raise TypeError(f"Invalid field value, expected a string, got {type(v)}")
```
I think I understand what you are desiring, but wouldn't you want these field validators to operate on polars dataframes / with polars expressions instead of on python strings / ints?
> I think I understand what you are desiring, but wouldn't you want these field validators to operate on polars dataframes / with polars expressions instead of on python strings / ints?
Sure, it's even better if it's faster than doing this in python. I was using pydantic pre/post validators because I was using pure python + pydantic, but if there is a "polars" way I'm in.
Do you have an example? I still need to validate the defined schemas, and if the operations are run after loading, the schema would be invalid (I'd get str instead of bool, and str instead of list of str).
@chainyo I don't think there's anything stopping you from using these validated pydantic models with patito, e.g. would something like this work in your application?
First, define the validated model:

```python
from typing import Union

from patito import Model
from pydantic import field_validator


class MyModel(Model):
    bool_like: bool

    @field_validator(
        "bool_like",
        mode="before",
    )
    @classmethod
    def normalize_booleans_from_str(cls, v: Union[str, bool, None]) -> Union[bool, None]:
        """Normalize the boolean values from the string representation."""
        if isinstance(v, bool):
            return v
        elif v is None:
            return False
        else:
            try:
                _v = v.lower()
                if _v == "true" or _v == "t":
                    return True
                elif _v == "false" or _v == "f":
                    return False
                else:
                    raise ValueError
            except Exception as e:
                raise ValueError(f"Invalid boolean value: {v}") from e
```
Then, collect the data from your source, validate rows using `MyModel`, and pass them to a model-aware data frame:

```python
my_data = [MyModel(**d) for d in data]
MyModel.DataFrame(my_data)
```
Hi @brendancooley, this is exactly how I define the `field_validator` stuff for now.
For the loading part, I'm mostly reading data from files; at the moment I'm using `read_csv` plus custom `read_json` and `read_parquet` methods:
```python
# CSV or TXT
Model.DataFrame.read_csv("data.csv", has_header=True)

# JSON
@classmethod
def read_json(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
    """Temporary method to read a JSON file until it's implemented in patito."""
    df = cls.DataFrame._from_pydf(polars.read_json(*args, **kwargs)._df)
    return cast(SchemaType, df.derive())

# PARQUET
@classmethod
def read_parquet(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
    """Temporary method to read a Parquet file until it's implemented in patito."""
    df = cls.DataFrame._from_pydf(polars.read_parquet(*args, **kwargs)._df)
    return cast(SchemaType, df.derive())
```
I'm not sure this could be done the way you specify things. I'd also like to keep it fast, so I need to check whether it takes more time or is as fast as the approach you proposed.
Hey guys, a quick update on my side: I figured out how to use the pydantic `field_validators` with polars and patito. It took me some time, but the blocker was a poor understanding of how the code behaves under the hood, and especially the fact that patito doesn't handle loading (for now; maybe that's not the goal at all). Anyway, I had to find a way to use them to transition from python + pydantic to patito in the best way I could for my use case.
Here is a small code snippet:
```python
from pathlib import Path
from typing import List, Optional, Type, TypeVar, Union, cast

import patito
import polars
from pydantic import field_validator

SchemaType = TypeVar("SchemaType", bound="Schema")


class Schema(patito.Model):
    pubmed_id: Optional[List[str]]
    mgd_id: Optional[List[str]]
    omim_id: Optional[List[int]]

    @field_validator("pubmed_id", "mgd_id", "omim_id", mode="before")
    @classmethod
    def normalize_list_ids(cls, s: polars.Series) -> polars.Series:
        """Normalize the ids to a list of strings."""
        return s.str.split("|")

    @classmethod
    def load(cls: Type[SchemaType], file_path: Union[str, Path]) -> SchemaType:
        return cast(SchemaType, cls.read_csv(file_path, has_header=True, separator="\t"))

    @classmethod
    def read_csv(cls: Type[SchemaType], *args, **kwargs) -> SchemaType:
        if not kwargs.get("has_header", True) and "columns" not in kwargs:
            kwargs.setdefault("new_columns", cls.columns)
        # Load every column as String first so the "before" validators can rewrite them.
        naive_dtypes = {k: polars.String for k in cls.columns}
        data = polars.read_csv(*args, **kwargs, dtypes=naive_dtypes)
        for _, validator in cls.__pydantic_decorators__.field_validators.items():
            if validator.info.mode == "before":
                fields_to_update = validator.info.fields
                for field in fields_to_update:
                    data = data.with_columns(validator.func(polars.col(field)))
        return cls.DataFrame(data).cast().derive()


# So now I can do
data = Schema.load(file_path="path.csv")
```
Obviously I simplified the code snippet to a minimal example; in my case I have multiple layers of abstraction, so `load` and `read_csv` aren't defined for each Schema I have but only for common schemas. But IT WORKS!
I also plan to update `read_csv` to handle both the "before" and "after" modes for field_validators, and to skip the naive loading when there are no field_validators.
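That skip could be sketched with the same pydantic internals the snippet above already iterates over (the model names here are hypothetical, and `__pydantic_decorators__` is a pydantic-internal attribute):

```python
from pydantic import BaseModel, field_validator


def has_before_validators(model_cls: type) -> bool:
    """Return True if any field_validator on the model runs in "before" mode.

    Uses pydantic's internal ``__pydantic_decorators__`` registry, the same
    attribute the read_csv snippet loops over.
    """
    return any(
        v.info.mode == "before"
        for v in model_cls.__pydantic_decorators__.field_validators.values()
    )


class WithBefore(BaseModel):
    ids: str

    @field_validator("ids", mode="before")
    @classmethod
    def noop(cls, v: str) -> str:
        return v


class WithoutBefore(BaseModel):
    ids: str


print(has_before_validators(WithBefore))     # True
print(has_before_validators(WithoutBefore))  # False
```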
@thomasaarholt @brendancooley
Hi, I discovered your project one week ago, thanks a lot for the work. I was doing something similar for data validation at my company and, seeing the speed of polars, decided to switch the backend to polars + patito instead of pure python + pydantic 🤗
Btw, I need the pre/post validation from pydantic to be able to manipulate data before it even gets validated (e.g. transform `a|b|c` into a list of str `["a", "b", "c"]` before evaluation). Is it something you had in mind for this package, and/or could I contribute to it by adding this feature? @JakobGM @thomasaarholt @brendancooley
(Another issue talking about it: https://github.com/JakobGM/patito/issues/42)