Open nhoening opened 3 years ago
I think we can use pydantic and pandera to validate dataframes.
Check this simple example:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


# Note: newer pandera releases name this base class pa.DataFrameModel.
class OutputSchema(pa.SchemaModel):
    """Schema for testing dataframe."""

    column1: Series[int] = pa.Field(nullable=False)
    column2: Series[str] = pa.Field(nullable=False)

    class Config:
        """Only allow columns that are defined in the schema."""

        strict = True


@pa.check_types(lazy=True)
def validate_dataframe(data: DataFrame) -> DataFrame[OutputSchema]:
    # The decorator validates the returned dataframe against OutputSchema.
    return data


# data example (both columns deliberately contain one value of the wrong type)
original_data = {
    "column1": [1, 2, 3, 4, "5"],
    "column2": ["1", "2", "4", "5", 1],
}

# create dataframe
df = pd.DataFrame(original_data)

try:
    dataframe = validate_dataframe(df)
    print(dataframe)
except pa.errors.SchemaErrors as error:
    # With lazy=True, pandera collects all failures and raises SchemaErrors (plural).
    print(error)
@nhoening @Flix6x
We cannot validate DataFrame data with the Marshmallow code we wrote for the API parts of FlexMeasures.
I believe we should validate this data separately, since the CLI functions will often handle such data. But this architecture discussion is ongoing, see here.