Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

Raising errors for pyspark dataframe validation #73

Closed · michal-mmm closed this 2 weeks ago

michal-mmm commented 1 month ago

Description

By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors in the df.pandera.errors attribute.

e.g.

```python
df = metadata["pandera"]["schema"].validate(df)
df.pandera.errors
```

```
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x30ae9c550>, {'SCHEMA': defaultdict(<class 'list'>, {'WRONG_DATATYPE': [{'schema': 'IrisPySparkSchema', 'column': 'sepal_length', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'sepal_width', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_width' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_length', 'check': "dtype('StringType()')", 'error': "expected column 'petal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_width', 'check': "dtype('StringType()')", 'error': "expected column 'petal_width' to have type StringType(), got DoubleType()"}]})})
```

As per the pandera documentation:

This design decision is based on the expectation that most use cases for pyspark SQL dataframes entail a production ETL setting. In these settings, pandera prioritizes completing the production load and saving the data quality issues for downstream rectification.

Context

Currently, there is no way to make validation of pyspark DataFrames fail, except by manually inspecting the df.pandera.errors attribute.

Possible Implementation

To enforce immediate error raising during validation, one can pass lazy=False to the validation method: metadata["pandera"]["schema"].validate(data, lazy=False). This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off entirely with the environment variable export PANDERA_VALIDATION_ENABLED=false, as mentioned in the docs and in #27.
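
For illustration, here is a minimal, self-contained sketch (not plugin code; the schema and data are made up to mirror the iris example above) of what eager validation with lazy=False looks like for a PySpark DataFrame:

```python
# Standalone sketch: eager validation of a PySpark DataFrame with pandera.
# The schema deliberately declares a wrong dtype so validation fails; with
# lazy=False the failure is raised instead of being collected in df.pandera.errors.
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.errors import SchemaError
from pyspark.sql import SparkSession


class IrisPySparkSchema(pa.DataFrameModel):
    sepal_length: T.StringType() = pa.Field()  # the data actually holds doubles


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5.1,), (4.9,)], ["sepal_length"])

try:
    IrisPySparkSchema.validate(df, lazy=False)  # raise immediately on the dtype mismatch
except SchemaError as exc:
    print(f"validation failed eagerly: {exc}")
```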

felipemonroy commented 4 weeks ago

Hi @michal-mmm, I like the idea of adding lazy=False when calling the validation method. We should also consider adding tests with a PySpark dataset (and even others such as Polars) to check that errors are raised.

In the future, we should evaluate how to handle validations with lazy=True, for instance, with an after-pipeline-run hook.
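
For what it's worth, handling lazy validation today boils down to inspecting the accumulated errors yourself; a purely illustrative sketch (not a hook proposal, reusing the metadata/data names from the snippets above) of what such a check could look like:

```python
# Illustrative only: with lazy validation (the PySpark default), errors accumulate
# on the returned DataFrame, so a follow-up step has to inspect them and decide
# whether to fail the run.
validated = metadata["pandera"]["schema"].validate(data)
errors = dict(validated.pandera.errors)
if errors:
    raise ValueError(f"pandera reported data quality issues: {errors}")
```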

felipemonroy commented 3 weeks ago

Hi @michal-mmm, could you open the PR with that change and see what @Galileo-Galilei thinks about it? I am happy to help if you can't.

Galileo-Galilei commented 3 weeks ago

Hi, sorry for not responding earlier. I think we should go forward. More generally, I suggest we support passing arbitrary kwargs through to the validate function:

```yaml
my_dataset:
    type: ...
    filepath: ...
    metadata:
        pandera:
            schema: ...
            validate_kwargs:
                lazy: true
```

and then in the hook:

metadata["pandera"]["schema"].validate(data, **metadata["pandera"]["validate_kwargs"])

Feel free to open a PR, and possibly to suggest a different design.

Galileo-Galilei commented 2 weeks ago

Closed by #78