Hi @michal-mmm, I like the idea of adding lazy=False when calling the validation method. We should also evaluate including tests with a PySpark dataset (and even others, like Polars) to check that errors are raised.
In the future, we should evaluate how to handle validations with lazy=True, for instance, with an after-pipeline-run hook.
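To make that lazy=True idea concrete, here is a rough, hypothetical sketch of what such a hook could look like. It is not part of kedro-pandera: it assumes datasets were already validated with lazy=True earlier in the run (so the pyspark pandera accessor holds the accumulated errors), and the class name and internal attributes are made up for illustration.

```python
# Hypothetical sketch only: collect lazily recorded pandera errors while the
# pipeline runs and fail once, after the whole pipeline has finished.
from kedro.framework.hooks import hook_impl


class LazyValidationReportHook:
    def __init__(self):
        self._errors = {}

    @hook_impl
    def before_node_run(self, inputs):
        # Inspect node inputs that already carry a pandera accessor,
        # i.e. pyspark DataFrames previously validated with lazy=True.
        for name, data in inputs.items():
            accessor = getattr(data, "pandera", None)
            errors = getattr(accessor, "errors", None)
            if errors:
                self._errors[name] = errors

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Surface everything that was collected lazily during the run.
        if self._errors:
            pipeline_name = run_params.get("pipeline_name") or "__default__"
            raise RuntimeError(
                f"pandera validation errors in pipeline {pipeline_name!r}: {self._errors}"
            )
```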
Hi @michal-mmm, could you open a PR with that change and see what @Galileo-Galilei thinks about it? I am happy to help if you can't.
Hi, sorry for not responding earlier. I think we should go forward. I suggest that we support, more generally, passing kwargs to the validate function:
my_dataset:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schema: ...
      validate_kwargs:
        lazy: true
and then in the hook:
metadata["pandera"]["schema"].validate(data, **metadata["pandera"]["validate_kwargs"])
Feel free to open a PR, or to suggest a different design.
Closed by #78
Description
By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors within the df.pandera.errors attribute (e.g. see the sketch below).
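A minimal illustration of this behaviour, assuming pandera's pyspark support (available from pandera 0.16); the schema, columns, and checks below are made up:

```python
# Minimal illustration (assumed schema and data): a failing check is recorded
# on the DataFrame's pandera accessor rather than raised.
import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession


class ExampleSchema(pa.DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=0)
    name: T.StringType() = pa.Field()


spark = SparkSession.builder.getOrCreate()
spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType()),
        T.StructField("name", T.StringType()),
    ]
)
df = spark.createDataFrame([(-1, "a")], schema=spark_schema)

validated = ExampleSchema.validate(df)   # no exception is raised here
print(dict(validated.pandera.errors))    # the failing gt=0 check is recorded here
```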
This behaviour is described in the pandera documentation.
Context
Currently, validating pyspark DataFrames directly is not possible, except by manually inspecting the df.pandera.errors attribute.
Possible Implementation
To enforce immediate error raising during validation, one can set lazy=False when calling the validation method:
metadata["pandera"]["schema"].validate(data, lazy=False)
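Under this proposal, a failing check would then surface as an exception instead of being recorded. A hypothetical sketch, reusing ExampleSchema and df from the snippet above; the exception type is an assumption based on pandera's error classes, not something confirmed by this issue:

```python
from pandera.errors import SchemaError  # assumed exception type

try:
    ExampleSchema.validate(df, lazy=False)
except SchemaError as exc:
    print(f"validation failed eagerly: {exc}")
```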
This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off using the environment variable PANDERA_VALIDATION_ENABLED (export PANDERA_VALIDATION_ENABLED=false), as mentioned in the docs and in #27.