noklam opened 11 months ago
- In addition to runtime validation through annotated Python
- So I’d be interested in exploring if we could use the new catalog metadata to declare this on a dataset level
- You could then generate data docs like dbt does
- You could then run something like `kedro pandera test` to run over persisted data
So `dbt test` doesn't do runtime validation, it only does checks on persisted data (or *materialised* data if we're being pedantic), so in a running system it's good to have these kinds of checks as part of CI/CD. For Kedro's purposes, validation is important both at runtime and at rest.
(STILL A WIP)
I will try to sum up the trials and errors I made and my current opinion about the design. It is not totally fixed yet, but I think we could make an MVP out of it quite quickly.
My first idea was the following:
Declare the schema in your catalog:

```yaml
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/data/iris.csv
  metadata:
    pandera:
      schema: <pandera_schema> # not sure about either the format or the name, see below
```
and a hook will perform runtime validation:
```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node
from pandera import DataFrameSchema


class PanderaHook:
    @hook_impl
    def after_context_created(self, context) -> None:
        # pseudo code, I don't know the exact syntax: build each DataFrameSchema once
        for dataset in context.catalog._data_sets.values():
            dataset.metadata.pandera.df_schema = DataFrameSchema(dataset.metadata.pandera.schema)

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
        for name, data in inputs.items():
            # pseudo code, I don't know the exact syntax
            df_schema = catalog._data_sets[name].metadata.pandera.df_schema
            df_schema.validate(data)
```
So `kedro run` will validate data before running each node.
Open questions:

- What should we call the metadata key: `kedro_pandera`? `pandera`? Or `pandera` as a second level key?
- `schema`: what if this key contains a path to the schema rather than the schema itself? Even if it is a dictionary, it is not the schema in a pandera way, so what is the less confusing name? Should we import it with `importlib.import_module` (e.g. `my_package.pandera.schema.MyIrisSchema`)? See the sketch below.
- Schemas declared in Python live in `src`, so the user will have to specify the module themselves; we cannot assume too much about the project structure. See: https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
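For illustration, a minimal sketch of how a dotted path declared in the catalog could be resolved with `importlib.import_module`; the helper name and the example path are hypothetical, not part of the plugin:

```python
from importlib import import_module


def load_schema_from_path(dotted_path: str):
    """Resolve e.g. 'my_package.pandera.schema.MyIrisSchema' to the schema object."""
    module_path, _, attribute = dotted_path.rpartition(".")
    module = import_module(module_path)  # import the module part of the path
    return getattr(module, attribute)    # fetch the schema object from it


# hypothetical usage, assuming such a class exists in the user's src/ package
iris_schema = load_schema_from_path("my_package.pandera.schema.MyIrisSchema")
```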
TODO
We can add a configuration file:
Whatever we decide, this should likely be configurable.
Validation can happen during `kedro run` because data will be loaded in memory, but maybe a CLI to check a specific dataset?

Super nice work @Galileo-Galilei, I'm super keen to help get this off the ground. I'm keen to write up my thoughts in detail later on, but I wanted to point to the built-in methods which we should leverage here:
Yes, that's what I have been playing with. For many reasons I'll discuss later, `from_yaml` is hard to use but `deserialize_schema` seems the way to go.
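For context, a minimal round-trip sketch with pandera's `io` helpers (which need the `pandera[io]` extra installed); the column is made up and this is not the plugin's code, just the serialization path under discussion:

```python
import pandera as pa
from pandera.io import deserialize_schema, serialize_schema

# Build a schema in Python, turn it into a plain dict (the YAML-friendly form
# that could live in catalog metadata), then rebuild the schema object from it.
schema = pa.DataFrameSchema({"sepal_length": pa.Column(float, nullable=False)})
serialized = serialize_schema(schema)      # plain, YAML-serialisable dict
rebuilt = deserialize_schema(serialized)   # back to a DataFrameSchema object
```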
Not too much time this week but I'll resume next week and suggest an MVP. Hopefully we can release a 0.1.0 version next week. If you want to go on without me, feel free, @noklam has the rights on the repo.
Thank you again for kicking this off @Galileo-Galilei, I've got a few years' worth of thoughts on this topic so would love to talk about the things I'd like to see in this space.
Also I'm interested in the Frictionless schema standard Pandera has started to support as well, it looks like early days - but I do love an open standard.
As per your thoughts on YAML vs Python, I think we're going to have to manage 3 patterns; users will inevitably want all 3 for different reasons:

- Python schemas living in `src` (although pretty hard for us to control on a plug-in level).
- The `dbt` pattern of running checks on persisted data: you manage the performance penalty in a different way, and some workflows are better suited to something like a nightly check rather than a runtime one.
- A `metadata` integration like the one you suggest above feels really neat.

Other things I'd like to see:

- A `kedro catalog validate` command where all tests on non-memory datasets are run. Bonus points if this somehow scoops up the Python API tests as well as the YAML declarative ones.
- Maybe a `catalog.validate()` method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic `validators` structure: `catalog.validate("dataset_name", SchemaClass)` ... (sketched below)
- Data docs: `dbt` has had this for years and it's just a no-brainer, we could easily generate static docs describing what data is in the catalog, associated metadata and tests.

All in all I'm super excited to help get this off the ground; @noklam, if you could make me an editor that would be great. I'm also going to tag my colleague @mkinegm who has used Pandera a lot at QB and has some very well thought out ideas on the topic. Medium term we should also validate our ideas/roadmap with Niels the maintainer once they're a bit more concrete :).
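A rough sketch of what such a `catalog.validate()`-style helper could look like; the function name, its signature and its behaviour are all assumptions for discussion, not an existing API:

```python
from typing import Any, Optional

import pandera as pa
from kedro.io import DataCatalog


def validate_catalog_entry(
    catalog: DataCatalog, name: str, schema: Optional[pa.DataFrameSchema] = None
) -> Any:
    """Load one catalog entry and validate it against an explicitly passed schema."""
    data = catalog.load(name)
    if schema is not None:
        return schema.validate(data)  # returns the validated (possibly coerced) data
    return data


# hypothetical usage with a schema object defined elsewhere:
# validated = validate_catalog_entry(catalog, "iris", iris_schema)
```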
Very nice thoughts about this. I think it's already worth creating more specific issues for some of them!
Some quick comments:
On `frictionless`: I like the idea too, but I had a lot of import issues while creating the MVP, so the python package may be hard to handle.

On validating a dataset on demand with `pandera`: this is likely the very next thing to work on! Something like:

```python
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
```

With the same logic, maybe a CLI command `kedro pandera validate` would be helpful too; I guess you sometimes just want to check a new dataset quickly (a rough sketch follows below).

On data docs: maybe through `kedro-viz`, but I have absolutely no idea of how to make this work. If we end up talking with pandera maintainers themselves to validate the roadmap that could be great, but we are not even close for now :)
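Purely for illustration, a sketch of how a plugin could expose such a `kedro pandera validate` command as a click group; the command name, the metadata access and the assumption that the schema is already a `DataFrameSchema` object are all hypothetical:

```python
from pathlib import Path

import click
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


@click.group(name="pandera")
def pandera_commands():
    """Commands for the (hypothetical) kedro-pandera plugin."""


@pandera_commands.command()
@click.argument("dataset_name")
def validate(dataset_name):
    """Load one dataset and validate it against the schema declared in its metadata."""
    bootstrap_project(Path.cwd())
    with KedroSession.create() as session:
        context = session.load_context()
        data = context.catalog.load(dataset_name)
        # assumes metadata["pandera"]["schema"] already holds a DataFrameSchema object
        schema = context.catalog._get_dataset(dataset_name).metadata["pandera"]["schema"]
        schema.validate(data)
        click.echo(f"'{dataset_name}' passed validation")
```

Wiring the group into the `kedro` CLI would go through the usual plugin entry points.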
When using `coerce=True` with Pandera schemas, validating a dataset may also change the data content.

I think it would be great if the datasets that are validated are passed as inputs or outputs using their coerced+validated outputs. However, the `before_pipeline_run` hook doesn't seem to allow changing the actual dataset being passed to a node.
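For reference, a minimal standalone example of the behaviour being discussed (the column name and values are made up): with `coerce=True`, the dataframe returned by `validate` carries the converted dtypes, and the open question is which version Kedro should hand to the node.

```python
import pandas as pd
import pandera as pa

# Toy schema: coerce the "quantity" column to integers during validation.
schema = pa.DataFrameSchema({"quantity": pa.Column(int, coerce=True)})

raw = pd.DataFrame({"quantity": ["1", "2", "3"]})  # e.g. read from CSV as strings
validated = schema.validate(raw)                   # returns the coerced dataframe

print(validated["quantity"].dtype)  # int64 after coercion
```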
Just to be sure I get it well, there are 2 separate points here:

When you call `df_schema.validate(df, coerce=True)`, the dataset is modified in place to convert types. This does not work in `kedro-pandera` (the coercion applies only for validation but the dataset is unmodified, i.e. the pandera schema converts a field to `float` but the node uses the data as `int`). You'd like to modify the dataset on the fly so the node receives the same types as the ones which are tested against the dataset. Do I understand correctly?
@Galileo-Galilei In both cases, what I am looking for is that after loading or saving the dataset, the dataset I will work with is the modified dataframe, because of coercion. That way, if a field is coerced to `int`, I can be sure that the data that was saved also saved it as an `int`. I don't think I need 2 different outputs or to choose between them dynamically. In my mind, the only datasets that go into a node, or are saved to locations, are the validated and possibly modified datasets.
I would like to make a PR that changes where exactly the validations happen in `kedro-pandera` to illustrate what I mean.
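To make the idea concrete, here is a rough sketch of one possible approach (my assumption, not the plugin's actual implementation): a `before_node_run` hook implementation can return a dict of dataset names to values, which Kedro uses to override the corresponding node inputs, so the node would receive the coerced frames.

```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class CoercingValidationHook:
    """Sketch: validate node inputs and pass the coerced frames on to the node."""

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any]
    ) -> Dict[str, Any]:
        coerced = {}
        for name, data in inputs.items():
            metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
            schema = metadata.get("pandera", {}).get("df_schema")  # assumed key layout
            if schema is not None:
                # validate() returns the validated (and, with coerce=True, converted) data
                coerced[name] = schema.validate(data)
        # returning a dict from before_node_run overrides the matching node inputs
        return coerced
```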
Description
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do and what shouldn't it do? Why is it important to you?