Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

What do you want to see in `kedro-pandera`? #12

Open noklam opened 11 months ago

noklam commented 11 months ago

Description

Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?

noklam commented 11 months ago
  1. In addition to runtime validation through annotated Python,
  2. I'd be interested in exploring whether we could use the new catalog metadata to declare this at the dataset level
  3. You could then generate data docs like dbt does
  4. You could then run something like `kedro pandera test` over persisted data

So `dbt test` doesn't do runtime validation; it only checks persisted data (or *materialised* data, if we're being pedantic). In a running system it's good to have these kinds of checks as part of CI/CD. For Kedro's purposes, validation is important both at runtime and at rest.
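
To make the runtime vs. at-rest distinction concrete, here is a minimal pandera sketch (the schema and function names are illustrative, not plugin API):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"sepal_length": pa.Column(float, pa.Check.gt(0))})

# Runtime validation: data is checked as it flows between nodes during a run.
def node_func(df: pd.DataFrame) -> pd.DataFrame:
    return schema.validate(df)

# At-rest validation: persisted data is re-read and checked, dbt-test style.
def validate_at_rest(filepath: str) -> None:
    schema.validate(pd.read_csv(filepath))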

Galileo-Galilei commented 11 months ago

(STILL A WIP)

I will try to sum up my trials and errors and my current opinion about the design. It is not totally fixed yet, but I think we could make an MVP out of it quite quickly.

First attempt: validation at runtime

My first idea was the following:

Declare the schema in your catalog:

iris:
    type: pandas.CSVDataSet
    filepath: /path/to/data/iris.csv
    metadata:
        pandera:
            schema: <pandera_schema> # not sure about either the format or the name, see below

and a hook will perform runtime validation:

from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node
from pandera.io import deserialize_schema


class PanderaHook:
    def __init__(self):
        self._schemas = {}  # dataset name -> DataFrameSchema

    @hook_impl
    def after_context_created(self, context) -> None:
        # Build a schema object once per dataset that declares one in its
        # catalog metadata (deserialize_schema is discussed further below).
        for name in context.catalog.list():
            dataset = context.catalog._get_dataset(name)  # private API
            meta = (getattr(dataset, "metadata", None) or {}).get("pandera")
            if meta is not None:
                self._schemas[name] = deserialize_schema(meta["schema"])

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
        # Validate every node input that has a schema attached.
        for name, data in inputs.items():
            if name in self._schemas:
                self._schemas[name].validate(data)

So `kedro run` will validate the data before running each node.
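
For completeness, the hook would be registered through Kedro's standard `settings.py` mechanism (the import path below is hypothetical):

# src/<package_name>/settings.py
from kedro_pandera.hooks import PanderaHook  # hypothetical import path

HOOKS = (PanderaHook(),)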

Open questions about catalog configuration

Should we use the "metadata" key to store the schema?

How many nested levels should we use under the metadata key?

What should the schema key contain? (see the serialization sketch after these questions)

Which other keys should we have?

see: https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
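
One way to decide what the schema key could contain is to look at pandera's own serialization format; `DataFrameSchema.to_yaml()` is part of pandera's public IO API (the column below is just an example):

import pandera as pa

schema = pa.DataFrameSchema(
    {"sepal_length": pa.Column(float, pa.Check.greater_than(0), nullable=False)}
)

# Dumps a YAML representation (columns, dtypes, checks, ...) that could be
# pasted, or referenced, under metadata.pandera.schema in the catalog.
print(schema.to_yaml())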

Open questions about plugin configuration

TODO

How do we add advanced configuration capabilities to the plugin?

We can add a configuration file:

What level of lazy validation should we enable? (see the sketch below)

When should validation be triggered?

Whatever we decide, this should likely be configurable.
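
For reference, pandera already exposes a per-call switch that the plugin configuration could surface; the `lazy` flag and `SchemaErrors` are real pandera API, while the schema itself is illustrative:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int, pa.Check.ge(0))})
df = pd.DataFrame({"a": [-1, -2]})

try:
    # lazy=True collects all failures instead of raising on the first one.
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # one row per failing check/value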

Temporarily avoid validation, or only for given pipelines

TODO
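
A minimal sketch of one possible escape hatch, assuming an environment variable checked by the hook (the variable name and mechanism are entirely hypothetical):

import os
from typing import Any, Dict

from kedro.framework.hooks import hook_impl


class TogglablePanderaHook:
    """Skips validation when KEDRO_PANDERA_DISABLED=1 is set (hypothetical)."""

    @hook_impl
    def before_node_run(
        self, node, catalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
        if os.environ.get("KEDRO_PANDERA_DISABLED", "0") == "1":
            return  # validation temporarily disabled, e.g. for a backfill run
        ...  # otherwise validate inputs as in the PanderaHook sketch above

A per-pipeline variant could inspect node tags or run params instead of an environment variable.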

Open questions about runtime validation

TODO

CLI

TODO

How can we generate a default schema and tests for a dataset? (see the inference sketch below)

Should we let users generate the schemas of several datasets at the same time?

TODO
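
pandera's schema inference could back such a command; `infer_schema` and `to_yaml` are real pandera API, while the `kedro pandera infer` CLI name is hypothetical:

import pandas as pd
import pandera as pa

df = pd.read_csv("/path/to/data/iris.csv")

# Infer a draft schema from existing data; a hypothetical
# `kedro pandera infer <dataset>` command could dump this next to the
# catalog entry for the user to refine.
schema = pa.infer_schema(df)
print(schema.to_yaml())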

Other desirable features:

datajoely commented 11 months ago

Super nice work @Galileo-Galilei, I'm super keen to help get this off the ground. I'm keen to write up my thoughts in detail later on, but I wanted to point to the built-in methods which we should leverage here:

Galileo-Galilei commented 11 months ago

Yes, that's what I have been playing with. For many reasons I'll discuss later, `from_yaml` is hard to use, but `deserialize_schema` seems the way to go.
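
For context, both entry points live in `pandera.io`; a quick round-trip sketch:

import pandera as pa
from pandera.io import deserialize_schema, serialize_schema

schema = pa.DataFrameSchema({"a": pa.Column(int)})

# from_yaml parses a YAML string or file, while deserialize_schema takes an
# already-parsed dict - which plausibly suits Kedro better, since the catalog
# YAML has already been loaded into dicts by the time a plugin sees it.
serialized = serialize_schema(schema)
roundtripped = deserialize_schema(serialized)
assert list(roundtripped.columns) == list(schema.columns)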

Not too much time this week, but I'll resume next week and suggest an MVP. Hopefully we can release a 0.1.0 version next week. If you want to go on without me, feel free; @noklam has the rights on the repo.

datajoely commented 11 months ago

Thank you again for kicking this off, @Galileo-Galilei. I've got a few years' worth of thoughts on this topic, so I would love to talk about the things I'd like to see in this space.

Also, I'm interested in the Frictionless schema standard Pandera has started to support as well; it looks like early days, but I do love an open standard.
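
pandera's frictionless support lives in `pandera.io`; a sketch assuming `from_frictionless_schema` is given an in-memory dict (it also accepts a file path) and that the frictionless extra is installed:

from pandera.io import from_frictionless_schema

# A minimal frictionless table schema as a dict.
frictionless_schema = {
    "fields": [
        {"name": "sepal_length", "type": "number", "constraints": {"minimum": 0}}
    ]
}

# Converts the frictionless table schema into a pandera DataFrameSchema.
schema = from_frictionless_schema(frictionless_schema)
print(schema)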

As per your thoughts on YAML vs Python, I think we're going to have to manage 3 patterns; users will inevitably want all 3 for different reasons -

1. Online checks

2. Offline checks

3. Interactive workflow (Jupyter)

4. Data docs

5. Datasets in scope

6. Inferred schemas

All in all, I'm super excited to help get this off the ground. @noklam, if you could make me an editor that would be great. I'm also going to tag my colleague @mkinegm, who has used Pandera a lot at QB and has some very well-thought-out ideas on the topic. Medium term, we should also validate our ideas/roadmap with Niels, the pandera maintainer, once they're a bit more concrete :).

Galileo-Galilei commented 11 months ago

Very nice thoughts about this. I think it's already worth creating more specific issues for some of them!

Some quick comments:

It would be great if we end up talking with the pandera maintainers themselves to validate the roadmap, but we're not even close for now :)

Lodewic commented 9 months ago

When using coerce=True with Pandera schemas, validating a dataset may also change the data content.

I think it would be great if validated datasets were passed on as node inputs or outputs in their coerced, validated form. However, the before_pipeline_run hook doesn't seem to allow changing the actual dataset being passed to a node.
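
To illustrate the coercion point: pandera's `validate` returns the coerced frame rather than mutating the input in place (standard pandera behaviour; the column is illustrative):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int, coerce=True)})

df = pd.DataFrame({"a": ["1", "2"]})  # e.g. strings as read from disk
validated = schema.validate(df)

print(df["a"].dtype)         # object: the original frame is untouched
print(validated["a"].dtype)  # int64: only the return value is coerced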

Galileo-Galilei commented 9 months ago

Just to be sure I understand, there are two separate points here:

Lodewic commented 9 months ago

@Galileo-Galilei In both cases, what I am looking for is that, after loading or saving a dataset, the data I work with is the modified dataframe - modified by coercion.

I don't think I need two different outputs, or to choose between them dynamically. In my mind, the only datasets that go into a node, or are saved to their locations, are the validated and possibly modified datasets.
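
For what it's worth, Kedro's `before_node_run` hook can return a dictionary of replacement inputs, which looks like a natural place for this; a minimal sketch (the metadata lookup helper is hypothetical):

from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from pandera.io import deserialize_schema


def _schema_for(name: str, catalog: DataCatalog):
    """Hypothetical helper: build the pandera schema from dataset metadata."""
    dataset = catalog._get_dataset(name)  # private API
    meta = (getattr(dataset, "metadata", None) or {}).get("pandera")
    return deserialize_schema(meta["schema"]) if meta else None


class CoercingPanderaHook:
    @hook_impl
    def before_node_run(
        self, node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> Dict[str, Any]:
        coerced = {}
        for name, data in inputs.items():
            schema = _schema_for(name, catalog)
            if schema is not None:
                # validate() returns the coerced frame; hand that to the node.
                coerced[name] = schema.validate(data)
        # Kedro uses the returned dict to replace the named node inputs.
        return coerced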


I would like to make a PR that changes where exactly the validations happen in kedro-pandera to illustrate what I mean.