noklam opened 11 months ago
- In addition to runtime validation through annotated Python
- So I’d be interested in exploring if we could use the new catalog metadata to declare this on a dataset level
- You could then generate data docs like dbt does
- You could then run something like `kedro pandera test` to run over persisted data
So `dbt test` doesn't do runtime validation, it only does checks on persisted data (or *materialised* data if we're being pedantic), so in a running system it's good to have these kinds of checks as part of CI/CD. For Kedro's purposes, validation is important both at runtime and at rest.
(STILL A WIP)
I will try to sum up the trials and errors I made and my current opinion about the design. It is not totally fixed yet, but I think we could make an MVP out of it quite quickly.
My first idea was the following:
Declare the schema in your catalog:

```yaml
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/data/iris.csv
  metadata:
    pandera:
      schema: <pandera_schema> # not sure about either the format or the name, see below
```
and a hook will perform runtime validation:
```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node
from pandera import DataFrameSchema


class PanderaHook:
    @hook_impl
    def after_context_created(self, context) -> None:
        # pseudo code, I don't know the exact syntax: build each DataFrameSchema once
        for dataset in context.catalog._data_sets.values():
            dataset.metadata.pandera.df_schema = DataFrameSchema(dataset.metadata.pandera.schema)

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
        for name, data in inputs.items():
            # pseudo code, I don't know the exact syntax
            df_schema = catalog._data_sets[name].metadata.pandera.df_schema
            df_schema.validate(data)
```
So `kedro run` will validate data before running each node.
Open questions:

- What should we call the metadata key: `kedro_pandera`? `pandera`? Or `pandera` as a second level key?
- `schema`: what if this key contains a path to the schema rather than the schema itself? Even if it is a dictionary, it is not the schema in a pandera way, so what is the less confusing name? Should we import it with `importlib.import_module` (e.g. `my_package.pandera.schema.MyIrisSchema`)? See the sketch below.
- Schemas declared in Python live in `src`, so the user will have to specify the module themselves; we cannot assume too much about the project structure. See: https://pandera.readthedocs.io/en/stable/dataframe_schemas.html
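For illustration, a minimal sketch of how a dotted path declared in the catalog could be resolved with `importlib.import_module`; the helper name and the example path are hypothetical, not part of the plugin:

```python
from importlib import import_module


def load_schema_from_path(dotted_path: str):
    """Resolve e.g. 'my_package.pandera.schema.MyIrisSchema' to the schema object."""
    module_path, _, attribute = dotted_path.rpartition(".")
    module = import_module(module_path)  # import the module part of the path
    return getattr(module, attribute)    # fetch the schema object from it


# hypothetical usage, assuming such a class exists in the user's src/ package
iris_schema = load_schema_from_path("my_package.pandera.schema.MyIrisSchema")
```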
TODO
We can add a configuration file:
Whatever we decide, this should likely be configurable.
Validation can happen during `kedro run` because data will be loaded in memory, but maybe a CLI to check a specific dataset?

Super nice work @Galileo-Galilei, I'm super keen to help get this off the ground. I'm keen to write up my thoughts in detail later on, but I wanted to point to the built-in methods which we should leverage here:
Yes, that's what I have been playing with. For many reasons I'll discuss later, `from_yaml` is hard to use but `deserialize_schema` seems the way to go.
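For context, a minimal round-trip sketch with pandera's `io` helpers (which need the `pandera[io]` extra installed); the column is made up and this is not the plugin's code, just the serialization path under discussion:

```python
import pandera as pa
from pandera.io import deserialize_schema, serialize_schema

# Build a schema in Python, turn it into a plain dict (the YAML-friendly form
# that could live in catalog metadata), then rebuild the schema object from it.
schema = pa.DataFrameSchema({"sepal_length": pa.Column(float, nullable=False)})
serialized = serialize_schema(schema)      # plain, YAML-serialisable dict
rebuilt = deserialize_schema(serialized)   # back to a DataFrameSchema object
```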
Not too much time this week but I'll resume next week and suggest an MVP. Hopefully we can release a 0.1.0 version next week. If you want to go on without me, feel free, @noklam has the rights on the repo.
Thank you again for kicking this off @Galileo-Galilei, I've got a few years' worth of thoughts on this topic so would love to talk about the things I'd like to see in this space.
Also I'm interested in the Frictionless schema standard Pandera has started to support as well, it looks like early days - but I do love an open standard.
As per your thoughts on YAML vs Python, I think we're going to have to manage 3 patterns; users will inevitably want all 3 for different reasons:

- Python schemas living in `src` (although pretty hard for us to control on a plug-in level).
- The `dbt` pattern of running checks on persisted data: you manage the performance penalty in a different way, and some workflows are better suited to something like a nightly check rather than a runtime one.
- A `metadata` integration like the one you suggest above feels really neat.

Other things I'd like to see:

- A `kedro catalog validate` command where all tests on non-memory datasets are run. Bonus points if this somehow scoops up the Python API tests as well as the YAML declarative ones.
- Maybe a `catalog.validate()` method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic `validators` structure: `catalog.validate("dataset_name", SchemaClass)` ... (sketched below)
- Data docs: `dbt` has had this for years and it's just a no-brainer, we could easily generate static docs describing what data is in the catalog, associated metadata and tests.

All in all I'm super excited to help get this off the ground; @noklam, if you could make me an editor that would be great. I'm also going to tag my colleague @mkinegm who has used Pandera a lot at QB and has some very well thought out ideas on the topic. Medium term we should also validate our ideas/roadmap with Niels the maintainer once they're a bit more concrete :).
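A rough sketch of what such a `catalog.validate()`-style helper could look like; the function name, its signature and its behaviour are all assumptions for discussion, not an existing API:

```python
from typing import Any, Optional

import pandera as pa
from kedro.io import DataCatalog


def validate_catalog_entry(
    catalog: DataCatalog, name: str, schema: Optional[pa.DataFrameSchema] = None
) -> Any:
    """Load one catalog entry and validate it against an explicitly passed schema."""
    data = catalog.load(name)
    if schema is not None:
        return schema.validate(data)  # returns the validated (possibly coerced) data
    return data


# hypothetical usage with a schema object defined elsewhere:
# validated = validate_catalog_entry(catalog, "iris", iris_schema)
```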
Very nice thoughts about this. I think it's already worth creating more specific issues for some of them!
Some quick comments:
On `frictionless`: I like the idea too, but I had a lot of import issues while creating the MVP, so the python package may be hard to handle.

On validating a dataset on demand with `pandera`: this is likely the very next thing to work on! Something like:

```python
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
```

With the same logic, maybe a CLI command `kedro pandera validate` would be helpful too; I guess you sometimes just want to check a new dataset quickly (a rough sketch follows below).

On data docs: maybe through `kedro-viz`, but I have absolutely no idea of how to make this work. If we end up talking with pandera maintainers themselves to validate the roadmap that could be great, but we are not even close for now :)
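Purely for illustration, a sketch of how a plugin could expose such a `kedro pandera validate` command as a click group; the command name, the metadata access and the assumption that the schema is already a `DataFrameSchema` object are all hypothetical:

```python
from pathlib import Path

import click
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


@click.group(name="pandera")
def pandera_commands():
    """Commands for the (hypothetical) kedro-pandera plugin."""


@pandera_commands.command()
@click.argument("dataset_name")
def validate(dataset_name):
    """Load one dataset and validate it against the schema declared in its metadata."""
    bootstrap_project(Path.cwd())
    with KedroSession.create() as session:
        context = session.load_context()
        data = context.catalog.load(dataset_name)
        # assumes metadata["pandera"]["schema"] already holds a DataFrameSchema object
        schema = context.catalog._get_dataset(dataset_name).metadata["pandera"]["schema"]
        schema.validate(data)
        click.echo(f"'{dataset_name}' passed validation")
```

Wiring the group into the `kedro` CLI would go through the usual plugin entry points.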
When using `coerce=True` with Pandera schemas, validating a dataset may also change the data content.

I think it would be great if the datasets that are validated are passed as inputs or outputs using their coerced+validated outputs. However, the `before_pipeline_run` hook doesn't seem to allow changing the actual dataset being passed to a node.
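For reference, a minimal standalone example of the behaviour being discussed (the column name and values are made up): with `coerce=True`, the dataframe returned by `validate` carries the converted dtypes, and the open question is which version Kedro should hand to the node.

```python
import pandas as pd
import pandera as pa

# Toy schema: coerce the "quantity" column to integers during validation.
schema = pa.DataFrameSchema({"quantity": pa.Column(int, coerce=True)})

raw = pd.DataFrame({"quantity": ["1", "2", "3"]})  # e.g. read from CSV as strings
validated = schema.validate(raw)                   # returns the coerced dataframe

print(validated["quantity"].dtype)  # int64 after coercion
```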
Just to be sure I get it well, there are 2 separate points here:

When you call `df_schema.validate(df, coerce=True)`, the dataset is modified in place to convert types. This does not work in `kedro-pandera` (the coercion applies only for validation but the dataset is unmodified, i.e. the pandera schema converts a field to `float` but the node uses the data as `int`). You'd like to modify the dataset on the fly so the node receives the same types as the ones which are tested against the dataset. Do I understand correctly?
@Galileo-Galilei In both cases, what I am looking for is that after loading or saving the dataset, the dataset I will work with is the modified dataframe, because of coercion. That way, if a field is coerced to `int`, I can be sure that the data that was saved also saved it as an `int`. I don't think I need 2 different outputs or to choose between them dynamically. In my mind, the only datasets that go into a node, or are saved to locations, are the validated and possibly modified datasets.
I would like to make a PR that changes where exactly the validations happen in `kedro-pandera` to illustrate what I mean.
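To make the idea concrete, here is a rough sketch of one possible approach (my assumption, not the plugin's actual implementation): a `before_node_run` hook implementation can return a dict of dataset names to values, which Kedro uses to override the corresponding node inputs, so the node would receive the coerced frames.

```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class CoercingValidationHook:
    """Sketch: validate node inputs and pass the coerced frames on to the node."""

    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any]
    ) -> Dict[str, Any]:
        coerced = {}
        for name, data in inputs.items():
            metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
            schema = metadata.get("pandera", {}).get("df_schema")  # assumed key layout
            if schema is not None:
                # validate() returns the validated (and, with coerce=True, converted) data
                coerced[name] = schema.validate(data)
        # returning a dict from before_node_run overrides the matching node inputs
        return coerced
```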
Description
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do and what shouldn't it do? Why is it important to you?