Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

Enable Offline Data Check with Jupyter #17

Open noklam opened 1 year ago

noklam commented 1 year ago

Description

Enable data checking in Jupyter Notebook.

Quotes from https://github.com/Galileo-Galilei/kedro-pandera/issues/12:

I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure. We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...

Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly but still easy):

  # Load the data, then validate it against the schema stored in the dataset's pandera metadata
  data = catalog.load("dataset_name")
  catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)

With the same logic, maybe a CLI command kedro pandera validate would be helpful too; I guess you sometimes just want to check a new dataset quickly.

Context

This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.

In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.

Possible Implementation

It is already possible to validate data against a schema defined in the catalog under the pandera metadata key.

In addition to schema.validate, pandera also supports decorators for pipelines. This requires inspecting each function signature and then working out which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline, and we should start with notebooks first.)
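For reference, the signature inspection mentioned above could look roughly like the following. This is only a sketch under assumptions: `inputs_with_checks`, the `inputs` mapping (parameter name to dataset name) and the `schemas` mapping (dataset name to schema) are hypothetical stand-ins for what a plugin would read from the pipeline and the catalog, and the schema is a plain string so the example runs without pandera.

```python
import inspect


def inputs_with_checks(func, inputs, schemas):
    """Return {parameter name: schema} for parameters whose dataset has a schema.

    Hypothetical helper: `inputs` maps parameter names to dataset names and
    `schemas` maps dataset names to schema objects.
    """
    params = inspect.signature(func).parameters
    return {
        name: schemas[inputs[name]]
        for name in params
        if name in inputs and inputs[name] in schemas
    }


def clean(raw_sales, parameters):
    """Example node function; only `raw_sales` has a registered check."""
    return raw_sales


checked = inputs_with_checks(
    clean,
    inputs={"raw_sales": "sales_2023", "parameters": "params:clean"},
    schemas={"sales_2023": "SalesSchema"},  # placeholder for a real schema object
)
print(checked)
```

Only the parameters that map to a dataset with a registered schema are returned, so a decorator would know which arguments to validate before calling the function.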

There are a few options:

  1. Monkeypatch a catalog.validate method
  2. Inherit from the current DataCatalog class; this requires a change in settings.py to enable it.
  3. kedro_pandera.validate(catalog, schema)
  4. ??
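Options 1 and 3 could share a single function. The following is a minimal sketch, not kedro-pandera's actual API: `validate`, `_get`, `PositiveNumbers` and `FakeCatalog` are hypothetical names, the catalog and schema are stand-ins so the example runs without Kedro or pandera, and the `_data_sets` lookup follows the snippet quoted in the description. It only assumes that the schema exposes a `validate(data)` method, which both pandera's `DataFrameSchema` objects and `DataFrameModel` classes do.

```python
from types import SimpleNamespace


def _get(container, key):
    """Read `key` from a dict or as an attribute (the metadata shape is an assumption)."""
    return container[key] if isinstance(container, dict) else getattr(container, key)


def validate(catalog, dataset_name, schema=None):
    """Load a dataset and validate it (hypothetical helper, option 3).

    Without an explicit `schema`, fall back to the one stored under the
    dataset's pandera metadata key.
    """
    data = catalog.load(dataset_name)
    if schema is None:
        # Private attribute, as in the quoted snippet; the name may differ across Kedro versions.
        metadata = catalog._data_sets[dataset_name].metadata
        schema = _get(_get(metadata, "pandera"), "schema")
    return schema.validate(data)


class PositiveNumbers:
    """Stand-in schema: anything with a `validate(data)` method works."""

    @staticmethod
    def validate(data):
        if any(value <= 0 for value in data):
            raise ValueError("found a non-positive value")
        return data


class FakeCatalog:
    """Stand-in for a DataCatalog, just enough for this example."""

    def __init__(self, datasets):
        self._data_sets = datasets

    def load(self, name):
        return self._data_sets[name].data


catalog = FakeCatalog({
    "numbers": SimpleNamespace(
        data=[1, 2, 3],
        metadata={"pandera": {"schema": PositiveNumbers}},
    ),
    "raw": SimpleNamespace(data=[1, -2], metadata={}),
})

# Option 3: standalone function, schema taken from the catalog metadata.
validate(catalog, "numbers")

# Option 1: monkeypatch the same function onto the catalog class.
FakeCatalog.validate = validate
catalog.validate("numbers", PositiveNumbers)
```

Because the function takes the catalog as its first argument, the same code serves as a free function and as a patched method. Option 2 (a DataCatalog subclass) could define the same method directly.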

Possible Alternatives

TBD