Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

Feat/pandera hook #13

Closed Galileo-Galilei closed 1 year ago

Galileo-Galilei commented 1 year ago

Description

The goal of this PR is to create an MVP that makes runtime data validation easy to enable in an existing kedro project.

Development notes

The goal is to enable the following user workflow in a kedro project:

  1. Generate a configuration file with the schema of a dataset you want to check at runtime (see the pandera sketch after this list). I made the following changes:

    • rename infer_schema to infer_dataset_schema because of an import conflict with the pandera function of the same name
    • write to base instead of local by default, because of a bug with OmegaConfigLoader which cannot resolve the key -> an issue should be opened in kedro's repo
    • write the schema to a file catalog_{dataset_name}.yaml with an extra _{dataset_name} key to enable discovery in the catalog
  2. Let catalog datasets accept an extra metadata.pandera.schema key which is resolved into a pandera DataFrameSchema. I made the following changes:

    • add custom resolvers (pa.dict and pa.yaml) which can parse another catalog key and convert it to a DataFrameSchema
    • add an after_context_created hook to register these resolvers automatically
  3. Add runtime validation each time a session is run (see the hook sketch after this list). I made the following changes:

    • add a before_node_run hook which looks for a DataFrameSchema on the dataset (under the metadata.pandera.schema key) and validates the data passed to the node against it
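
To make this concrete, here is a minimal sketch of the pandera primitives this workflow builds on, independent of kedro (the dataframe is illustrative, and the yaml round trip assumes pandera's io/yaml dependencies are installed):

```python
import pandas as pd
import pandera as pa

# Illustrative data standing in for a catalog dataset
iris = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

# Point 1: infer a schema from the data and serialize it to yaml,
# roughly what the infer_dataset_schema CLI command wraps
schema = pa.infer_schema(iris)
yaml_schema = schema.to_yaml()  # returns the yaml string when no path is given

# Points 2/3: turn the yaml back into a DataFrameSchema and validate a dataframe,
# which is what the resolvers and the before_node_run hook do at runtime
loaded_schema = pa.DataFrameSchema.from_yaml(yaml_schema)
loaded_schema.validate(iris)  # raises a SchemaError on invalid data
```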
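
And a rough sketch of how the resolver registration (point 2) and runtime validation (point 3) could fit together. This is a simplified illustration, not the exact implementation in this PR; it assumes kedro 0.18-style hook specs, dataset metadata support, and pandera.io.deserialize_schema for the dict-to-schema conversion:

```python
import pandera as pa
from kedro.framework.hooks import hook_impl
from omegaconf import OmegaConf
from pandera import io as pandera_io


def _resolve_schema_from_dict(schema_dict: dict) -> pa.DataFrameSchema:
    """Convert a schema declared as a dict in the catalog into a DataFrameSchema."""
    return pandera_io.deserialize_schema(schema_dict)


class PanderaHook:
    @hook_impl
    def after_context_created(self, context):
        # Register the custom resolver so that ${pa.dict:...} entries in the catalog
        # are converted to DataFrameSchema objects when the config is loaded.
        OmegaConf.register_new_resolver("pa.dict", _resolve_schema_from_dict, replace=True)

    @hook_impl
    def before_node_run(self, node, catalog, inputs):
        # Validate every input that carries a schema under metadata.pandera.schema.
        for name, data in inputs.items():
            dataset = catalog._get_dataset(name)  # kedro 0.18.x (private) accessor
            metadata = getattr(dataset, "metadata", None) or {}
            schema = metadata.get("pandera", {}).get("schema")
            if schema is not None:
                schema.validate(data)  # raises a SchemaError on invalid data
```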

I also did the following extra actions:


codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 79.62% and project coverage change: -11.46% :warning:

Comparison is base (a35e978) 100.00% compared to head (280f088) 88.54%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              main      #13       +/-   ##
============================================
- Coverage   100.00%   88.54%   -11.46%
============================================
  Files            3        5        +2
  Lines           52       96       +44
============================================
+ Hits            52       85       +33
- Misses           0       11       +11
```

| Files Changed | Coverage Δ |
|---|---|
| kedro_pandera/framework/hooks/pandera_hook.py | 57.69% <57.69%> (ø) |
| kedro_pandera/__init__.py | 100.00% <100.00%> (ø) |
| kedro_pandera/framework/cli/cli.py | 100.00% <100.00%> (ø) |
| kedro_pandera/framework/config/resolvers.py | 100.00% <100.00%> (ø) |
| tests/conftest.py | 100.00% <100.00%> (ø) |

:umbrella: View full report in Codecov by Sentry.

Galileo-Galilei commented 1 year ago

@datajoely as you can see in the diff (or maybe not, this is becoming far too big) this is what was originally done, but I faced several issues:

Tests should likely be refactored in the future to find a better way to do this, but I'm trying to be pragmatic in this PR and keep the changes relatively "small" (which seems to be a failure 😅).

datajoely commented 1 year ago

Good reasons! I'm happy either way - the only real value of a dynamic/configurable integration test is catching future regressions more easily, but that's deffo a premature optimization!

noklam commented 1 year ago

@Galileo-Galilei Is there anything in particular you would like me to review/test? I see this is still in draft - is there anything I can do to help push this over the finish line?

Galileo-Galilei commented 1 year ago

Hi @noklam, I think this is almost good to go. I want to write tests and rebase to make the commits more readable, but if you want I can merge it today without all the tests so you can work in parallel without too many merge conflicts to resolve.

I think the main next step is to support DataFrameModel schemas written in Python, and not only YAML configuration, which is very restrictive. Do you have another feature in mind?
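
For reference, the Python counterpart would be a DataFrameModel class along these lines (a minimal sketch with an illustrative schema, assuming a pandera version where DataFrameModel is available; older releases call it SchemaModel):

```python
import pandera as pa
from pandera.typing import Series


class IrisSchema(pa.DataFrameModel):
    """Illustrative schema, equivalent to what a yaml configuration declares today."""

    sepal_length: Series[float] = pa.Field(ge=0)
    species: Series[str] = pa.Field(isin=["setosa", "versicolor", "virginica"])

    class Config:
        coerce = True
```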

datajoely commented 1 year ago

FYI - our teams that have adopted pandera internally have leveraged the Python approach on the functions called by the nodes. In this situation Kedro isn't really aware of it, but our plugin could introspect the function bound to the node object and bring these schemas into scope.

datajoely commented 1 year ago

Additionally - shall we make a little feature backlog, because we can split up the work?

noklam commented 1 year ago

@Galileo-Galilei That's fine. This is all I need; I can start working on that tomorrow and will branch off from this PR. I don't expect major changes, so I can rely on the existing structure.

@datajoely is that the @check_types annotation?

datajoely commented 1 year ago

@noklam yes exactly, the pattern in use looks something like this:

```
project/
├─ data_processing/
│  ├─ nodes.py  <--- functions called by the nodes declare schemas via type hints
│  ├─ pipelines.py
├─ data_science/
│  ├─ ...
├─ schemas/
│  ├─ raw_preprocessing.py
│  ├─ prm_features.py
```

Outside of the node definition, where the bound function is defined, the signature looks something like this:

```python
import pandera as pa
from pandera.typing import DataFrame

# RawDataSchema is assumed to be defined in e.g. schemas/raw_preprocessing.py
from schemas.raw_preprocessing import RawDataSchema


@pa.check_types
def preprocess_data(raw_data: DataFrame) -> DataFrame[RawDataSchema]:  # further inputs elided
    ...
```

Now this is independent of the Kedro catalog, but we could tie it back to the Kedro world through introspection.
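
A possible shape for that introspection, as a minimal sketch (the helper name is made up; it assumes schemas are declared through pandera.typing.DataFrame[...] hints as above):

```python
import typing

from pandera.typing import DataFrame


def extract_pandera_schemas(func) -> dict:
    """Map a function's annotated arguments (and return value) to their pandera models."""
    schemas = {}
    for name, hint in typing.get_type_hints(func).items():
        if typing.get_origin(hint) is DataFrame:
            (model,) = typing.get_args(hint)
            schemas[name] = model
    return schemas


# For a kedro Node object this could be called as extract_pandera_schemas(node.func);
# with the preprocess_data example above it would return {"return": RawDataSchema}.
```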

Galileo-Galilei commented 1 year ago

Thanks @datajoely for the information about the check_types decorator. I think this is the "right way" to integrate pandera with standard functions (mostly because when you don't have the dataset/pipeline abstractions, you want to be protected against running the function on invalid data), but I don't think this is necessarily what we should encourage. Conceptually, a schema belongs to a dataset, and we should keep the decoupling between I/O and compute that kedro offers so we retain more flexibility (e.g. to turn validation off, to validate data manually rather than at runtime, to visualize data schemas, to document the pipeline...).

> Additionally - shall we make a little feature backlog, because we can split up the work?

Absolutely agree on this. I suggest sticking to what you do in the kedro repo:

At this stage (no users, not compatible with any public kedro version, no public announcement that the plugin exists, 8 stars), I am really comfortable with having a fast development pace, breaking things and iterating, so feel free to go ahead and merge without my validation. Having an issue describing roughly what is at stake is enough for synchronization for now, IMHO.

@noklam I've just rebased, so it may have messed up your history. It's likely better to start from the main branch; I'll proceed with merging it.

noklam commented 1 year ago

I'll just go ahead and create some issues first - I will fill those issues in eventually, as I have a habit of reading/writing my issues during my commute without internet.

Agreed - at this point it's small, so let's keep the administrative work to a minimum.