Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

Feat/pandera hook #13

Closed Galileo-Galilei closed 1 year ago

Galileo-Galilei commented 1 year ago

Description

The goal of this PR is to create an MVP that makes runtime data validation easy to enable in an existing kedro project.

Development notes

The goal is to enable the following user workflow in a kedro project:

  1. Generate a configuration file with the schema of a dataset you want to check at runtime (see the pandera sketch after this list). I made the following changes:

    • rename infer_schema to infer_dataset_schema because of an import conflict with the pandera function of the same name
    • write to base instead of local by default, because of a bug with OmegaConfigLoader which cannot resolve the key -> an issue should be opened in kedro's repo
    • write the schema to a file catalog_{dataset_name}.yaml with an extra _{dataset_name} key to enable discovery in the catalog
  2. Let catalog datasets accept an extra metadata.pandera.schema key which is resolved into a pandera DataFrameSchema. I made the following changes:

    • add custom resolvers (pa.dict and pa.yaml) which can parse another catalog key and convert it to a DataFrameSchema
    • add an after_context_created hook to register these resolvers automatically
  3. Add runtime validation each time a session is run (see the hook sketch after this list). I made the following changes:

    • add a before_node_run hook which looks for a DataFrameSchema on the dataset (under the metadata.pandera.schema key) and validates the data passed to the node against it
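
To make this concrete, here is a minimal sketch of the pandera primitives this workflow builds on, independent of kedro (the dataframe is illustrative, and the yaml round trip assumes pandera's io/yaml dependencies are installed):

```python
import pandas as pd
import pandera as pa

# Illustrative data standing in for a catalog dataset
iris = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

# Point 1: infer a schema from the data and serialize it to yaml,
# roughly what the infer_dataset_schema CLI command wraps
schema = pa.infer_schema(iris)
yaml_schema = schema.to_yaml()  # returns the yaml string when no path is given

# Points 2/3: turn the yaml back into a DataFrameSchema and validate a dataframe,
# which is what the resolvers and the before_node_run hook do at runtime
loaded_schema = pa.DataFrameSchema.from_yaml(yaml_schema)
loaded_schema.validate(iris)  # raises a SchemaError on invalid data
```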
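
And a rough sketch of how the resolver registration (point 2) and runtime validation (point 3) could fit together. This is a simplified illustration, not the exact implementation in this PR; it assumes kedro 0.18-style hook specs, dataset metadata support, and pandera.io.deserialize_schema for the dict-to-schema conversion:

```python
import pandera as pa
from kedro.framework.hooks import hook_impl
from omegaconf import OmegaConf
from pandera import io as pandera_io


def _resolve_schema_from_dict(schema_dict: dict) -> pa.DataFrameSchema:
    """Convert a schema declared as a dict in the catalog into a DataFrameSchema."""
    return pandera_io.deserialize_schema(schema_dict)


class PanderaHook:
    @hook_impl
    def after_context_created(self, context):
        # Register the custom resolver so that ${pa.dict:...} entries in the catalog
        # are converted to DataFrameSchema objects when the config is loaded.
        OmegaConf.register_new_resolver("pa.dict", _resolve_schema_from_dict, replace=True)

    @hook_impl
    def before_node_run(self, node, catalog, inputs):
        # Validate every input that carries a schema under metadata.pandera.schema.
        for name, data in inputs.items():
            dataset = catalog._get_dataset(name)  # kedro 0.18.x (private) accessor
            metadata = getattr(dataset, "metadata", None) or {}
            schema = metadata.get("pandera", {}).get("schema")
            if schema is not None:
                schema.validate(data)  # raises a SchemaError on invalid data
```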

I also did the following extra actions:


codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 79.62% and project coverage change: -11.46% :warning:

Comparison is base (a35e978) 100.00% compared to head (280f088) 88.54%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              main      #13       +/-   ##
============================================
- Coverage   100.00%   88.54%   -11.46%
============================================
  Files            3        5        +2
  Lines           52       96       +44
============================================
+ Hits            52       85       +33
- Misses           0       11       +11
```

| Files Changed | Coverage Δ |
|---|---|
| kedro_pandera/framework/hooks/pandera_hook.py | 57.69% <57.69%> (ø) |
| kedro_pandera/__init__.py | 100.00% <100.00%> (ø) |
| kedro_pandera/framework/cli/cli.py | 100.00% <100.00%> (ø) |
| kedro_pandera/framework/config/resolvers.py | 100.00% <100.00%> (ø) |
| tests/conftest.py | 100.00% <100.00%> (ø) |

:umbrella: View full report in Codecov by Sentry.

Galileo-Galilei commented 1 year ago

@datajoely as you can see in the diff (or maybe not, this is becoming far too big) this is what was originally done, but I faced several issues:

Tests should likely be refactored in the future to find a better way to do this, but I'm trying to be pragmatic in this PR and keep the changes relatively "small" (which seems to be a failure 😅).

datajoely commented 1 year ago

Good reasons! I'm happy either way - the only real value of a dynamic/configurable integration test is catching future regressions more easily, but that's deffo a premature optimization!

noklam commented 1 year ago

@Galileo-Galilei Is there anything in particular you would like me to review/test? I see this is still in draft - is there anything I can do to help push this over the finish line?

Galileo-Galilei commented 1 year ago

Hi @noklam, I think this is almost good to go. I want to write tests and rebase to make the commits more readable, but if you want I can merge it today without all the tests so you can work in parallel without too many merge conflicts to resolve.

I think the main next step is to support DataFrameModel schemas written in Python, and not only YAML configuration, which is very restrictive. Do you have another feature in mind?
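
For reference, the Python counterpart would be a DataFrameModel class along these lines (a minimal sketch with an illustrative schema, assuming a pandera version where DataFrameModel is available; older releases call it SchemaModel):

```python
import pandera as pa
from pandera.typing import Series


class IrisSchema(pa.DataFrameModel):
    """Illustrative schema, equivalent to what a yaml configuration declares today."""

    sepal_length: Series[float] = pa.Field(ge=0)
    species: Series[str] = pa.Field(isin=["setosa", "versicolor", "virginica"])

    class Config:
        coerce = True
```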

datajoely commented 1 year ago

FYI - our teams that have adopted pandera internally have leveraged the Python approach on the functions called by the nodes. In this situation Kedro isn't really aware of it, but our plugin could introspect the function bound to the node object and bring these schemas into scope.

datajoely commented 1 year ago

Additionally - shall we make a little feature backlog, because we can split up the work?

noklam commented 1 year ago

@Galileo-Galilei That's fine. This is all I need; I can start working on that tomorrow and will branch off from this PR. I don't expect major changes, so I can rely on the existing structure.

@datajoely is that the @check_types annotation?

datajoely commented 1 year ago

@noklam yes exactly, the pattern in use looks something like this:

```
project/
├─ data_processing/
│  ├─ nodes.py  <--- functions called by the nodes declare schemas via type hints
│  ├─ pipelines.py
├─ data_science/
│  ├─ ...
├─ schemas/
│  ├─ raw_preprocessing.py
│  ├─ prm_features.py
```

Outside of the node definition, where the bound function is defined, the signature looks something like this:

```python
import pandera as pa
from pandera.typing import DataFrame

# RawDataSchema is assumed to be defined in e.g. schemas/raw_preprocessing.py
from schemas.raw_preprocessing import RawDataSchema


@pa.check_types
def preprocess_data(raw_data: DataFrame) -> DataFrame[RawDataSchema]:  # further inputs elided
    ...
```

Now this is independent of the Kedro catalog, but we could tie it back to the Kedro world through introspection.
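
A possible shape for that introspection, as a minimal sketch (the helper name is made up; it assumes schemas are declared through pandera.typing.DataFrame[...] hints as above):

```python
import typing

from pandera.typing import DataFrame


def extract_pandera_schemas(func) -> dict:
    """Map a function's annotated arguments (and return value) to their pandera models."""
    schemas = {}
    for name, hint in typing.get_type_hints(func).items():
        if typing.get_origin(hint) is DataFrame:
            (model,) = typing.get_args(hint)
            schemas[name] = model
    return schemas


# For a kedro Node object this could be called as extract_pandera_schemas(node.func);
# with the preprocess_data example above it would return {"return": RawDataSchema}.
```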

Galileo-Galilei commented 1 year ago

Thanks @datajoely for the information about the check_types decorator. I think this is the "right way" to integrate pandera with standard functions (mostly because when you don't have the dataset/pipeline abstractions, you want to be protected against running the function on invalid data), but I don't think this is necessarily what we should encourage. Conceptually, a schema belongs to a dataset, and we should keep the decoupling between I/O and compute that kedro offers so we retain more flexibility (e.g. to turn validation off, to validate data manually rather than at runtime, to visualize data schemas, to document the pipeline...).

> Additionally - shall we make a little feature backlog, because we can split up the work?

Absolutely agree on this. I suggest sticking to what you do in the kedro repo:

At this stage (no users, not compatible with any public kedro version, no public announcement that the plugin exists, 8 stars), I am really comfortable with having a fast development pace, breaking things and iterating, so feel free to go ahead and merge without my validation. Having an issue describing roughly what is at stake is enough for synchronization for now, IMHO.

@noklam I've just rebased, so it may have messed up your history. It's likely better to start from the main branch; I'll proceed with merging it.

noklam commented 1 year ago

I'll just go ahead and create some issues first - I will fill those issues in eventually, as I have a habit of reading/writing my issues during my commute without internet.

Agreed - at this point it's small, so let's keep the administrative work to a minimum.