Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

Support DataframeModel and the python API for declaring schema #18

Closed noklam closed 1 year ago

noklam commented 1 year ago

Description

In addition to the YAML API, we should support the class-based API, DataFrameModel (pydantic-style).

Context

TBD

Possible Implementation

TBD

Possible Alternatives

TBD

Galileo-Galilei commented 1 year ago

Hi @noklam, I think there are several sub-tasks in this ticket, and they do not all have the same priority.

  1. I think (to be verified) that if we create a custom resolver which can parse a DataFrameModel, it will be enough for the hook to work "as is" with the exact same syntax. Something like:
my_data: 
    type: ...
    filepath: ...
    metadata: 
        pandera: 
            schema: ${pa.python: my_kedro_package.schemas.my_data.MyDataSchema} # we should "just" create the resolver which will import and instantiate the class

Does the design look ok to you? Do you have time to work on this one?

  2. We can add a CLI to infer this schema and generate a my_kedro_package.schemas.my_data.py file (pseudo code below):
# my_kedro_package.schemas.my_data.py

# DataFrameModel and Field are both exported from the top-level pandera package
from pandera import DataFrameModel, Field

class MyDataSchema(DataFrameModel):
    var1: str = Field()
    var2: <var_type> = Field()  # <var_type> is a placeholder for the inferred dtype
    ...

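
For sub-task 1, the resolver could look roughly like the sketch below. It is only a sketch under assumptions: the function name `resolve_dataframe_model` is hypothetical, and it assumes that the hook expects a DataFrameSchema, so it calls the class's `to_schema()` when that method is available and otherwise returns the imported object unchanged.

```python
import importlib
from typing import Any


def resolve_dataframe_model(dotted_path: str) -> Any:
    """Import the object named by a dotted path such as
    'my_kedro_package.schemas.my_data.MyDataSchema'.

    Hypothetical helper, not part of kedro-pandera or pandera.
    """
    # The YAML example has a space after the colon, so strip whitespace first.
    path = dotted_path.strip()
    module_name, sep, attr_name = path.rpartition(".")
    if not sep or not module_name or not attr_name:
        raise ValueError(
            f"Expected 'package.module.ClassName', got {dotted_path!r}"
        )

    module = importlib.import_module(module_name)
    try:
        obj = getattr(module, attr_name)
    except AttributeError as exc:
        raise ImportError(
            f"Module {module_name!r} has no attribute {attr_name!r}"
        ) from exc

    # Assumption: a pandera DataFrameModel exposes a to_schema() classmethod
    # that returns the equivalent DataFrameSchema. Anything without that
    # method is returned as imported.
    to_schema = getattr(obj, "to_schema", None)
    return to_schema() if callable(to_schema) else obj
```

If the project uses OmegaConf-based configuration, the function could presumably be registered with `OmegaConf.register_new_resolver("pa.python", resolve_dataframe_model)`; where exactly the plugin should do that registration is left open here.
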
noklam commented 1 year ago

How is 2. different from the current infer CLI? I'll work on 1.

Galileo-Galilei commented 1 year ago

The current CLI has a --python flag for this, but it is not implemented. The small difference is that the infer CLI for YAML uses a built-in pandera function which creates basic tests and writes the file, but there is no such function for Python, so we would have to create it on our own. That is why I want to keep it really simple: the goal is to avoid writing boilerplate code by hand, not to infer advanced tests.
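
A minimal sketch of what "really simple" generation could mean is shown below. It is an illustration only: `render_dataframe_model` is a hypothetical helper, it takes a plain mapping of column names to dtype names rather than inspecting a dataframe, and it emits an empty `Field()` per column with no inferred checks.

```python
import keyword


def render_dataframe_model(class_name: str, columns: dict[str, str]) -> str:
    """Render Python source for a bare DataFrameModel subclass.

    `columns` maps column names to dtype names, e.g. {"var1": "str"}.
    Hypothetical helper: no checks are inferred, only boilerplate is written.
    """
    lines = [
        "from pandera import DataFrameModel, Field",
        "",
        "",
        f"class {class_name}(DataFrameModel):",
    ]
    if not columns:
        # Keep the generated class body syntactically valid.
        lines.append("    pass")
    for name, dtype in columns.items():
        # Column names that are not valid attribute names cannot be
        # written as class annotations, so reject them explicitly.
        if not name.isidentifier() or keyword.iskeyword(name):
            raise ValueError(f"Column {name!r} is not a valid Python identifier")
        lines.append(f"    {name}: {dtype} = Field()")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dataframe_model("MyDataSchema", {"var1": "str", "var2": "int"}))
```

A CLI command would then only need to collect the column names and dtypes from the dataset and write the returned string to the schema file.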

noklam commented 1 year ago

"The current CLI has a flag --python for this but it is not implemented"

I am not sure what you mean. I thought this function existed already? Isn't it using the schema.to_script() method?

Pandera natively supports converting DataFrameModel -> DataFrameSchema, but not the other way round.