Galileo-Galilei / kedro-pandera

A kedro plugin to use pandera in your kedro projects
https://kedro-pandera.readthedocs.io/en/latest/
Apache License 2.0

[QUESTION] Could integration with Ibis be supported? #88

Open galenseilis opened 1 week ago

galenseilis commented 1 week ago

Description

I am exploring using a combo of Ibis, Kedro, and Pandera if that's possible.

Context

With Ibis I could write more consistent dataframe code regardless of the backend (e.g. Polars, MSSQL, or PostgreSQL), get better performance than Pandas, and solve the parametrization problem that comes with integrating Python and SQL. With Kedro I get consistent data science project structures. With Pandera I get dataframe validation. Anyone who cares about those things would similarly benefit from Kedro-Pandera integrating with Ibis.

I would like something highly similar to what I see in the Kedro-Pandera plugin's documentation, except to also support Ibis datasets.

Possible Implementation

I'm not currently familiar with the internals of Kedro-Pandera, so my suggestion is limited by that lack of understanding.

Because Kedro-Pandera is responsible for integrating Kedro and Pandera, the implementation should depend on the current behaviour of Kedro, Pandera, and Ibis rather than modifying their behaviour.

I've noted that Pandera supports Polars in addition to Pandas, however Ibis has its own classes that I do not expect Pandera to have support for. Rather, the implementation could take advantage of the fact that the Ibis dataframe objects will have either of to_pandas or to_polars.
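
The conversion idea could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `to_validatable_frame` and `validate_via_conversion` are hypothetical and not part of Kedro-Pandera, and the code relies only on the table exposing `to_polars()` or `to_pandas()` and on the schema exposing `validate()`.

```python
from typing import Any


def to_validatable_frame(table: Any, prefer: str = "polars") -> Any:
    """Materialise an Ibis-like table as a dataframe Pandera already supports.

    Assumption: `table` exposes `to_polars()` and/or `to_pandas()`.
    Nothing here imports Ibis or Pandera directly.
    """
    if prefer == "polars":
        order = ["to_polars", "to_pandas"]
    else:
        order = ["to_pandas", "to_polars"]
    for method_name in order:
        method = getattr(table, method_name, None)
        if callable(method):
            return method()  # executes the expression and loads the result
    raise TypeError(
        f"{type(table).__name__} has neither to_polars nor to_pandas"
    )


def validate_via_conversion(table: Any, schema: Any, prefer: str = "polars") -> Any:
    """Convert the table, then hand the frame to `schema.validate(...)`."""
    frame = to_validatable_frame(table, prefer=prefer)
    return schema.validate(frame)
```

The schema has to match the frame type that comes out of the conversion (a Polars-flavoured Pandera schema for a Polars frame, a pandas one for a pandas frame), and the conversion pulls the full result into memory.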

Here is a summary of the logic I have in mind:

Possible Alternatives

Another option is for me to write a Kedro pipeline for this type of validation instead. This would involve converting the Ibis table dataset to a Polars dataframe myself, loading the schema as a YAML Kedro dataset, and running the Pandera validator against the Polars dataframe.
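
As a rough sketch of that alternative, a validation node could be an ordinary function like the one below. The function name `validate_ibis_table` is hypothetical, and the sketch assumes the table has a `to_polars()` method and that the schema object (however it gets loaded from YAML) has a Pandera-style `validate()` method.

```python
from typing import Any


def validate_ibis_table(table: Any, schema: Any) -> Any:
    """Kedro-style node: validate a snapshot of `table`, pass the table through.

    Assumptions: `table.to_polars()` exists and `schema` is a schema object
    built for Polars frames, loaded however the project prefers.
    """
    frame = table.to_polars()  # materialises the whole table in memory
    schema.validate(frame)     # expected to raise if the data violates the schema
    return table               # downstream nodes keep working with the original table
```

In a pipeline this could be registered with something like `node(validate_ibis_table, inputs=["my_table", "my_schema"], outputs="validated_table")`, where the dataset names are placeholders. Returning the original table rather than the Polars frame keeps the rest of the pipeline on Ibis.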

Galileo-Galilei commented 1 week ago

This is definitely valuable and should be added to the roadmap.

TBH I have had a hard time maintaining the plugins recently, and kedro-pandera is quite inactive. I plan to resume working on it one day, but I can't say when that will be.

I will definitely accept and release PRs, though.

noklam commented 1 week ago

Similar situation: I cannot take on any active development work, but I can spare some time for PR review if someone is willing to spend time on this.

deepyaman commented 1 week ago

@galenseilis Thanks for linking this issue! I wasn't previously aware of it.

I would love to see Kedro + Ibis + Pandera working together as the core foundations of a composable Python-first analytics stack.

I've noted that Pandera supports Polars in addition to Pandas, however Ibis has its own classes that I do not expect Pandera to have support for. Rather, the implementation could take advantage of the fact that the Ibis dataframe objects will have either of to_pandas or to_polars.

I am actively working on supporting Ibis as a backend on Pandera, with the goal of at least getting it to parity with the Polars backend by the end of the year/early Q1 2025. I think the right way to support Ibis here would be to leverage that integration, rather than via conversion to Polars or pandas (although that could be an interim solution, if you're looking for something more immediate), because we want to be able to efficiently validate data on e.g. a database backend, too.

I will admit that I, too, haven't done much looking into Kedro-Pandera yet (I've just taken for granted that the integration exists). I'm pretty sure @datajoely also has some degree of interest in all of these working together.

Cc @cosmicBboy for visibility

cosmicBboy commented 1 week ago

+1 to the native ibis-pandera integration!

Of course dataframe conversion might be nice for convenience, but first-class backend support is ideal for keeping the performance and flexibility benefits of Ibis.