JakobGM / patito

A data modelling layer built on top of polars and pydantic
MIT License

Conversion between Polars -> Patito DataFrames and back #10

Open alexthoden opened 1 year ago

alexthoden commented 1 year ago

The functionality of this package is awesome, but for the use case my team and I have, it's rendered essentially useless because a patito.DataFrame can't be converted back to a plain polars.DataFrame. This feature would be a huge help!

cbb330 commented 1 year ago

Patito offers patito.DataFrame, a class that extends polars.DataFrame in order to provide utility methods related to patito.Model. The schema of a data frame can be specified at runtime by invoking patito.DataFrame.set_model(model), after which a set of contextualized methods become available:

The two types seem to be doing different things: a patito dataframe exposes APIs for schema validation and for managing the data in relation to that schema, while a polars dataframe is for transformations, selections, and input/output.

The real problem seems to be that the two classes should have different names, so that this distinction is clearer.

But in your case, I'm wondering what code you have that cannot work around this distinction?

alexthoden commented 1 year ago

I apologize for this; I'm relatively new to pyarrow and didn't realize it preserves dtypes through to polars. The use case my team and I have is using Patito as a data validation step prior to transforming our data. We often struggle with polars joins and explodes on columns that have erroneous data types, so we were looking for a way to quickly validate data, then hand it off to the next process for transformation while keeping the type casts. I did not realize I could simply convert to arrow, then back to polars, after validating with patito. Thanks for your response, and I apologize again for not doing proper research before asking this question!

ion-elgreco commented 11 months ago

I ran into an issue caused by the DataFrame subclassing. The quickest zero-copy workaround, I guess, is to do this: pl.DataFrame(Model.examples().to_arrow())

GeorgePearse commented 1 month ago

I also think this would be useful (or at least cleaner code-wise), and pretty straightforward to implement?

Then patito is simply polars with pydantic validation, which I feel is a very clean thing to describe to users.

image

^ Bits like this are just very nice added functionality on top of polars, not so much for validation.