delta-io / delta-sharing

An open protocol for secure data sharing
https://delta.io/sharing
Apache License 2.0
720 stars, 154 forks

Support for load_as_pyarrow_dataset or load_as_pyarrow_table #238

Open chitralverma opened 1 year ago

chitralverma commented 1 year ago

This is a feature request, or rather a small refactoring of the reader code, to let users read shared tables directly as pyarrow datasets and tables.

As the reader code shows, we already create the pyarrow dataset and table, which are then converted to a pandas DataFrame in the to_pandas method.

I would like to refactor this part and expose it as two separate functions: to_pyarrow_dataset and to_pyarrow_table.

The advantage of this refactoring is that users can get the pyarrow objects directly and efficiently, without an additional full copy or conversion to a pandas DataFrame when they don't need one. It would also allow delta-sharing to be extended to other processing systems such as DataFusion and Polars, since they all rely extensively on pyarrow datasets.

Please let me know if this issue makes sense to you; I can raise a PR for it within a day or so.

Note: existing functionality will remain unaffected by this refactoring.

jacobmarble commented 1 year ago

I was googling "delta sharing polars" and found this issue. 👍