Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Support for Arrow PyCapsule Interface #2504

Open kylebarron opened 1 month ago

kylebarron commented 1 month ago

Is your feature request related to a problem? Please describe.

The Arrow PyCapsule Interface is a new spec to simplify Arrow interop between compiled Python libraries.

For example, daft has a daft.from_arrow method, but it narrowly accepts only pyarrow Table objects https://github.com/Eventual-Inc/Daft/blob/a5badb4a99a47bbe8ae9b7dde9d8d3d10d2faa1c/daft/table/table.py#L102-L104. It fails on a pyarrow.RecordBatchReader or on any non-pyarrow Arrow object, and it requires a pyarrow dependency, which is quite large.

A from_arrow method that looks for the __arrow_c_stream__ method would work out of the box with any Arrow-based Python library implementing this spec, and hopefully soon with duckdb and polars objects as well (https://github.com/pola-rs/polars/issues/12530).

Implementing the pycapsule interface on daft classes would also mean that any of those other libraries could consume daft objects directly.
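
For illustration, a duck-typed import path might look something like the sketch below. This is only a sketch: `_from_pyarrow_table` is a placeholder for whatever internal constructor daft actually uses, and the pyarrow round-trip is kept only because daft already depends on pyarrow today; a native reader could take the capsule from `obj.__arrow_c_stream__()` directly.

```python
import pyarrow as pa

def from_arrow(obj):
    """Create a daft table from any object speaking the Arrow PyCapsule interface."""
    if hasattr(obj, "__arrow_c_stream__"):
        # Works the same way for pyarrow, nanoarrow, arro3, and (eventually)
        # polars/duckdb objects. Going through pyarrow is just the shortest
        # path while daft still depends on it.
        reader = pa.RecordBatchReader.from_stream(obj)  # recent pyarrow
        return _from_pyarrow_table(reader.read_all())
    if isinstance(obj, pa.Table):
        # existing behavior
        return _from_pyarrow_table(obj)
    raise TypeError(f"Expected an Arrow-compatible object, got {type(obj)!r}")
```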

Describe the solution you'd like

Describe alternatives you've considered

Additional context

I've implemented import and export of the pycapsule interface for arrow objects in pyo3-arrow (in a separate crate for a few reasons).

It looks like daft is still tied to arrow2, but maybe pyo3-arrow is useful as a reference here.

jaychia commented 1 month ago

Very cool. I was just chatting with @paleolimbot about whether Daft could remove its dependency on PyArrow at some point and rely on something lighter-weight to do import/export of arrow data.

Today it's technically possible (we can treat PyArrow as an optional dependency for the most part, except for some file-writing code where we still rely on PyArrow utilities), but requiring PyArrow anywhere on the critical path is definitely a heavy dependency.

If I understand correctly, there are a few issues to be fixed here:

  1. On the import side: fix from_arrow to check hasattr(obj, "__arrow_c_stream__") and use this API to import Arrow data when it's available
  2. On the export side: add an __arrow_c_stream__ method as well, to return a stream of Arrow data instead of a fully materialized Arrow table (rough sketch below)

Am I understanding this correctly?
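
For (2), a rough Python-level shape might be something like this sketch (eager and simplified: it assumes an existing to_arrow() and delegates to pyarrow's own capsule export, whereas a real implementation would build the stream on the Rust side so nothing is fully materialized):

```python
class Table:
    def __arrow_c_stream__(self, requested_schema=None):
        """Export this table via the Arrow PyCapsule stream interface.

        Sketch only: to_arrow() is assumed and the data is materialized up
        front; a proper implementation would construct the ArrowArrayStream
        capsule from the Rust side and yield batches lazily.
        """
        return self.to_arrow().__arrow_c_stream__(requested_schema)
```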

paleolimbot commented 1 month ago

I think so!

An example of the import side:

https://github.com/apache/arrow-nanoarrow/blob/2aa2e697fd5d0ef3cd88961bb030f9e760db581b/python/src/nanoarrow/c_array_stream.py#L69-L74

On the export side, you probably want to emulate https://github.com/kylebarron/arro3/blob/227c505714865064389d626c7fe3d79e8bd315c0/pyo3-arrow/src/array.rs#L123-L141 (or something like it using arrow2/the stream, like maybe https://github.com/pola-rs/r-polars/pull/5/ ).
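
Once the export side exists, any protocol-aware consumer should be able to pull batches back out without daft-specific code, roughly like this (assuming a recent pyarrow and a hypothetical daft table `tbl` that implements the interface):

```python
import pyarrow as pa

# tbl is any object implementing __arrow_c_stream__ (e.g. a future daft table)
reader = pa.RecordBatchReader.from_stream(tbl)
for batch in reader:
    print(batch.num_rows)  # batches arrive as a stream, not one big table
```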

I hope that helps!

kylebarron commented 1 month ago

In case it's useful, I implemented the pycapsule interface for polars in https://github.com/pola-rs/polars/pull/17676 and https://github.com/pola-rs/polars/pull/17693. I figure polars-arrow hasn't diverged too much from arrow2, so it should be pretty similar to implement the pycapsule interface in daft.