ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

Implement Arrow PyCapsule Interface & make pyarrow optional dependency #268

Open kylebarron opened 1 month ago

kylebarron commented 1 month ago

The Arrow project recently created a new protocol for sharing Arrow data in Python, the Arrow PyCapsule Interface. One of the goals of the protocol is to allow exporting and importing Arrow data in Python without necessarily having to use PyArrow as an intermediary.

This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
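As a rough sketch of that recognition step, a consumer can duck-type on the three dunder methods the PyCapsule interface defines (__arrow_c_stream__, __arrow_c_array__, __arrow_c_schema__). The helper name arrow_export_kind and the FakeSheet class are hypothetical and exist only for illustration; they are not part of fastexcel or any Arrow library.

```python
from typing import Optional

# Dunder methods defined by the Arrow PyCapsule Interface, checked from the
# most general export (stream) down to schema-only.
_ARROW_DUNDERS = (
    ("__arrow_c_stream__", "stream"),
    ("__arrow_c_array__", "array"),
    ("__arrow_c_schema__", "schema"),
)


def arrow_export_kind(obj: object) -> Optional[str]:
    """Return which kind of Arrow export `obj` advertises, or None.

    Hypothetical helper: it only checks that the method exists and is
    callable; it does not call it or validate the returned capsule.
    """
    for dunder, kind in _ARROW_DUNDERS:
        if callable(getattr(obj, dunder, None)):
            return kind
    return None


class FakeSheet:
    """Hypothetical stand-in for an Arrow-exportable object."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule here.
        raise NotImplementedError("illustration only")
```

Note that no Arrow library is imported: detection relies purely on the presence of the methods, which is what lets a consumer avoid a hard pyarrow dependency.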

A growing number of Python Arrow libraries are aware of the PyCapsule interface and would then be able to read from fastexcel directly, without needing to go through pyarrow or even have it installed in the environment.

For example, I have a PR open for polars in https://github.com/pola-rs/polars/pull/17693, but you could also pass the fastexcel object directly into constructors from pyarrow, nanoarrow, or arro3. I'm advocating for more projects to adopt the PyCapsule interface directly, including duckdb, datafusion, vegafusion, and daft.

In terms of implementation, fastexcel currently uses arrow-rs' default pyarrow integration. To support the protocol, you would instead define one or more dunder methods, probably on the ExcelSheet. If you always return a single RecordBatch, you could implement __arrow_c_array__; if you ever wanted to expose a lazy stream, you could implement __arrow_c_stream__, which exports multiple batches of data.
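To make the shape of those two methods concrete, here is a minimal Python-level sketch. In fastexcel the capsules would actually be created on the Rust side, so this only illustrates the signatures and return types the protocol expects. SheetExporter is a hypothetical class, not fastexcel's API, and it assumes it wraps some inner object that already implements both methods.

```python
class SheetExporter:
    """Hypothetical wrapper showing the producer side of the protocol."""

    def __init__(self, inner):
        # Assumption: `inner` already implements __arrow_c_array__ and
        # __arrow_c_stream__; this class only forwards to it.
        self._inner = inner

    def __arrow_c_array__(self, requested_schema=None):
        # Per the protocol, returns a 2-tuple of PyCapsules named
        # "arrow_schema" and "arrow_array" (one batch of data).
        return self._inner.__arrow_c_array__(requested_schema)

    def __arrow_c_stream__(self, requested_schema=None):
        # Per the protocol, returns a single PyCapsule named
        # "arrow_array_stream", which can yield multiple batches lazily.
        return self._inner.__arrow_c_stream__(requested_schema)
```

Both methods take an optional requested_schema argument, which a consumer may use to ask for a particular schema; a producer is allowed to ignore it when it cannot honor the request.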

I have a helper library, pyo3-arrow, that you can use to implement this; it is separate from arrow-rs for a few reasons. Alternatively, the relevant code is pretty small and self-contained, so you could vendor it if you don't want to add an external dependency.

lukapeschke commented 1 month ago

Thanks for the heads-up, I'll try to look into this when I have the time 🙂