cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0

Any plans to support Apache arrow bindings? #507

Open chitralverma opened 1 year ago

csringhofer commented 1 year ago

You mean extending the cursor to return fetch results in Arrow format instead of the current row-oriented way?

I don't know of any plans but it sounds like a good addition to Impyla.

chitralverma commented 1 year ago

> You mean extending the cursor to return fetch results in Arrow format instead of the current row-oriented way?
>
> I don't know of any plans but it sounds like a good addition to Impyla.

Yes, it would be great to have as_pyarrow_table and as_pyarrow_dataset options available somewhere to return the results as a PyArrow Table (eagerly) or a PyArrow Dataset (lazily), ideally with zero-copy.

Khalid-Nowaf commented 5 months ago

I would second this strongly. +1

I'm not a Python guy, but I'm using this since it is the only client/driver I know of for Impala that is stable and feature-complete. Adding Arrow data format support would allow us to wrap it in different languages/systems with minimal cost.

csringhofer commented 5 months ago

Adding a basic implementation similar to as_pandas (https://github.com/cloudera/impyla/blob/a3d80ef353f1bd779ab81166785e40dd2100d712/impala/util.py#L46) seems quite simple. I see two things that could make this more complicated:

  1. Performance - calling fetchall() converts the results from HS2's columnar format to a row-based format. Converting this back to a columnar format like Arrow would mean two unnecessary transpositions of the result set. Avoiding this overhead is possible but needs more work.
  2. Type conversions (e.g. timestamps, which are returned in HS2 as strings). This adds both complexity and potential performance issues.
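
To make the two points above concrete, here is a minimal sketch of the "basic" approach: take the row-based output of fetchall(), transpose it back into columns, and parse timestamp strings. This is not Impyla code. `fetch_columns` and its `timestamp_columns` parameter are hypothetical names, and sqlite3 stands in for an Impala connection only because it also exposes a DB API 2.0 cursor. It also assumes the timestamp strings are in a form that `datetime.fromisoformat` accepts, which may not hold for every value HS2 returns (for example, nanosecond precision).

```python
import sqlite3
from datetime import datetime

try:
    import pyarrow as pa  # optional; only needed to build the final Table
except ImportError:
    pa = None


def fetch_columns(cursor, timestamp_columns=()):
    """Hypothetical helper: turn fetchall() rows into a dict of column lists.

    This is the second, avoidable transposition mentioned above: the rows
    returned by fetchall() are turned back into columns.
    """
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    # zip(*rows) transposes rows into columns; handle an empty result set.
    columns = [list(c) for c in zip(*rows)] if rows else [[] for _ in names]
    result = dict(zip(names, columns))
    # Type conversion: timestamps arrive as strings and are parsed one by one.
    for name in timestamp_columns:
        result[name] = [
            None if value is None else datetime.fromisoformat(value)
            for value in result[name]
        ]
    return result


# Stand-in data source (assumption: any DB API 2.0 cursor behaves alike here).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, ts TEXT)")
cur.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "2024-01-02 03:04:05"), (2, None)],
)
cur.execute("SELECT id, ts FROM t ORDER BY id")
cols = fetch_columns(cur, timestamp_columns=["ts"])

if pa is not None:
    # With pyarrow installed, the column dict maps directly onto a Table.
    table = pa.Table.from_pydict(cols)
```

The per-value Python loop for timestamps is where the cost mentioned in point 2 would show up on large result sets. A real implementation would more likely work from the HS2 columnar buffers directly instead of going through fetchall().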