apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

Python: missing dependency declaration on pyarrow? #1908

Closed SebAlbert closed 6 days ago

SebAlbert commented 3 weeks ago

What would you like help with?

After installing the package adbc_driver_postgresql via pip, I get a runtime error on import that suggests (and is indeed fixed by) also installing pyarrow via pip:

Traceback (most recent call last):
  File "/...venv/lib/python3.11/site-packages/adbc_driver_manager/dbapi.py", line 42, in <module>
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "....my_code.py", line 9, in <module>
    from adbc_driver_postgresql import dbapi
  File "/...venv/lib/python3.11/site-packages/adbc_driver_postgresql/dbapi.py", line 25, in <module>
    import adbc_driver_manager.dbapi
  File "/...venv/lib/python3.11/site-packages/adbc_driver_manager/dbapi.py", line 44, in <module>
    raise ImportError("PyArrow is required for the DBAPI-compatible interface") from e
ImportError: PyArrow is required for the DBAPI-compatible interface

Should this not be a declared requirement of the python package in the first place?

On the other hand, is there a more minimal option than installing pyarrow, which weighs in at 40 MB and in turn pulls in numpy at another 18 MB? It "feels" quite heavy.

jorisvandenbossche commented 3 weeks ago

As the import error mentions, pyarrow is required for the DBAPI-compatible interface, but you can use the lower-level interface without pyarrow. That's the reason it is not listed as a default required dependency, although this does lead to the suboptimal user experience you had, since most users will actually want pyarrow available in order to use the DBAPI interface.

But precisely because pyarrow is quite a heavy dependency, as you mention, we want to avoid forcing users to pull it in. At the moment there is no more minimal way to install pyarrow using pip (there is work in progress to remove the numpy dependency and to split the wheel so that a more minimal set of pyarrow functionality can be installed, but that will not land in the short term).

Depending on your use case / what you want to do with the resulting Arrow data from your query, you could look into other Arrow implementations with Python bindings such as nanoarrow.
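
Since pyarrow is an optional dependency here, application code may want to check for it up front instead of hitting the ImportError shown in the traceback. The following is only a minimal sketch using the standard library; the helper names has_module and choose_interface are made up for illustration and are not part of any ADBC package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be located.

    find_spec only looks the module up; it does not import it.
    """
    return importlib.util.find_spec(name) is not None


def choose_interface() -> str:
    """Decide which ADBC Python layer the application can use.

    Hypothetical helper: returns "dbapi" when pyarrow is installed,
    otherwise "low-level" (the interface that works without pyarrow).
    """
    if has_module("pyarrow"):
        return "dbapi"
    return "low-level"


if __name__ == "__main__":
    print("Usable ADBC interface:", choose_interface())
```

If the result is "dbapi", importing adbc_driver_postgresql.dbapi should not fail with the error above; otherwise the application can either stick to the lower-level interface or tell the user to install pyarrow.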

lidavidm commented 3 weeks ago

It's also possible we could provide the dbapi layer using just nanoarrow nowadays/soon.