apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

[C] Research ConnectorX/pgeon for optimizing libpq driver #71

Open lidavidm opened 2 years ago

lidavidm commented 2 years ago

Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html

lidavidm commented 2 years ago

ConnectorX

These optimizations would probably be difficult to support, though we should preallocate where possible.
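As a minimal sketch of the preallocation idea (a hypothetical helper, not ConnectorX or driver code): when the row count is known up front, the output can be allocated once instead of grown per row.

```python
# Hypothetical sketch of preallocation; not actual driver code.
def decode_preallocated(raw_rows):
    """Decode rows into a preallocated list instead of appending one at a time."""
    out = [None] * len(raw_rows)  # single allocation up front
    for i, raw in enumerate(raw_rows):
        out[i] = int(raw)         # stand-in for real per-row decoding
    return out
```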

Turbodbc

ConnectorX's docs compare it to Turbodbc, which tends to trail it; Turbodbc does not appear to implement parallelization, which might explain the difference.

Turbodbc also lists some optimizations: https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html

In particular, it can interleave I/O and conversion. That may be interesting for us, though libpq appears to offer only a choice between fetching rows one at a time and retrieving all query results at once.

Turbodbc also implements some memory optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
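As a rough sketch of those two ideas (illustrative Python, not Turbodbc's actual implementation):

```python
# Illustrative sketches of Turbodbc-style memory optimizations.

def minimum_int_width(values):
    """Smallest signed integer width in bits (8/16/32/64) that fits all values."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if -(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1)):
            return bits
    raise OverflowError("values exceed 64-bit signed range")

def dictionary_encode(strings):
    """Replace each string with a small index into a table of distinct values."""
    dictionary, indices, seen = [], [], {}
    for s in strings:
        if s not in seen:
            seen[s] = len(dictionary)
            dictionary.append(s)
        indices.append(seen[s])
    return dictionary, indices
```

Dictionary encoding pays off when a string column has few distinct values; the width check pays off when, say, an int64 column only ever holds small values.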

pgeon

0x0L commented 2 years ago

@lidavidm for pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my [limited] testing

lidavidm commented 2 years ago

Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor which isn't great.

dhirschfeld commented 1 year ago

> In particular, it can interleave I/O and conversion

If you're implementing an async interface then, as a trio user, it would be great if you could use anyio rather than native asyncio features. That would enable the code to be used with any async library.

Perhaps the most prominent Python library to support AnyIO is FastAPI, and that's where I'd (eventually) like to make use of adbc: asynchronously connecting to databases to display data in FastAPI dashboards.

lidavidm commented 1 year ago

Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet - I need to look at whether callbacks, polling, or something else is preferred/ergonomic.)

Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).
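One common way to bridge a blocking "base" API into async code is to run it on a worker thread. A sketch (`blocking_fetch` is a hypothetical stand-in, not a real ADBC call; asyncio's `to_thread` is used here, and anyio offers the analogous `anyio.to_thread.run_sync` for backend-agnostic code):

```python
import asyncio

# Hypothetical sketch: `blocking_fetch` stands in for a blocking ADBC call;
# it is not a real adbc_driver_manager API.
def blocking_fetch(query):
    return f"results for {query!r}"  # placeholder for execute + fetch

async def fetch_async(query):
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_fetch, query)

print(asyncio.run(fetch_async("SELECT 1")))  # → results for 'SELECT 1'
```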

lidavidm commented 1 year ago

This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html

paleolimbot commented 1 year ago

Is that before or after #636?

FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.
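For reference, such a benchmark only needs a decoder for the raw buffer. An illustrative sketch for a single int4 column, following the documented `COPY ... (FORMAT binary)` layout (11-byte signature, flags, header extension, then per-tuple field count and length-prefixed fields); this is not the driver's implementation:

```python
import struct

SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # fixed 11-byte COPY BINARY signature

def parse_copy_binary_int4(buf):
    """Decode a COPY ... (FORMAT binary) buffer holding a single int4 column."""
    assert buf[:11] == SIGNATURE
    offset = 11
    flags, ext_len = struct.unpack_from("!ii", buf, offset)
    offset += 8 + ext_len  # skip flags and header extension
    values = []
    while True:
        (nfields,) = struct.unpack_from("!h", buf, offset)
        offset += 2
        if nfields == -1:  # a field count of -1 marks the file trailer
            break
        for _ in range(nfields):
            (length,) = struct.unpack_from("!i", buf, offset)
            offset += 4
            if length == -1:  # NULL field
                values.append(None)
            else:
                (value,) = struct.unpack_from("!i", buf, offset)
                offset += length
                values.append(value)
    return values
```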

kou commented 1 year ago

I think that it's "after". The benchmark is https://github.com/apache/arrow-flight-sql-postgresql/tree/main/benchmark/integer .

kou commented 1 year ago

FYI: The slide URL in the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/

lidavidm commented 1 year ago

Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c ? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.

kou commented 1 year ago

Yes. (https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.sql is the same performance as select.c.)

lidavidm commented 1 year ago

> Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.

FWIW, this is mentioned in the issue above. I think when I looked at it, it seemed like libpq would read the entire response before returning to you.

paleolimbot commented 1 year ago

Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download -> decode, and in theory we could do

download -> download -> download -> download -> download ->
                        sync -> decode -> wait              sync -> decode

...such that the only time the user pays for is download time. (Probably complicated to get right, though.)
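A toy sketch of that pipelining idea, with a bounded queue between a downloader thread and the decoding loop (simulated work, not libpq code):

```python
import queue
import threading
import time

def download(batches, q):
    for batch in batches:
        time.sleep(0.01)  # simulated network read
        q.put(batch)
    q.put(None)  # sentinel: no more batches

def run_pipeline(batches):
    q = queue.Queue(maxsize=2)  # bounded queue caps how much data is in flight
    t = threading.Thread(target=download, args=(batches, q))
    t.start()
    decoded = []
    while (batch := q.get()) is not None:
        # Simulated decode; it overlaps with the next download on the thread.
        decoded.append([x * 2 for x in batch])
    t.join()
    return decoded

print(run_pipeline([[1, 2], [3, 4]]))  # → [[2, 4], [6, 8]]
```

With the bounded queue, decode time hides behind download time, so ideally the user only pays for the downloads.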