lidavidm opened 2 years ago
These optimizations would probably be difficult to support, though we should preallocate where possible.
ConnectorX's docs compare it to Turbodbc, which tends to trail it; however, Turbodbc does not appear to implement parallelization, which might explain the difference.
Turbodbc also lists some optimizations: https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html
In particular, it can interleave I/O and conversion. That may be interesting for us, though libpq seems to give you only a choice between fetching rows one at a time and getting all query results at once.
Turbodbc also implements some memory optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
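As a rough illustration of the second optimization, here is a minimal, hypothetical sketch of picking the narrowest signed integer type that fits a buffered batch of values; the function name and structure are illustrative, not Turbodbc's actual implementation:

```python
def minimal_int_width(values):
    """Return the smallest signed bit width (8/16/32/64) that fits all values."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        # signed range for this width: [-2^(bits-1), 2^(bits-1) - 1]
        if -(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1)):
            return bits
    raise OverflowError("values exceed 64-bit signed range")

print(minimal_int_width([0, 100, -5]))   # → 8
print(minimal_int_width([0, 70_000]))    # → 32
```

The idea is that a driver buffering a batch before conversion can inspect the observed range and allocate the narrowest column type, rather than defaulting to 64-bit integers.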
@lidavidm for pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my (limited) tests.
Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor which isn't great.
In particular, it can interleave I/O and conversion
If you're implementing an async interface, as a trio user, it would be great if you could use anyio rather than native asyncio features. This will enable the code to be used with any async library.
Perhaps the most prominent Python library to support AnyIO is fastapi, and that's where I'd (eventually) like to make use of adbc - asynchronously connecting to databases for displaying data in FastAPI dashboards.
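For context, a minimal sketch of exposing a blocking driver call to async code looks like the following. This uses the stdlib `asyncio.to_thread` (Python 3.9+); AnyIO's backend-agnostic equivalent is `anyio.to_thread.run_sync`, which works under both asyncio and trio. The `fetch_rows` function here is a hypothetical stand-in for a blocking database call:

```python
import asyncio

def fetch_rows():
    # Hypothetical stand-in for a blocking ADBC/database call.
    return [(1, "a"), (2, "b")]

async def main():
    # Offload the blocking call to a worker thread so the event loop stays
    # responsive. With AnyIO, the equivalent would be:
    #   rows = await anyio.to_thread.run_sync(fetch_rows)
    return await asyncio.to_thread(fetch_rows)

print(asyncio.run(main()))  # → [(1, 'a'), (2, 'b')]
```

Writing against the AnyIO form is what makes the wrapper usable from trio-based applications as well.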
Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet - I need to look at whether callbacks, polling, or something else is preferred/ergonomic.)
Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).
This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html
Is that before or after #636?
FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.
I think that it's "after". The benchmark is https://github.com/apache/arrow-flight-sql-postgresql/tree/main/benchmark/integer .
FYI: The slide URL in the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/
Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c ? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.
Yes.
(https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.sql performs the same as select.c.)
Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.
FWIW, this is mentioned in the issue above. I think when I looked at it, it seemed like libpq would read the entire response before returning to you.
Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download -> decode, and in theory we could do

download -> download -> download -> download -> download -> ...
    sync -> decode -> wait -> sync -> decode -> ...

...such that the only time the user pays for is download time. (Probably complicated to get right, though.)
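The overlap described above amounts to a producer/consumer pipeline: one thread keeps reading from the connection while another converts. A minimal sketch with a bounded queue (all names and the chunk/decode split are illustrative, not the driver's actual code):

```python
import queue
import threading

def pipeline(chunks, decode):
    """Overlap 'download' and 'decode' (sketch).

    `chunks` stands in for reading COPY data off the connection;
    `decode` stands in for conversion to Arrow.
    """
    buf = queue.Queue(maxsize=8)  # bounded: backpressure keeps memory use in check

    def downloader():
        for chunk in chunks:
            buf.put(chunk)  # runs ahead while the consumer decodes
        buf.put(None)       # sentinel: no more data

    threading.Thread(target=downloader, daemon=True).start()

    out = []
    while (chunk := buf.get()) is not None:
        out.append(decode(chunk))  # decoding overlaps the next download
    return out

print(pipeline([b"1", b"2", b"3"], lambda c: int(c)))  # → [1, 2, 3]
```

The tricky parts in a real driver are exactly what the comment anticipates: error propagation from the background thread, cancellation, and sizing the buffer so downloads don't balloon memory.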
Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html