rouault opened 1 month ago
For good measure CC @ptitjano as a QDuckDB contributor (I unfortunately don't know the github handle of Florent Fougères)
Including @krlmlr, maintainer of the duckdb R package 🙏
This would be awesome! I suspect it might be hard for OGR to get as fast as DuckDB's parquet stuff, as DuckDB is really good at optimizing to take advantage of all the cores, and they've spent a ton of time making parquet parsing work really, really well.
I was thinking of a much more lightweight integration, which is just supporting DuckDB with a vector driver. I'm usually using DuckDB to go in and out of Parquet, but sometimes it's nice to just leave it as a duckdb database, and it'd be nice to just throw that directly into QGIS. Like don't worry about getting the arrow stream with python to start - just connect to it like you do directly to other databases / formats. Not sure if that helps with any derisking of the core stuff, but it's definitely something that would be useful to me.
@nyalldawson because excited folks are wanting QGIS to Quack 😃
Thank you @rouault for the ping.
Indeed, we, Oslandia, have been developing a QGIS python plugin (QDuckDB) for one of our clients. It should work well and it offers a straightforward way to load and use a DuckDB database in QGIS.
However, we have reached some limits from a performance perspective with the python QGIS overhead. We have been able to solve some of them, but we are limited by the architecture of a QGIS provider, which relies on an iterator and returns the rows of a table one by one (a sketch of that access pattern is below). This is really costly on the python side. I think it would be very interesting to add support for the Arrow Stream (or a similar interface if this one is really deprecated), which would allow taking full advantage of the DuckDB architecture.
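For illustration, here is a minimal sketch of the per-row access pattern a QGIS provider imposes, using the QGIS C++ API (the Python bindings mirror it, paying one interpreter round-trip per feature, which is where the overhead adds up):

```cpp
// Minimal sketch of the row-by-row access pattern of the QGIS
// provider/iterator architecture. Each feature requires one nextFeature()
// call; from Python that is one interpreter round-trip per row, which
// dominates with large DuckDB result sets. No batch/columnar path here.
#include "qgsvectorlayer.h"
#include "qgsfeature.h"
#include "qgsfeatureiterator.h"
#include "qgsfeaturerequest.h"

void iterateLayer(QgsVectorLayer *layer)
{
    QgsFeature feature;
    QgsFeatureIterator it = layer->getFeatures(QgsFeatureRequest());
    while (it.nextFeature(feature))
    {
        // process one row at a time
    }
}
```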
Before making the python plugin, we asked ourselves if we could directly write a C++ provider in QGIS core. However, we would have faced some difficulties, some of them already mentioned by @rouault:
- DuckDB packaging is still limited at the moment (in the python plugin, we embed the duckdb dependency to solve this issue).
- DuckDB was still in the 0.x days, and it would be much easier to evolve a python provider in case of any API break.

As a side note, the few interactions we had with the DuckDB community (mostly a few issues on github and some discord chats) have been very positive, and everyone has been helpful. Thank you!
I think that an OGR driver would be a great addition, and we, Oslandia, would be very much interested in helping move it forward.
cc @florentfgrs (Florent Fougères)
I'm out drinking beer. I'll respond tomorrow, but I have a lot of thoughts about this!
I was thinking of a much more lightweight integration, which is just supporting DuckDB with a vector driver.
Obviously! I jumped right away to the stretch goal of having a DuckDB SQL dialect, but that's clearly the ultimate step of the integration and possibly not easily reachable (or just not delivering stellar performance, depending on its requirements on the input source). The traditional OGR driver has for sure to come first, reading both DuckDB native databases and, cherry on the cake, Parquet files (or CSV, etc.).

For that, my mention of using the ArrowStream interface is actually a way of having a very small code base, because I've already coded the generic bidirectional OGRFeature <--> ArrowArray converter in past GDAL versions, and I've leveraged it in my toy/demo OGRPyDuckDB prototype. So the questions are whether the ArrowStream interface of DuckDB is available through C or C++ (the use of the Python duckdb package was because it provided a """convenient""" way of getting an ArrowStream, ahem, pending embedding Python in C++, for which there's precedent in GDAL; anyway, that's definitely not the target architecture I'm thinking about), and whether getting data through the ArrowStream interface leads to optimal or near-optimal performance.

My naive thought is that, DuckDB being columnar by nature, this should in theory be a cheap adaptation layer, but reality might be more complex if DuckDB uses optimized internal representations (which I believe they do); only experts on DuckDB core could confirm or deny that. Otherwise we might have to use its specific API to get the best performance, sacrificing a bit of implementation convenience. But for now, all points under the "not all roses" section should be considered blockers to solve for the effort to succeed. (A sketch of what consuming an ArrowStream looks like on the C++ side follows.)
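For concreteness, this is roughly what the consuming side looks like; the struct layouts are fixed by the Arrow C data/stream interface specification, and the OGR hookup is only indicated in a comment:

```cpp
// Hedged sketch: pulling batches from an Arrow C stream (ArrowArrayStream),
// the handoff that GDAL's generic OGRFeature <--> ArrowArray converter can
// plug into. Structs are per the Arrow C data/stream interface spec.
#include <cstdint>
#include <cstdio>

struct ArrowSchema; // only needed by pointer here

struct ArrowArray
{
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void **buffers;
    struct ArrowArray **children;
    struct ArrowArray *dictionary;
    void (*release)(struct ArrowArray *);
    void *private_data;
};

struct ArrowArrayStream
{
    int (*get_schema)(struct ArrowArrayStream *, struct ArrowSchema *out);
    int (*get_next)(struct ArrowArrayStream *, struct ArrowArray *out);
    const char *(*get_last_error)(struct ArrowArrayStream *);
    void (*release)(struct ArrowArrayStream *);
    void *private_data;
};

// Pull batches until the producer signals end-of-stream, which the spec
// defines as get_next() succeeding with a released (release == NULL) array.
void DrainStream(ArrowArrayStream *stream)
{
    ArrowArray batch;
    while (stream->get_next(stream, &batch) == 0 && batch.release != nullptr)
    {
        // ... this is where the generic ArrowArray -> OGRFeature converter
        // would consume the columnar batch ...
        std::printf("got a batch of %lld rows\n",
                    static_cast<long long>(batch.length));
        batch.release(&batch); // consumer releases each batch
    }
    stream->release(stream); // and finally releases the stream itself
}
```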
I'm out drinking beer.
Enjoy it and your weekend. No emergency. If I've labelled this ticket as "rumination", it's a hint that, on my side at least, there's no identified funding vehicle to make that happen.
As a minor contributor to QDuckDB, and really impressed by the DuckDB project, I'm excited to see support for it coming to GDAL!
What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect[...]
Exactly what I would expect!
Glad to see interest in integrating DuckDB into GDAL. Can't wait to see what happens next!
Sooo... argh, where to begin? I guess I'll just start by responding to some of the points brought up:
What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect.
Potentially. We're experimenting with "parser extensions", and there are already some extensions that do add new syntax, like PRQL and duckdb-pgq. However, these integrate very deeply with DuckDB's unstable C++ internals, so I don't know how sustainable doing something similar right now would be in the long term.
one major point is the packaging story of DuckDB.
Yes, iirc we at one point attempted to ship a pre-built static binary of the duckdb library to VCPKG, but it was rejected as they did not like us vendoring our own versions of third-party dependencies. I'm not sure if e.g. Debian has similar qualms, but I would expect so.
One thing that could be interesting for GDAL is to use DuckDB's amalgamation build, which gives you a single huge c++ file + header of the entire duckdb source code (sqlite style). This is how we've dealt with distributing DuckDB in a few of the client libraries (e.g. swift, golang). We also include the parquet extension in the amalgamation build by default, and it should be possible to embed others as well with some modification to the generation script. The downside of this is of course that you may end up needing a pretty beefy machine to actually compile it.
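To make that concrete, a minimal sketch of what building against the amalgamation could look like from a host project (the amalgamated duckdb.cpp would be compiled into the same target; the queried file name is made up):

```cpp
// Hedged sketch: using DuckDB through its C++ API as a vendored
// amalgamation (a single duckdb.cpp + duckdb.hpp pair, sqlite3.c style).
#include "duckdb.hpp"
#include <cstdio>

int main()
{
    duckdb::DuckDB db(nullptr); // nullptr path = in-memory database
    duckdb::Connection con(db);

    // The parquet extension is bundled in the amalgamation by default,
    // so Parquet files can be queried directly.
    auto result = con.Query("SELECT count(*) FROM 'poly.parquet'");
    if (result->HasError())
    {
        std::fprintf(stderr, "%s\n", result->GetError().c_str());
        return 1;
    }
    result->Print();
    return 0;
}
```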
I have been confused by the DuckDB C API
We've made a lot of changes to the way result sets are handled in the C API, and the plan is to replace the current arrow api with something better in the future (that is also more performant / doesn't have to materialize everything). We ended up marking things as deprecated even though there are no replacements yet, with the idea that it is better to be early with this sort of stuff, but you're not alone in finding that confusing haha.
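For the curious, a rough sketch of the newer chunk-based result handling in the C API, as I read today's duckdb.h (given the deprecation churn just described, treat exact names as subject to change):

```cpp
// Hedged sketch: chunk-based result fetching via the DuckDB C API, which
// avoids materializing the result row by row on the caller's side.
#include "duckdb.h"
#include <cstdio>

int main()
{
    duckdb_database db;
    duckdb_connection con;
    duckdb_result res;

    if (duckdb_open(nullptr, &db) == DuckDBError) return 1; // in-memory
    if (duckdb_connect(db, &con) == DuckDBError) return 1;
    if (duckdb_query(con, "SELECT 42 AS answer", &res) == DuckDBError) return 1;

    // Pull results one data chunk at a time.
    duckdb_data_chunk chunk;
    while ((chunk = duckdb_fetch_chunk(res)) != nullptr)
    {
        idx_t rows = duckdb_data_chunk_get_size(chunk);
        std::printf("chunk with %llu rows\n",
                    static_cast<unsigned long long>(rows));
        duckdb_destroy_data_chunk(&chunk);
    }

    duckdb_destroy_result(&res);
    duckdb_disconnect(&con);
    duckdb_close(&db);
    return 0;
}
```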
another potential pain point is that DuckDB Spatial extension statically links GDAL.
I think this is probably the biggest challenge regarding integrating DuckDB in GDAL. There's a couple of different angles to this.
One idea is to consider integrating DuckDB only for its parquet capabilities. But even if we ignore the spatial extension, there is an implicit dependency between the two when reading/writing GeoParquet. Basically, the parquet extension will look through the DuckDB catalog to find the ST_GeomFromWKB, ST_AsWKB, ST_Extent and ST_ZMFlag functions. It is possible to do the conversion and metadata reading from outside DuckDB when reading, but it gets a lot more complicated when writing, as you would have to perform the extent/geometry type aggregation separately and somehow slap it onto the parquet footer after the file is written...? Alternatively, we could provide these required functions separately in a sort of statically linked "mini" spatial extension, just to enable DuckDB's geoparquet conversion for GDAL. (A small sketch of this implicit dependency is below.)
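A hedged sketch of that dependency from the DuckDB C++ API side (file and column names are made up for illustration):

```cpp
// Without spatial loaded, a GeoParquet geometry column surfaces as a plain
// WKB BLOB and any conversion must happen on the caller's side. With
// spatial loaded, the parquet extension can find ST_GeomFromWKB & co. in
// the catalog and hand back real GEOMETRY values; writing similarly relies
// on ST_AsWKB / ST_Extent / ST_ZMFlag for the footer metadata.
#include "duckdb.hpp"

void readGeoParquet(duckdb::Connection &con)
{
    con.Query("SELECT geometry FROM 'poly.parquet'"); // raw WKB blob

    con.Query("LOAD spatial");
    con.Query("SELECT ST_AsText(geometry) FROM 'poly.parquet'"); // GEOMETRY
}
```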
Another approach, one that I've had in the back of my mind as a long-term goal for DuckDB spatial, is to remove GDAL from the spatial extension itself and ship a spatial-aware DuckDB driver for GDAL, thus "inverting" the dependency. This is a pretty radical move (perhaps equally radical as embedding GDAL in the first place), but I do think it makes a lot of sense.
The main things that would need to happen for this are:

- Limit the st_read function to import/export only a handful of vector formats (FGB, SHP, GeoJSON). We already have an experimental native shapefile reader, and I don't think FGB or GeoJSON are unfeasible to add as well.
- For users that still want to use GDAL for some of the more obscure formats, create a GDAL extension that is separate from spatial, perhaps using DuckDB's new stable C-extension-API (which leads to MUCH smaller extension binaries and build times), and benefit from future build improvements in GDAL.
One question would be how to deal with PROJ / the proj database, as that would most likely still need to be embedded in both spatial and the GDAL extension (and maybe will also be embedded in GDAL itself in the future), and it feels a bit silly to have this big 9 MB blob duplicated between all of these binaries. But we are currently thinking of different ways to slim down the PROJ database we ship and/or make it configurable/downloadable at runtime anyway; we're starting to get other DuckDB extensions that want to deal with external resource files at runtime.
Another approach, that I've had in the back of my mind as one of the long-term goals for DuckDB spatial is to remove GDAL from the spatial extension itself
Maybe it goes without saying, but I don't want this to reflect negatively on GDAL. Embedding GDAL into spatial was a fantastic way to make the spatial extension really useful from the get-go, and it generated a lot of attention for us early on. It was only really feasible thanks to the amazing work championed in https://github.com/OSGeo/gdal/pull/5830. But as both DuckDB and spatial have matured over the last two years, it would be awesome if we could not just benefit from GDAL but also be part of what makes GDAL great!
The downside of this is of course that you may end up needing a pretty beefy machine to actually compile it.
I would expect most packagers to actively hate that, even more than the vendored DuckDB build, and to just remove the amalgamated DuckDB-within-GDAL, or not enable it, which would defeat the purpose. The packager-friendly solution would be for DuckDB to offer the possibility to link against "system" / external versions of its dependencies, at least for those that are already packaged (there's obviously a fair amount of work to do to be able to link against external libs). Packagers don't like vendored components because they are a nightmare to deal with when there is a security flaw in one of them: they have to identify all software that vendors the component and patch each of them.
https://github.com/OSGeo/gdal/pull/11003 adds an ADBC driver that can use libduckdb to read Parquet or DuckDB datasets. Before you get too excited: no spatial support for now :-) but it should be doable as a follow-up. This will be GDAL 3.11 material, as we are too close to the imminent 3.10 feature freeze.
and ship a spatial aware DuckDB driver for GDAL, thus "inverting" the dependency.
Just a note that the reverse is also possible (export an ADBC driver init function from GDAL that wraps OGR). I am not sure that would help today, because I don't believe DuckDB can federate via ADBC, but in theory it could work quite nicely with one copy of GDAL and one copy of DuckDB (e.g., installed via pip) registered with the Python driver manager (or R, or some application embedding both). (A sketch of driving libduckdb through the ADBC driver manager is below.)
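For reference, a rough sketch of pointing the ADBC driver manager at libduckdb from C++; the "duckdb_adbc_init" entrypoint, the "path" option and the database file name are my assumptions about what current libduckdb exposes:

```cpp
// Hedged sketch: opening a DuckDB database through the ADBC driver manager.
// Assumes adbc.h plus the driver manager library; error handling trimmed.
#include <adbc.h>
#include <cstdio>

int main()
{
    AdbcError error = {};
    AdbcDatabase database = {};

    AdbcDatabaseNew(&database, &error);
    AdbcDatabaseSetOption(&database, "driver", "libduckdb.so", &error);
    AdbcDatabaseSetOption(&database, "entrypoint", "duckdb_adbc_init", &error);
    AdbcDatabaseSetOption(&database, "path", "my.duckdb", &error);
    if (AdbcDatabaseInit(&database, &error) != ADBC_STATUS_OK)
    {
        std::fprintf(stderr, "init failed: %s\n",
                     error.message ? error.message : "?");
        return 1;
    }

    AdbcConnection connection = {};
    AdbcConnectionNew(&connection, &error);
    AdbcConnectionInit(&connection, &database, &error);

    // ... AdbcStatementNew / AdbcStatementSetSqlQuery /
    // AdbcStatementExecuteQuery then yields an ArrowArrayStream, the same
    // interface GDAL's generic converter consumes ...

    AdbcConnectionRelease(&connection, &error);
    AdbcDatabaseRelease(&database, &error);
    return 0;
}
```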
ADBC driver now merged into GDAL master / 3.11.0dev: https://gdal.org/en/latest/drivers/vector/adbc.html
Latest ghcr.io/osgeo/gdal Docker image incorporates libduckdb.so, so you can do:
$ docker run --rm -it -v $HOME:$HOME ghcr.io/osgeo/gdal ogrinfo -if ADBC \
$PWD/autotest/ogr/data/parquet/poly.parquet \
-sql "select st_astext(geometry), * from poly" \
-oo PRELUDE_STATEMENTS="load spatial"
INFO: Open of `/home/even/gdal/gdal/build_cmake/autotest/ogr/data/parquet/poly.parquet'
using driver `ADBC' successful.
Layer name: RESULTSET
Geometry: None
Feature Count: 10
Layer SRS WKT:
(unknown)
st_astext(geometry): String (0.0)
AREA: Real (13.3)
EAS_ID: Integer64 (0.0)
PRFEDEA: String (0.0)
geometry: Binary (0.0)
geometry_bbox.xmin: Real(Float32) (0.0)
geometry_bbox.ymin: Real(Float32) (0.0)
geometry_bbox.xmax: Real(Float32) (0.0)
geometry_bbox.ymax: Real(Float32) (0.0)
OGRFeature(RESULTSET):0
st_astext(geometry) (String) = POLYGON ((479819.84375 4765180.5, 479859.875 4765270, 479909.875 4765370, 479980.21875 4765409.5, 480019.71875 4765319.5, 480059.90625 4765239.5, 480088.8125 4765139.5, 480082.96875 4765049.5, 480080.28125 4764979.5, 480133.96875 4764856.5, 480389.6875 4764950, 480365 4765015.5, 480202.28125 4765482, 480159.78125 4765610.5, 480035.34375 4765558.5, 480039.03125 4765539.5, 479730.375 4765400.5, 479647 4765369.5, 479690.1875 4765259.5, 479819.84375 4765180.5))
AREA (Real) = 215229.266
EAS_ID (Integer64) = 168
PRFEDEA (String) = 35043411
geometry (Binary) = 0204000000000000E033EA487169914AB690EA48556F914A02000000010000001400000000000000000000602F491D41000000207F2D524100000080CF491D4100000080952D524100000080974A1D4100000080AE2D5241000000E0B04B1D4100000060B82D5241000000E04E4C1D41000000E0A12D5241000000A0EF4C1D41000000E08D2D524100000040634D1D41000000E0742D5241000000E04B4D1D41000000605E2D524100000020414D1D41000000E04C2D5241000000E0174E1D41000000202E2D5241000000C016521D4100000080452D524100000000B4511D41000000E0552D524100000020294F1D4100000080CA2D5241000000207F4E1D41000000A0EA2D5241000000608D4C1D41000000A0DD2D5241000000209C4C1D41000000E0D82D524100000080C9471D4100000020B62D5241000000007C461D4100000060AE2D5241000000C028471D41000000E0922D5241000000602F491D41000000207F2D5241
geometry_bbox.xmin (Real(Float32)) = 479647.0
geometry_bbox.ymin (Real(Float32)) = 4764856.5
geometry_bbox.xmax (Real(Float32)) = 480389.69
geometry_bbox.ymax (Real(Float32)) = 4765610.5
[... snip ...]
(the driver is still not spatial, but with the PRELUDE_STATEMENTS="load spatial" open option you load the spatial extension and can use its functions)
I've been fairly impressed by my testing of DuckDB. The speed of their SQL engine is really impressive, and their Parquet reader shines too (or at least the interaction with it through SQL does). I don't know if I did something wrong with the Arrow Parquet reader (I don't think so..?!?), but it is obvious that querying Parquet through DuckDB is much faster than through OGR SQL + Arrow Parquet, especially (I believe) when the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities. Or maybe fully leveraging Arrow's capabilities would require even further effort. I dunno.
What could be very cool, but I don't know if duckdb allows it through the API, would be to have a "DuckDB SQL" dialect, similar to our current "SQLite SQL" dialect, where we could for example expose an OGR layer as an ArrowStream, make DuckDB process the SQL statement, and get back the result (obviously the DuckDB spatial extension is able to do that, probably using some internal API available to extensions).
That said, it's not all roses (the pain points are the ones quoted in the replies above: the packaging story, the confusing C API, and spatial statically linking GDAL). As a proof of concept, my toy driver uses Python's __arrow_c_stream__ to retrieve an ArrowStream and use it on the GDAL C++ side; a sketch of that handoff is below.
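Roughly, and heavily simplified (the "arrow_array_stream" capsule name is fixed by the Arrow PyCapsule spec; pyResult stands for whatever Python-side duckdb object the driver holds, and everything else is illustrative):

```cpp
// Hedged sketch: an embedded Python interpreter asks a duckdb result object
// for its __arrow_c_stream__ PyCapsule and moves the ArrowArrayStream to C++.
#include <Python.h>

// ArrowArrayStream as defined by the Arrow C stream interface
// (same struct as in the earlier sketch).
struct ArrowSchema;
struct ArrowArray;
struct ArrowArrayStream
{
    int (*get_schema)(struct ArrowArrayStream *, struct ArrowSchema *);
    int (*get_next)(struct ArrowArrayStream *, struct ArrowArray *);
    const char *(*get_last_error)(struct ArrowArrayStream *);
    void (*release)(struct ArrowArrayStream *);
    void *private_data;
};

bool GetStreamFromPython(PyObject *pyResult, ArrowArrayStream *out)
{
    PyObject *capsule =
        PyObject_CallMethod(pyResult, "__arrow_c_stream__", nullptr);
    if (!capsule)
        return false; // Python exception pending
    auto *src = static_cast<ArrowArrayStream *>(
        PyCapsule_GetPointer(capsule, "arrow_array_stream"));
    if (!src)
    {
        Py_DECREF(capsule);
        return false;
    }
    *out = *src;            // move the stream's contents to the caller...
    src->release = nullptr; // ...so the capsule destructor won't release it
    Py_DECREF(capsule);
    return true;
}
```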
It's there: https://github.com/rouault/gdal/commit/422220b599546b00dc0860cf59c2493fe5fd6b4a. It works ... as a prototype (note: I didn't do anything spatial in it) ... but my findings are that there might be an inefficiency in the way DuckDB maps its "native model" to Arrow: it seems to fully "materialize" the whole result set and lacks an easy way to specify the "page size", which, even with its huge default paging, can be an issue when result sets exceed available RAM. Apologies if my conclusions are wrong (lots of new source code to figure out...). Whereas using duckdb-the-binary, I've not observed that behavior: when playing with ulimit -v, I could easily get result sets largely exceeding the allocated virtual memory. So in the current state of things, to get nice speed, one should probably implement the native DuckDB result set API, which makes things less appetizing than using the ArrowStream one.

I don't have further concrete plans for now. Just a memory dump that might raise interest or dismay :-)
Perhaps this is not a good idea after all, from an "overall ecosystem point of view", although having speedy Parquet SQL in OGR itself is very tempting ...
CC FYI @jorisvandenbossche @Maxxen @dbaston