duckdb / duckdb_spatial

MIT License
426 stars 32 forks source link

Representation of spatial types on export to `ArrowArrayStream` #153

Open paleolimbot opened 9 months ago

paleolimbot commented 9 months ago

Currently when we export a spatial column to Arrow, we get:

import duckdb
duckdb.sql("LOAD spatial;")
duckdb.sql("SELECT ST_GeomFromText('POINT (0 1)') as geom").to_arrow_table()
#> pyarrow.Table
#> geom: binary
#> ----
#> geom: [[000020000000000000000000010000000000000000000000000000000000F03F]]

With geoarrow-python we can register extension types to propagate CRS and type name through pyarrow machinery:

import geoarrow.pyarrow as ga
array = ga.with_crs(ga.as_wkb(["POINT (0 1)"]), '{<some projjson>}')
array
#> GeometryExtensionArray:WkbType(geoarrow.wkb <{<some projjson>}>)[1]
#> <POINT (0 1)>
array.type.extension_name
#> 'geoarrow.wkb'
array.type.__arrow_ext_serialize__()
#> b'{"crs":"{<some projjson>}"}'

Would it be appropriate to export geometry columns to GeoArrow extension arrays (i.e., with ARROW:extension:name and ARROW:extension:type set according to https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ? It looks like what is exported by default is your internal binary representation (perhaps not for the "native" unserialized types), but I think there are other types that are reencoded on export to Arrow (e.g., boolean columns are bitpacked).

Maxxen commented 9 months ago

Of course! Now that GeoArrow is more mature we should definitely make sure that we can export our geometry types properly to it.

I'm not entirely sure how much we can influence the arrow functionality that duckdb provides in its core from an extension when it comes to custom types, or if that would require some changes to upstream duckdb, but I'll look into it for sure.

paleolimbot commented 9 months ago

Awesome! This may be a little deep into the internals of DuckDB extension land for me to be of much help, but with some scaffold of where to start I would certainly be willing to try!

kylebarron commented 9 months ago

Just chiming in that I would love tighter integration between duckdb spatial and geoarrow! Let me know if I can help; I just don't know C++

Maxxen commented 9 months ago

So I think the main difficulty here is enabling DuckDB's types to carry arbitrary metadata. As of now a "custom" duckdb type is really just a wrapper of another type + an alias. Somehow DuckDB core needs to be able to identify that the custom types we register in spatial should be encoded differently in arrow, and also figure out how to do that encoding.

This is the same problem we have with GeoParquet, we don't want extensions to have dependencies on each other (or even be aware of each other), so all communication has to go through DuckDB itself in a generic way.

I don't think it's too unreasonable to allow custom types to also carry arbitrary key-value data (similar to how parquet/arrow work?) that can be interpreted at different encoding/decoding endpoints. Although for GeoArrow I guess we'll need to be able to communicate how to decode the coordinate data from the binary format, maybe by passing a function pointer as the value for a "meta::arrow::serialization_func" key? I'll have to talk to the rest of the team some more, there might be an easier way to do this at the python level as well.

paleolimbot commented 9 months ago

Let us know! It seems like it would be a huge win for all extensions that implement types to customize Arrow in/out but I know little about the bigger picture!

kylebarron commented 9 months ago

I'll have to talk to the rest of the team some more, there might be an easier way to do this at the python level as well.

Python is a big use case, but it would be great to have C-based Arrow interop as well. Personally I'd love to connect duckdb spatial to the geoarrow-native algorithms I'm developing in Rust (wrapping georust), especially if there's no copy out of duckdb

kylebarron commented 5 months ago

One of the projects I'm working on is lonboard, a geoarrow-based interactive data visualization tool that can handle millions of coordinates. I'd love to have an example of using it with duckdb-spatial, visualizing the result of a query. And since DuckDB attributes are Arrow-like, there should be minimal conversion overhead. But this would be be much easier (and simpler for users) if DuckDB spatial could include GeoArrow metadata on its exports, even if that's just marking the geometry column as geoarrow.wkb.

paleolimbot commented 5 months ago

@Maxxen would it be possible to point to a few relevant places in the code base where these things are handled (e.g., where it gets decided what Arrow type to output, or where where the extension interface would need to be modified to make that happen)? I may have some time to take a stab at this in the next month or so.

Maxxen commented 5 months ago

@paleolimbot Sure! Here's my thoughts

One way to solve this is to introduce some sort of new ArrowConversionExtension object in the DBConfig that extensions can subclass and provide a callback(s) (I guess both ways) to some sort of arrow conversion routine, similar to how e.g. OptimizerExtension works. The arrow conversion code in arrow_converter.cpp could then first try to check if the type has an alias (e.g, is a custom type) and if that alias has a matching ArrowConversionExtension in the current config, accessible from the ClientContext, which is almost always threaded down everywhere. You'd probably have to do this check higher up anyway, the alias is only available on the LogicalType itself, once you enter the switch on LogicalTypeId it's too late. Im not actually too familiar with all the arrow stuff but it all lives inside DuckDB core. Here's a commit from a recent PR where I added Arrow support (and python object conversion) for the new ARRAY type.

However, this would be a hardcoded solution for just Arrow. What if we want to de/serialize our custom type differently in e.g. JSON or Parquet? Im going to work on GeoParquet soon so it's not entirely a what-if. However special casing this for Arrow might not be such a bad idea regardless and could be a good first step for custom type conversion which could be generalized/unified further in the future, so don't let me dissuade you :).

I guess you would maybe need some way to let the users control how this conversion is done though? I could imagine some users just want DuckDB geometries as an arrow array of WKB blobs in some cases, or want to decide if they want interleaved/split coordinate arrays. You could maybe do this with a database option (e.g. SET use_interleaved_arrow_spatial_conversion = TRUE) but it would be nicer to control per-query. Maybe you could pass on an optional list of arguments (kwargs-style) from the arrow conversion table function into the extension callback to interpret as they please. I think that's how our clients that can output arrow work (node, python), they just call a special table function... but im note sure? You would then need to expose those extra optional arguments in the client APIs too.


The bigger problem is that we would like to enable custom types to have their own custom "metadata". This metadata could be used to communicate attributes of types across extensions and core DuckDB. E.g. DuckDB could inspect this metadata when encountering an unknown custom type to see if it contains a specially labeled entry to a function pointer to a custom arrow conversion function. But this gets complicated as soon as you want to attach stuff to the type itself that's not easily persistable (like a function pointer) as you now need to defer the deserialization/binding of the type to after the required extension is loaded and can provide this ephemeral data again.

I did do some work on general custom metadata in this branch a while back but ran into a dead-end regarding persistence. I did talk to Mark about it and we sketched out a better plan for it but I haven't touched the branch since as I got busy with other stuff.

In that branch I was mostly concerned with using type metadata to enable custom types to have "parameters", e.g. having the geometry type optionally be parameterized over SRID GEOMETRY(EPGS:4326). We'd need to be a somewhat careful when designing how the casting rules should interpret differently parameterized type, but that's another issue.

paleolimbot commented 5 months ago

Thank you for the thoughts! In the next few weeks I'll do some poking and see if I can come up with anything useful.

Maxxen commented 2 months ago

Hey @paleolimbot did you end up working on this? Im thinking of revisiting this soon and have some fresh ideas on how to make this happen without blowing up the scope with all the custom type metadata stuff

paleolimbot commented 2 months ago

I did not end up working on this despite really wanting to! I am happy to help review (and will have more time after nanoarrow 0.5.0 is released in a week or two).

Maxxen commented 2 months ago

Great! Ill try to give it a shot soon then and ping you once I got something!