duckdb / duckdb_spatial

MIT License

GDF support #311

Open kentstephen opened 2 months ago

kentstephen commented 2 months ago

I know DuckDB works well with Pandas DataFrames, but I'd like to be able to write SQL the same way against GeoPandas GeoDataFrames (GDFs). As of now, when I write SQL on a GDF it returns this error:

NotImplementedException: Not implemented Error: Data type 'geometry' not recognized

This is with spatial installed and loaded. There is a workaround, but it's kind of wonky. Thank you for your time.

Maxxen commented 2 months ago

Hi! Thanks for opening this issue!

This is something I've been thinking about for a while, and it would be great to support some day. However, scanning the GDF "zero copy", the way DuckDB does with regular dataframes, might be difficult because (as far as I know) geopandas geometries store pointers into GEOS geometries in memory. The problem is that DuckDB bundles its own version of GEOS, so while we technically could just pass the pointers around, there's no guarantee that GEOS represents the geometries the same way across versions or has a stable ABI.

So there would have to be a conversion step, most likely to/from WKB (or maybe geoarrow, not sure if that's natively supported in geopandas now), similar to the workaround I imagine you've got going. We could definitely look at "hiding" this by adding support for doing the conversion automatically to DuckDB's BLOB type (which spatial can then ingest) inside the DuckDB Python client at some point in the future.
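To make the WKB conversion step concrete: WKB is a plain byte encoding that does not depend on any GEOS version, which is why it can cross the GeoPandas/DuckDB boundary where raw GEOS pointers cannot. Below is a minimal sketch, using only the Python standard library, of what encoding and decoding a 2D point looks like. The helper names `point_to_wkb` and `point_from_wkb` are hypothetical and for illustration only; they are not part of DuckDB, spatial, or GeoPandas.

```python
import struct

WKB_POINT = 1  # geometry type code for a 2D Point in the WKB format


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1-byte order flag (1 = little endian), uint32 type code,
    # then the x and y coordinates as float64 values.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_from_wkb(blob: bytes) -> tuple[float, float]:
    """Decode a 2D point from WKB, honouring the byte-order flag."""
    endian = "<" if blob[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", blob[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"expected a WKB Point, got type code {geom_type}")
    return x, y


# Round trip: the bytes carry everything needed to rebuild the point,
# with no shared in-memory GEOS object involved.
blob = point_to_wkb(3.0, 4.0)
restored = point_from_wkb(blob)
```

In practice nobody would hand-roll this. On the GeoPandas side, `GeoSeries.to_wkb()` and `GeoSeries.from_wkb()` produce and consume these bytes, and on the DuckDB side spatial's `ST_GeomFromWKB` and `ST_AsWKB` do the same for BLOB columns. The cost is one copy and one parse per geometry in each direction, which is the trade-off against zero-copy scanning discussed above.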

kylebarron commented 2 months ago

I think it would be great if we could make this conversion modular: DuckDB to Arrow to GeoPandas and vice versa instead of custom-built DuckDB to GeoPandas. Especially if we're able to reuse an import version of https://github.com/duckdb/duckdb_spatial/issues/153. It's fine in the near term to still serialize through WKB and just attach the geoarrow.wkb metadata onto the column.

There's discussion towards implementing native interop between GeoPandas and GeoArrow in https://github.com/geopandas/geopandas/issues/3156. That will probably get implemented in the next few months but maybe after GeoPandas 1.0. Using GeoArrow natively inside GeoPandas will take longer.

> geopandas geometries store pointers into GEOS geometries in memory

Indeed, I don't believe GEOS objects are ABI stable, so you can't reliably share memory between Shapely and DuckDB spatial anyways.

FWIW I'm nearly done with an integration in the opposite direction: from DuckDB to GeoArrow/GeoPandas in Python. But having the default .arrow() expose the unstable GEOMETRY type makes things a bit harder.

Maxxen commented 2 months ago

@kylebarron That sounds great! I'd like to revisit #153 soon, even if support for "additional custom type metadata" is quite a bit away.