Open kentstephen opened 7 months ago
Hi! Thanks for opening this issue!
This is something I've been thinking about for a while, and it would be great to support some day. However, scanning the GDF "zero copy", the way DuckDB does with regular dataframes, might be difficult because (as far as I know) geopandas geometries store pointers to GEOS geometries in memory. The problem is that DuckDB bundles its own version of GEOS, so while we technically could just pass the pointers around, there's no guarantee that GEOS represents geometries the same way across versions or has a stable ABI.
So there would have to be a conversion step, most likely to/from WKB (or maybe geoarrow, not sure if that's natively supported in geopandas now), similar to the workaround I imagine you have going. We could definitely look at "hiding" this by adding support for converting automatically to DuckDB's BLOB type (which spatial can then ingest) inside the DuckDB Python client at some point in the future.
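To illustrate why WKB is a safe interchange format where raw GEOS pointers are not, here is a minimal standard-library sketch of the WKB layout for a 2D point: one byte-order flag, a 32-bit geometry type code, then two 64-bit floats. The helper names `point_to_wkb` and `point_from_wkb` are hypothetical and only cover points; real code would use Shapely or GeoPandas for this.

```python
import struct

WKB_POINT = 1  # geometry type code for a 2D Point in WKB


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Byte-order flag 1 = little-endian, then uint32 type code, then two float64 values
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_from_wkb(blob):
    """Decode a 2D point from WKB, honoring the byte-order flag."""
    endian = "<" if blob[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", blob[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"expected a Point (type 1), got type {geom_type}")
    return x, y


blob = point_to_wkb(1.0, 2.0)
print(blob.hex())            # 0101000000000000000000f03f0000000000000040
print(point_from_wkb(blob))  # (1.0, 2.0)
```

Because the bytes are fully specified by the format, any producer and consumer can exchange them regardless of which GEOS version each one links against.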
I think it would be great if we could make this conversion modular: DuckDB to Arrow to GeoPandas and vice versa instead of custom-built DuckDB to GeoPandas. Especially if we're able to reuse an import version of https://github.com/duckdb/duckdb_spatial/issues/153. It's fine in the near term to still serialize through WKB and just attach the geoarrow.wkb metadata onto the column.
There's discussion towards implementing native interop between GeoPandas and GeoArrow in https://github.com/geopandas/geopandas/issues/3156. That will probably get implemented in the next few months but maybe after GeoPandas 1.0. Using GeoArrow natively inside GeoPandas will take longer.
> geopandas geometries store pointers into GEOS geometries in memory
Indeed, I don't believe GEOS objects are ABI stable, so you can't reliably share memory between Shapely and DuckDB spatial anyways.
FWIW I'm nearly done with an integration in the opposite direction: from DuckDB to GeoArrow/GeoPandas in Python. But having the default .arrow() expose the unstable GEOMETRY type makes things a bit harder.
@kylebarron That sounds great! I'd like to revisit #153 soon, even if support for "additional custom type metadata" is quite a bit away.
In case anyone is visiting this trying to get a GDF into duckdb, I've found the following works:
    import duckdb

    # Serialize geometries to WKT strings so DuckDB sees a plain text column
    gdf['geometry'] = gdf['geometry'].to_wkt()
    out = duckdb.sql("""
        SELECT
            *
        FROM gdf
    """).df()
Then if you want to cast back into a GDF:
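    import geopandas as gpd
    # Parse the WKT strings back into geometries (WKT carries no CRS; pass crs= to from_wkt if needed)
    gdf['geometry'] = gpd.GeoSeries.from_wkt(gdf['geometry'])
    gdf = gpd.GeoDataFrame(gdf)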
I know DuckDB works well with Pandas DataFrames, but I hope for the ability to write SQL the same way on GDFs. As of now, when I write SQL on GDFs it returns this error:
NotImplementedException: Not implemented Error: Data type 'geometry' not recognized
This is with spatial installed and loaded. There is a workaround, but it's kind of wonky. Thank you for your time.