marklit opened 1 year ago
Yeah, it is not supported yet. We already have our own Parquet reader extension in DuckDB, and we are looking into how to integrate it with this extension in a natural way. I haven't tested it, but you may be able to use the parquet extension to load the GeoParquet file and simply convert the WKB binary columns into geometries using ST_GeomFromWKB.
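A minimal sketch of that workaround (the file name 'data.parquet' and the geometry column name 'geometry' are assumptions; GeoParquet files usually store WKB in a column called geometry):

```sql
-- Load the spatial extension so ST_GeomFromWKB is available.
INSTALL spatial;
LOAD spatial;

-- Read the GeoParquet file as plain Parquet and convert the WKB
-- blob column into a native GEOMETRY column.
SELECT ST_GeomFromWKB(geometry) AS geom,
       * EXCLUDE (geometry)
FROM read_parquet('data.parquet');
```

Note that this recovers the geometries but not the file-level GeoParquet metadata (CRS and friends), which is the limitation discussed below.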
You can see the supported drivers using SELECT * FROM st_list_drivers()
I should explain: It is not supported because we don't bundle the Arrow library (which provides the parquet driver)
I haven't tested it, but you should maybe be able to use the parquet extension to load the gpq and simply convert the wkb binary columns into geometries using ST_GeomFromWKB
Just confirming that this works, and in fact works really well. The issue, I guess, is that while you can process your GeoParquet file, you just can't save it back to GeoParquet. The good news is that some external libraries might treat it as GeoParquet anyway, though you might lose some information, like the CRS. I tested this with the R package geoarrow, and read_geoparquet_sf will read the file exported from DuckDB fine; it just doesn't keep the CRS.
Just wanted to echo @rdenham's comment that it's a real shame to lose all the metadata, especially the CRS, this way.
This is actually very awkward for designing applications where we care about coordinate reference systems and can't anticipate them ahead of time. It's also very confusing to users, who can't easily figure out why DuckDB spatial cannot use st_read_meta on the one spatial vector format that seems 'most native' to DuckDB.
Would it be possible to somehow modify the behavior of st_read_meta so that it could use GDAL for that purpose when reading a geoparquet file?
I'll just share that native GeoParquet support is planned to be the next big feature I work on for spatial; I'm just going to wrap up some refactoring and documentation work first!
@Maxxen Just to get an idea, do you have an estimated time for when GeoParquet/GeoArrow would be supported in the spatial extension?
I currently have basic writing and reading working, with the "bbox" and "geometry_types" fields in the metadata being properly populated, but CRS handling is blocked since we can't store projection information in the geometry type itself yet, and that's a more involved feature for the future, as it is going to require a lot of changes in DuckDB core. Although you can (even today) access the GeoParquet metadata using DuckDB's existing parquet_kv_metadata('path') function.
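To illustrate, here is a sketch of pulling the GeoParquet metadata out with parquet_kv_metadata (the file name is a placeholder; GeoParquet stores its metadata as JSON under the 'geo' key, and the key/value columns come back as BLOBs, hence the decode calls):

```sql
-- Extract the GeoParquet 'geo' metadata entry as readable JSON text.
SELECT decode(value) AS geo_json
FROM parquet_kv_metadata('data.parquet')
WHERE decode(key) = 'geo';
```

The returned JSON includes the CRS, bbox, and geometry_types for each geometry column, so applications that need the CRS can recover it this way even before native support lands.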
However, we are currently busy preparing for the next version of DuckDB, scheduled to be released in two weeks, and I don't think my changes so far are going to make it in until then, as there are more pressing PRs and bugfixes to get in. I'll post an update in this thread once I've got initial GeoParquet support available on nightly.
Hey, any update on this?
Also interested! We are using geoparquet pervasively at @onefact with our campaigns: https://www.payless.health/payless.health-linknyc-campaign.jpg and geospatial work (https://onefact.github.io/new-york-real-estate/ is one example).
I don't know what Max's plans are but last month I saw a lot of activity around projects trying to add native geometry, spatial indices, more spatial-centric storage to Parquet, ORC, etc... and GPU-friendliness.
If any of the above turns into code and formal specifications at some point, there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.
there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.
To be clear, the upcoming GeoParquet 1.1 includes native support for spatial partitioning
So, in essence, GeoParquet 1.1 would provide a native spatial index?
I think that's overselling it a bit, but in essence you get bounding box statistics per row group that potentially allow you to skip scanning entire groups, provided the parquet file is created so that the rows are spatially correlated.
For DuckDB that means you would have to sort, and provide the expected bounds up front, or do another pass over all the input data to calculate the extent first.
Right, I'd argue the difference lies in spatial "indexing" vs "partitioning", where I consider indexing to mean that the bounding box of every row is known, whereas partitioning means that the bounding box of each chunk is known.
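The two-pass approach described above could be sketched roughly like this (assuming a spatial build that provides ST_Hilbert and the ST_Extent_Agg aggregate; the table and column names are hypothetical):

```sql
LOAD spatial;

-- Pass 1 computes the dataset extent; pass 2 sorts rows along a
-- Hilbert curve within that extent, so consecutive rows (and hence
-- Parquet row groups) end up spatially correlated before writing.
COPY (
    SELECT t.*
    FROM places t
    ORDER BY ST_Hilbert(t.geom, (SELECT ST_Extent_Agg(geom) FROM places))
) TO 'sorted.parquet' (FORMAT parquet);
```

With the rows ordered this way, the per-row-group bounding box statistics from GeoParquet 1.1 become useful, since a spatial filter can then skip most row groups.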
And for the record, @Maxxen, support for GeoParquet 1.1 is coming to DuckDB soon, right?
Here's the PR for part 1: Minimal GeoParquet 1.0 support. When the spatial extension is installed and loaded, reading from a GeoParquet file through DuckDB's normal Parquet functionality will now automatically convert to GEOMETRY. There's also a new GeoParquet copy format that will WKB-encode GEOMETRY columns automatically and write the 2D bbox and geometry_types column-level GeoParquet metadata.
https://github.com/duckdb/duckdb/pull/12503
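A sketch of what the behavior described in the PR looks like from SQL (the file names are placeholders, and the exact options may differ in the final release):

```sql
-- With spatial installed and loaded, reading a GeoParquet file
-- through the normal Parquet path yields GEOMETRY columns directly,
-- with no manual ST_GeomFromWKB step.
LOAD spatial;

SELECT * FROM 'places.parquet';

-- Writing back: COPY WKB-encodes GEOMETRY columns and emits the
-- bbox / geometry_types GeoParquet metadata automatically.
COPY (SELECT * FROM 'places.parquet')
TO 'out.parquet' (FORMAT parquet);
```

This closes the round-trip gap discussed earlier in the thread, apart from the CRS, which still depends on storing projection information in the geometry type itself.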
There are a bunch of design decisions for handling the cross-extension dependencies here that I expect I'll receive a lot of feedback on, but once that gets resolved, moving on to supporting 1.1 should be relatively straightforward.
This extension was compiled with GDAL 3.6.3, which has support for GeoParquet (it was added in 3.5.1). Any idea why it states the format is unsupported?
The file itself looks fine.