duckdb / duckdb_spatial

GeoParquet Support? #27

Open marklit opened 1 year ago

marklit commented 1 year ago

This extension was compiled with GDAL 3.6.3, which has support for GeoParquet (it was added in 3.5.1). Any idea why it reports the format as unsupported?

$ /Volumes/Seagate/duckdb_spatial/build/debug/duckdb -unsigned test.duckdb
LOAD '/Volumes/Seagate/duckdb_spatial/build/debug/extension/spatial/spatial.duckdb_extension';
select * from st_read('/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq') limit 1;
ERROR 4: `/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq' not recognized as a supported file format.
Error: IO Error: Could not open file: /Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq (`/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq' not recognized as a supported file format.)

The file itself looks fine.

$ ogrinfo microsoft_roads_ai/Oceania_AUS.gpq
INFO: Open of `microsoft_roads_ai/Oceania_AUS.gpq'
      using driver `Parquet' successful.
1: Oceania_AUS (Line String)

Maxxen commented 1 year ago

Yeah it is not supported yet. We already have our own Parquet reader extension in DuckDB and we are looking into how to integrate that with this extension in a natural way. I haven't tested it, but you should maybe be able to use the parquet extension to load the gpq and simply convert the wkb binary columns into geometries using ST_GeomFromWKB

You can see the supported drivers using SELECT * FROM st_list_drivers()
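For anyone wondering what those WKB binary columns actually contain, here is a minimal sketch of decoding a 2D WKB LineString by hand. It is not how ST_GeomFromWKB is implemented; the helper name `decode_wkb_linestring` and the sample coordinates are made up for illustration, and only plain 2D LineStrings are handled.

```python
import struct


def decode_wkb_linestring(wkb: bytes) -> list[tuple[float, float]]:
    """Decode a 2D WKB LineString into a list of (x, y) tuples."""
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    endian = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type code (2 = LineString in 2D WKB).
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 2:
        raise ValueError(f"expected LineString (2), got type {geom_type}")
    # Bytes 5-8 hold the point count, followed by x/y pairs of float64.
    (n_points,) = struct.unpack_from(endian + "I", wkb, 5)
    coords = struct.unpack_from(f"{endian}{2 * n_points}d", wkb, 9)
    return list(zip(coords[0::2], coords[1::2]))


# Hypothetical two-point line, packed here only to keep the example self-contained.
sample = struct.pack("<BII4d", 1, 2, 2, 151.2, -33.8, 151.3, -33.9)
points = decode_wkb_linestring(sample)
```

Note that nothing in the WKB payload itself carries a CRS, which is why projection information has to travel in the file-level metadata instead.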

Maxxen commented 1 year ago

I should explain: It is not supported because we don't bundle the Arrow library (which provides the parquet driver)

rdenham commented 1 year ago

I haven't tested it, but you should maybe be able to use the parquet extension to load the gpq and simply convert the wkb binary columns into geometries using ST_GeomFromWKB

Just confirming that this works, and in fact works really well. The issue, I guess, is that while you can process your geoparquet file, you can't save it back to geoparquet. The good news is that some external libraries might treat it as geoparquet anyway. You might lose some information, such as the CRS, though. I tested this in the R package geoarrow, and read_geoparquet_sf will read the file exported from duckdb fine; it just doesn't keep the CRS.

cboettig commented 8 months ago

Just wanted to echo @rdenham's comment that it's a real shame to lose all the metadata, especially the CRS, this way.

This is actually very awkward for designing applications where we care about coordinate reference systems and can't anticipate them ahead of time. It's also very confusing to users who can't easily figure out why duckdb spatial cannot use st_read_meta on the one spatial vector format that seems 'most native' to duckdb.

Would it be possible to somehow modify the behavior of st_read_meta so that it could use GDAL for that purpose when reading a geoparquet file?

Maxxen commented 8 months ago

I'll just share that native GeoParquet support is planned to be the next big feature I work on for spatial. I'm just going to wrap up some refactoring and documentation work first!

ncclementi commented 6 months ago

@Maxxen Just to get an idea, do you have an estimated time for when GeoParquet/GeoArrow will be supported in the spatial extension?

Maxxen commented 6 months ago

I currently have basic writing and reading working, with the "bbox" and "geometry_types" fields in the metadata being properly populated. CRS handling is blocked, though, since we can't store projection information in the geometry type itself yet. That's a more involved feature for the future, as it is going to require a lot of changes in the DuckDB core. That said, you can (even today) access the geoparquet metadata using DuckDB's existing parquet_kv_metadata('path') function.

However, we are currently busy preparing for the next version of DuckDB, scheduled to be released in two weeks, and I don't think my changes so far are going to make it in by then, as there are more pressing PRs and bugfixes to get in. I'll post an update in this thread once I have initial geoparquet support available on nightly.
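As a rough illustration of what that key-value metadata gives you: GeoParquet stores a JSON document under the `geo` key. The sketch below assumes you have already pulled that value out and decoded it to a string; the helper name `summarize_geo_metadata` and the sample values are hypothetical.

```python
import json


def summarize_geo_metadata(geo_json: str) -> dict:
    """Pull the commonly needed fields out of a GeoParquet 'geo' JSON string."""
    geo = json.loads(geo_json)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "version": geo.get("version"),
        "primary_column": primary,
        "encoding": column.get("encoding"),
        "geometry_types": column.get("geometry_types", []),
        "bbox": column.get("bbox"),
        # True only when the file explicitly carries a "crs" entry.
        "crs_declared": "crs" in column,
        "crs": column.get("crs"),
    }


# Hypothetical metadata value, shaped like the GeoParquet 1.0 "geo" document.
sample_value = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["LineString"],
            "bbox": [112.9, -43.7, 153.6, -10.6],
        }
    },
})
summary = summarize_geo_metadata(sample_value)
```

As far as I understand the GeoParquet spec, a missing `crs` entry means readers should assume OGC:CRS84, while an explicit null means the CRS is unknown, so `crs_declared` is worth checking separately from the value.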

ppasquet commented 5 months ago

Hey, any update on this?

jaanli commented 5 months ago

Also interested! We are using geoparquet pervasively at @onefact with our campaigns: https://www.payless.health/payless.health-linknyc-campaign.jpg and geospatial work (https://onefact.github.io/new-york-real-estate/ is one example).

marklit commented 5 months ago

I don't know what Max's plans are but last month I saw a lot of activity around projects trying to add native geometry, spatial indices, more spatial-centric storage to Parquet, ORC, etc... and GPU-friendliness.

If any of the above turns into code and formal specifications at some point, there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

kylebarron commented 5 months ago

there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

To be clear, the upcoming GeoParquet 1.1 includes native support for spatial partitioning

ppasquet commented 5 months ago

there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

To be clear, the upcoming GeoParquet 1.1 includes native support for spatial partitioning

So, in essence GeoParquet 1.1 would provide a native spatial index?

Maxxen commented 5 months ago

I think that's overselling it a bit, but in essence you get bounding box statistics per row group, which potentially allow you to skip scanning entire groups if the parquet file is created in such a way that the rows are spatially correlated.

For DuckDB that means you would have to sort, and provide the expected bounds up front, or do another pass over all the input data to calculate the extent first.

kylebarron commented 5 months ago

Right, I'd argue the difference lies in spatial "indexing" vs "partitioning", where I consider indexing to mean that the bounding box of every row is known, whereas partitioning means that the bounding box of each chunk is known.
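To make the partitioning idea concrete, here is a small sketch of skipping row groups by bounding box. It is a toy model, not DuckDB's or any Parquet reader's implementation: row groups are plain lists of points, the function names are hypothetical, and sorting by (x, y) stands in for whatever spatial ordering a real writer would use.

```python
BBox = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_bbox(points: list[tuple[float, float]]) -> BBox:
    """Bounding box of one row group, as a writer would record it."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def row_groups_to_scan(groups: list[list[tuple[float, float]]], query: BBox) -> list[int]:
    """Indices of the row groups whose bbox overlaps the query box."""
    return [i for i, pts in enumerate(groups) if intersects(group_bbox(pts), query)]


points = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (11.0, 11.0)]
query = (9.0, 9.0, 12.0, 12.0)

# Rows in arrival order: every group spans the whole extent, so nothing is skipped.
unsorted_groups = [points[0:2], points[2:4]]
scan_unsorted = row_groups_to_scan(unsorted_groups, query)

# Rows sorted so neighbours share a group: the first group can be skipped.
sorted_points = sorted(points)
sorted_groups = [sorted_points[0:2], sorted_points[2:4]]
scan_sorted = row_groups_to_scan(sorted_groups, query)
```

With the unsorted layout both groups have to be scanned; after sorting only the second one does. The filter never looks at individual rows, which is the "partitioning, not indexing" distinction.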

jatorre commented 5 months ago

And for the record, @Maxxen, support for geoparquet 1.1 is coming to duckdb soon, right?

Maxxen commented 5 months ago

Here's the PR for part 1: Minimal GeoParquet 1.0 support.

When the spatial extension is installed and loaded, reading from a geoparquet file through DuckDB's normal parquet functionality will now automatically convert to GEOMETRY. There's also a new GeoParquet copy format that will WKB-encode GEOMETRY columns automatically and write the 2D bbox and geometry_types column-level geoparquet metadata.

https://github.com/duckdb/duckdb/pull/12503

There's a bunch of design decisions for handling the cross-extension dependencies here that I expect I'll receive a lot of feedback on, but once that gets resolved, moving on to supporting 1.1 should be relatively straightforward.
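For readers curious what the 2D bbox and geometry_types column-level metadata mentioned above amount to, a minimal sketch follows. This is not the code in the PR; the function name `column_metadata`, the column name `geom`, and the sample rows are all hypothetical, and geometries are simplified to a type name plus a list of (x, y) coordinates.

```python
import json


def column_metadata(geometries: list[tuple[str, list[tuple[float, float]]]]) -> dict:
    """Build GeoParquet-style column metadata from (type_name, coords) pairs."""
    xs = [x for _, coords in geometries for x, _ in coords]
    ys = [y for _, coords in geometries for _, y in coords]
    return {
        "encoding": "WKB",
        # Distinct geometry type names seen in the column.
        "geometry_types": sorted({type_name for type_name, _ in geometries}),
        # 2D extent as [xmin, ymin, xmax, ymax].
        "bbox": [min(xs), min(ys), max(xs), max(ys)],
    }


# Hypothetical column contents.
rows = [
    ("LineString", [(0.0, 1.0), (2.0, 3.0)]),
    ("Point", [(-1.0, 5.0)]),
]
geo = {
    "version": "1.0.0",
    "primary_column": "geom",
    "columns": {"geom": column_metadata(rows)},
}
geo_value = json.dumps(geo)  # the string stored under the file's "geo" key
```

Computing the bbox needs a full pass over the data, which is why a writer has to either see every row before finalizing the metadata or be told the extent up front.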