Hello, I'm experimenting with DuckDB's spatial extension to union multiple large spatial datasets with polygon geometries. The datasets include a large number of different attributes, so the union query, which keeps all attributes from the input datasets, is quite long (see the attached union_query.txt).
My workflow is to import the GPKG layers directly into DuckDB, build and execute the union query to create a new table eca_layer, then export the result from DuckDB directly to GeoPackage format. So far, the union performance is very impressive given the size of the input layers, but I'd like to write the resulting eca_layer table to disk and visually QA it in QGIS.
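For context, the import side looks roughly like this (the file and layer names below are placeholders; the actual attribute handling is in the attached union query):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "eca.duckdb")
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# read each GPKG layer straight into a DuckDB table via the spatial extension
dbExecute(con, "CREATE TABLE layer_a AS SELECT * FROM ST_Read('layer_a.gpkg')")
dbExecute(con, "CREATE TABLE layer_b AS SELECT * FROM ST_Read('layer_b.gpkg')")

# the (long) union query from the attachment then builds eca_layer from these tables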
The main challenge I'm having now is exporting the table from DuckDB to GPKG. I've tried two ways to do this and encountered different problems with both:
1. Export from DuckDB directly to disk (this is highly preferable as the unioned dataset is quite large and complex)
2. Retrieve the result as a data frame in R, convert it to an sf object (which could take a long time given the dataset's size/complexity), and then write it to disk as a GPKG
Option 1: Export from DuckDB
I've tried the query below, but it writes rather slowly (I suspect this is due to the single-threaded write limitation mentioned in DuckDB's documentation), and it always stalls and freezes once the GPKG hits a certain size on disk (135,292 KB):
query_export <- "
COPY (SELECT * FROM eca_layer)
TO 'eca_layer.gpkg'
WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
"
dbExecute(con, query_export)
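One thing I still plan to try is exporting only a slice of the table, to see whether the freeze is tied to the overall volume or to particular rows (the row count below is arbitrary):

query_export_sample <- "
COPY (SELECT * FROM eca_layer LIMIT 10000)
TO 'eca_layer_sample.gpkg'
WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
"
dbExecute(con, query_export_sample)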
I have also tried writing to other formats such as SQLite and FlatGeobuf, without success: those queries don't create any file at all (the FlatGeobuf variant is sketched after the subset query below). And I have tried subsetting to rows with valid geometries before exporting, but this runs into the same problem as above:
query_subset <- "
COPY (
SELECT *
FROM eca_layer
WHERE ST_IsValid(geometry) = TRUE
)
TO 'eca_layer.gpkg'
WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
"
dbExecute(con, query_subset)
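For reference, the FlatGeobuf attempt mentioned above was essentially the same COPY with the driver swapped, roughly:

query_export_fgb <- "
COPY (SELECT * FROM eca_layer)
TO 'eca_layer.fgb'
WITH (FORMAT 'GDAL', DRIVER 'FlatGeobuf', SRS '3005')
"
dbExecute(con, query_export_fgb)
# no output file is ever created on disk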
Option 2: Process in R and Export
I've noticed that DuckDB returns the query result as a data frame, with the geometry column formatted in what appears to be WKB (by the way, does the WKB returned by DuckDB follow the format specified by GEOS?). My next step is to convert the data frame into an sf object and write it out as a GPKG, but this again fails with different errors using the code below:
result <- dbGetQuery(con, "SELECT * FROM eca_layer")
result_sf1 <- st_as_sf(result[1, ], wkt = "geometry")
# returns error:
# OGR: Unsupported geometry type
# Error: OGR error
result_sf2 <- result[1, ] |>
mutate(geometry = st_as_sfc(wkt = "geometry", crs = 3005)) |>
st_as_sf()
# returns error:
# Error in `mutate()`:
# ℹ In argument: `geometry = st_as_sfc(wkt = "geometry", crs = 3005)`.
# Caused by error in `st_as_sfc.character()`:
# ! argument "x" is missing, with no default
# query to check that geometry column is present in result:
# dbGetQuery(con, "SELECT column_name FROM information_schema.columns WHERE table_name = 'eca_layer'")
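If I do end up going the R route, my rough (untested at scale) plan would be the following, assuming the geometry can be pulled out as standard WKB via ST_AsWKB, which is part of what I'm unsure about:

result <- dbGetQuery(con, "
  SELECT * EXCLUDE (geometry), ST_AsWKB(geometry) AS geometry
  FROM eca_layer
")
# treat the blob column as a list of WKB raw vectors and let sf parse it
result$geometry <- sf::st_as_sfc(structure(result$geometry, class = "WKB"), crs = 3005)
result_sf <- sf::st_as_sf(result)
sf::st_write(result_sf, "eca_layer.gpkg")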
Ultimately I'd really love to get Option 1 working, but SQL is not my strong suit and I'm unsure how to troubleshoot DuckDB further. Any help or suggestions would be much appreciated at this stage!