duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/

Issues exporting large spatial table to GeoPackage format #174

Open bcaradima opened 6 months ago

bcaradima commented 6 months ago

Hello, I'm experimenting with DuckDB's spatial extension to union multiple large spatial datasets with polygon geometries. The datasets include a large number of different attributes, which leads to a rather long union query, since I'd like to keep all attributes from the input datasets (see attached).

union_query.txt

My workflow is to import the GPKG layers directly into DuckDB, build and execute the union query to create a new table eca_layer, then export the result from DuckDB directly to GeoPackage format. So far, the union performance is very impressive given the size of the input data layers, but I'd like to write the resulting eca_layer table to disk and visually QA it in QGIS.

The main challenge I'm having now is exporting the table from DuckDB to GPKG. I've tried two ways to do this and encountered different problems with both:

  1. Export from DuckDB directly to disk (this is highly preferable as the unionized dataset is quite large and complex)
  2. Retrieve the result as a data frame in R, convert to sf object (could take a long time due to dataset size/complexity), and then write to disk as GPKG

Option 1: Export from DuckDB

I've tried the query below, but it writes rather slowly (I suspect this is due to the single-threaded limitation of DuckDB's write functionality mentioned in the documentation), and it always stops and freezes once the GPKG reaches a certain size on disk (135,292 KB):

query_export <- "
COPY (SELECT * FROM eca_layer)
TO 'eca_layer.gpkg'
WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
"

dbExecute(con, query_export)
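
In case it helps narrow things down, here is an untested sketch I've been considering: export in fixed-size row chunks, so that a freeze can be tied to a specific range of rows rather than the whole table. The chunk size of 50,000 is an arbitrary assumption.

```r
library(DBI)

# Untested sketch: export eca_layer in row chunks to locate where COPY stalls.
# The chunk size (50,000) is an arbitrary guess; adjust as needed.
n <- dbGetQuery(con, "SELECT count(*) AS n FROM eca_layer")$n
chunk <- 50000L
for (offset in seq(0L, n - 1L, by = chunk)) {
  dbExecute(con, sprintf("
    COPY (SELECT * FROM eca_layer LIMIT %d OFFSET %d)
    TO 'eca_layer_%06d.gpkg'
    WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
  ", chunk, offset, offset))
}
```

If one chunk consistently hangs while its neighbors succeed, that would point at specific problem rows rather than an overall size limit.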

I have tried writing to other formats such as SQLite and FlatGeobuf without success (those queries don't create any file at all!). I have also tried subsetting rows with valid geometries before exporting, but this runs into the same problem as above:

query_subset <- "
COPY (
SELECT *
FROM eca_layer
WHERE ST_IsValid(geometry) = TRUE
)
TO 'eca_layer.gpkg'
WITH (FORMAT 'GDAL', DRIVER 'GPKG', SRS '3005')
"

dbExecute(con, query_subset)
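
Another diagnostic I thought of (untested, and the hypothesis that a mixed or unsupported geometry type trips the GDAL writer is just a guess): tabulate the geometry types and invalid-geometry counts in the table before exporting. ST_GeometryType and ST_IsValid come from the spatial extension.

```r
library(DBI)

# Untested sketch: count rows per geometry type, plus invalid geometries,
# on the (unconfirmed) guess that a mixed or unsupported type breaks export.
dbGetQuery(con, "
  SELECT ST_GeometryType(geometry) AS gtype,
         count(*) AS n,
         count(*) FILTER (WHERE NOT ST_IsValid(geometry)) AS n_invalid
  FROM eca_layer
  GROUP BY gtype
  ORDER BY n DESC
")
```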

Option 2: Process in R and Export

I've noticed that DuckDB returns the result as a data frame, with the geometry column in what appears to be WKB (by the way, does the WKB returned by DuckDB use the format specified by GEOS?). My next step is to convert the data frame into an sf object and write it out as GPKG, but this again returns errors with the code below:

result <- dbGetQuery(con, "SELECT * FROM eca_layer")

result_sf1 <- st_as_sf(result[1, ], wkt = "geometry")
# returns error: 
# OGR: Unsupported geometry type
# Error: OGR error

result_sf2 <- result[1, ] |>
    mutate(geometry = st_as_sfc(wkt = "geometry", crs = 3005)) |>
    st_as_sf()
# returns error:
# Error in `mutate()`:
# ℹ In argument: `geometry = st_as_sfc(wkt = "geometry", crs = 3005)`.
# Caused by error in `st_as_sfc.character()`:
# ! argument "x" is missing, with no default

# query to check that geometry column is present in result:
# dbGetQuery(con, "SELECT column_name FROM information_schema.columns WHERE table_name = 'eca_layer'")

Ultimately, I'd really love to get option 1 working, but SQL is not my strong suit and I'm unsure how to troubleshoot DuckDB further. Any help or suggestions would be much appreciated at this stage!

krlmlr commented 6 months ago

Thanks. I can't really help with option 2. For option 1, does the following work:

CREATE TABLE eca_layer_mat AS SELECT * FROM eca_layer

?