Open sehHeiden opened 4 weeks ago
Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it.
Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:
```sql
SELECT ...
If this is a false positive, feel free to disregard this comment.
Hi!
DuckDB spatial (like almost all geospatial libraries) uses GDAL to write FGB files. We don't know where, how and when GDAL creates temporary files so it's not really feasible for us to clean them up after a crash.
With that said I also don't think you will see any meaningful differences between DuckDBs FGB output performance compared to other systems, as we all use GDAL under the hood. In essence, you'd only be benchmarking how fast DuckDB can convert geometries into GDAL features, but thats not making the most out of duckdb as it is a single-threaded row-by-row process. I'd actually expect e.g. Geopandas to be a lot faster at doing that now because they can pass geoarrow arrays directly to GDAL while we still construct the geometries one at a time.
I want to make clear, that these temporary files are neither deleted upon crash nor on happy path of hyperfine.
But in the end it should not matter, as when hyperfine crashes on the x-th run, than the python code closed for each previous run successfully.
To the off-top of profiling, fgb. As opening and saving files are an integral part of the analysis, I need to be profiled two. I only used fgb, as I could not write to gpkg. I also wrote to geoparquet and that was really fast, I think about 30 % even faster than GeoPandas.
What happens?
I tried to benchmark the spatial Plugin.
With hyperfine I did 13 repetitions.
Each creating a fgb-file. The open space on drive C: was reduced heavily. Benchmark crashed, when no save left.
Found out that on the Temp dir a copy of each output run was stored und never freed.
See Code:
https://github.com/sehHeiden/geospeed/blob/master/geospeed/duckdb_speed.py
To Reproduce
"""Test the execution speed for an equivalent of Geopandas Overlay with intersection using a spatial index."""
import time from pathlib import Path
import duckdb
save_name = None con = duckdb.connect(save_name if save_name else ":memory:", config={"threads": 15, "memory_limit": "20GB"}) con.install_extension("spatial") con.load_extension("spatial")
con.sql("DROP TABLE IF EXISTS buildings") con.sql("DROP TABLE IF EXISTS parcels") con.sql("DROP TABLE IF EXISTS buildings_intersection") con.sql("DROP VIEW IF EXISTS intersections")
Check supported formats for writing
formats = con.sql("SELECT short_name FROM ST_Drivers() WHERE can_create;").fetchall() print(f"Writable formats: {formats}") start = time.time()
shapefile_dir = Path("./ALKIS") # Base directory building_files = list(shapefile_dir.glob("/GebauedeBauwerk.shp")) # Glob pattern for the subdirectories parcel_files = list(shapefile_dir.glob("/NutzungFlurstueck.shp")) # Glob pattern for the subdirectories
Error handling if no files are found
if not building_files or not parcel_files: load_error_txt = "No shapefiles found in the provided directory." raise FileNotFoundError(load_error_txt)
Create then insert
con.sql(f"CREATE TABLE buildings AS SELECT FROM ST_Read('{building_files[0].resolve()!s}');") # noqa: S608 con.sql(f"CREATE TABLE parcels AS SELECT FROM ST_Read('{parcel_files[0].resolve()!s}');") # noqa: S608 con.execute( """PREPARE insert_buildings_stmt AS INSERT INTO buildings SELECT FROM ST_Read($1) WHERE oid NOT IN (SELECT oid FROM parcels);""" ) con.execute( """PREPARE insert_parcels_stmt AS INSERT INTO parcels SELECT FROM ST_Read($1) WHERE oid NOT IN(SELECT oid FROM parcels);""" )
Iterate over the found shapefiles and load them into DuckDB
for building_file, parcel_file in zip(building_files[1:], parcel_files[1:], strict=True):
Insert into the existing table for subsequent shapefiles
Make the data valid
Make geometries valid
con.sql(""" UPDATE buildings SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom); UPDATE parcels SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom); """)
create indexes
con.sql("CREATE INDEX buildings_idx ON buildings USING RTREE (geom);") con.sql("CREATE INDEX parcels_idx ON parcels USING RTREE (geom);") print(f"DuckDB: Loading data takes: {(time.time() - start):.0f} s.")
Perform intersection
time_intersection = time.time() con.sql(""" CREATE TABLE buildings_intersection AS SELECT ST_Intersection(buildings.geom, parcels.geom) as geom, buildings.oid AS building_oid, parcels.oid AS parcel_oid FROM buildings, parcels WHERE ST_Intersects(buildings.geom, parcels.geom); """)
Drop the indexes and unnecessary columns
con.sql(""" DROP INDEX buildings_idx; DROP INDEX parcels_idx; ALTER TABLE buildings DROP COLUMN geom; ALTER TABLE parcels DROP COLUMN geom; """)
Create final intersections table
con.sql(""" CREATE VIEW intersections AS SELECT * FROM buildings_intersection AS bi JOIN buildings AS bs ON bi.building_oid = bs.oid JOIN parcels AS ps ON bi.parcel_oid = ps.oid; """) con.sql("""UPDATE buildings_intersection SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);""") print(f"DuckDB: Intersection takes: {(time.time() - time_intersection):.0f} s.")
if not save_name:
Save the result to a file
print(f"DuckDB: Total duration: {(time.time() - start):.0f} s.")
OS:
Windows
DuckDB Version:
1.1
DuckDB Client:
Python
Hardware:
30 GB free on C:, Ryzen 5800x 40 GB RAM
Full Name:
Sebastian Heiden
Affiliation:
Delphi-IMM, But this usage of DuckDB is private
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?