Open johngcrowley opened 1 week ago
Hi! Happy to hear DuckDB spatial is useful for you and thanks for opening this issue!
R-Tree indexes are not used for spatial joins yet, only when doing spatial filters, so it makes sense that you wouldn't see it used in the plan. Adding support for joins is of course high on the TODO list but I can't provide an exact timeline.
All the functions you use, ST_ContainsProperly
, ST_Contains
, ST_Within
ST_Centroid
etc. are all implemented by calling into the GEOS library, which will allocate memory outside of DuckDB's control. This is usually the reason why larger spatial joins/queries fail, but since you seem to be well within the memory bounds and the process doesnt get killed by the oom-killer, there might be a bug somewhere in either our code or GEOS. Is there any way you could share the data? Alternatively try to isolate which geometries caused the crash (e.g. by recursively dividing the datasets in half and still see if the segmentation fault occurs, and if so, in which half)?
If you convert and store the geometries into a table first (so that you don't have to call ST_GeomFromHEXEWKB in the main query), does it still crash?
@Maxxen, thank you for your reply. Unfortunately, I can not share the proprietary data, but the suggestions were right-on:
ST_GeomFromHEXEWKB()
column so i was only performing an st_within()
, not calcuating the geometry
as well. I was able to complete this query where the larger table was 152M rows and the smaller table was 1.5M. The maximum RAM usage was 40GiB. The query completed successfully! Thank you.Yet, in another query, using the same data from the first query, I'm trying to perform a spatial join between a 142M row subset table and a 3M row subset table. Same geometries. Using the same Spatial extension functions as the first, passing query:
select ....
from 152_M_row_table large
join 3_M_row_table small
on st_within(geom_centroid, geom)
and it is using up to 57GiB RAM, but eventually hitting the initial segmentation fault
.
I then tried dialing down the precision of my spatial join with a:
select ....
from 152_M_row_table large
join 3_M_row_table small
on st_intersects(geom_centroid, geom_envelope)
geometry
is just the calculated, st_envelope
(minimum bounding box) valuesegmentation fault
. Realizing I only really care how records match to the larger table, I changed the join
to a left join
, which has now shifted from an IE Join
to an NL Blockwise Join
, even with set prefer_range_joins=true;
. Good news: the query plans total rows are the same, but I'm only using up to 9GiB of RAM, now. Not sure what the total run time will be, now, but I'll update here.
I really appreciate you taking the time to answer my questions, and the work you're doing to improve the Spatial extension.
Looking at it more:
join
filter for matching on state
, the query planner winnows to only try joining geometries that first match on state. However, when i try to replicate that filter condition (on state
) in DuckDB, it explodes the join into trillions of rows in the explain
.
Problem
Good morning, again!
When performing an
ST_ContainsProperly
orST_Within
of a centroid in another boundary (the centroids of a 152M record table compared against the large township boundaries of only a 2M record table) -- I'm getting asegmentation fault
.As an aside, I want to say that we are very close to getting DuckDB to work for processing all our spatial data. This spatial join is the last blocker. DuckDB is performing faster than PostGIS. The work done here has been a blessing. I am grateful for everyone here at DuckDB!
Traceback
Steps to re-create
Running this query against a row table stored in
.duckdb
file, locally.geom
column of the large, 152M record table,parcel_batch_1
.segmentation fault
.Initial Query Plan
Alternate Attempt (no
segmentation fault
, but query hangs at 50%)centroid
column onparcel_batch_1
so thest_centroid
needn't be calculated in thejoin
condition, and materialized the Parquet file as a table. I added an R-Tree index on bothparcel_batch_1.centroid
and thegeom
column of the township (formerly the Parquet file) table. Still, no usage in query plan.ST_ContainsProperly
toST_Contains
, as well.Environment:
My Machine: