duckdb / duckdb_spatial

MIT License
483 stars 37 forks source link

`segmentation fault` in `st_contains` spatial join #438

Open johngcrowley opened 1 week ago

johngcrowley commented 1 week ago

Problem

Good morning, again!

When performing an ST_ContainsProperly or ST_Within of a centroid in another boundary (the centroids of a 152M record table compared against the large township boundaries of only a 2M record table) -- I'm getting a segmentation fault.

As an aside, I want to say that we are very close to getting DuckDB to work for processing all our spatial data. This spatial join is the last blocker. DuckDB is performing faster than PostGIS. The work done here has been a blessing. I am grateful for everyone here at DuckDB!

Traceback

segmentation fault

Steps to re-create

Running this query against a row table stored in .duckdb file, locally.

          select 
                 b.row_id,
                 t.township
          from parcel_batch_1 b
          join './section_township_range.parquet' t
          on st_containsproperly(st_centroid(b.geom), st_geomfromhexewkb(t.geom));

Initial Query Plan

┌─────────────────────────────┐                                                                                                                                                              
│┌───────────────────────────┐│                                                                                                                                                              
││       Physical Plan       ││                                                                                                                                                              
│└───────────────────────────┘│                                                                                                                                                              
└─────────────────────────────┘                                                                                                                                                              
┌───────────────────────────┐                                                                                                                                                                
│         PROJECTION        │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│            row_id         │                                                                                                                                                                                                                                                                                                                            
│          township         │                                                                                                                                                                                                                                                                                                                              
│                           │                                                                                                                                                                
│       ~25000000 Rows      │                                                                                                                                                                
└─────────────┬─────────────┘                                                                                                                                                                
┌─────────────┴─────────────┐                                                                                                                                                                
│           FILTER          │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│    ST_ContainsProperly    │                                                                                                                                                                
│    (ST_Centroid(geom),    │                                                                                                                                                                
│  ST_GeomFromHEXEWKB(geom))│                                                                                                                                                                
│                           │                                                                                                                                                                
│       ~25000000 Rows      │                                                                                                                                                                
└─────────────┬─────────────┘                                                                                                                                                                
┌─────────────┴─────────────┐                                                                                                                                                                
│          IE_JOIN          │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│      Join Type: INNER     │             
│                           │                                                                                                                                                                
│        Conditions:        │                                                                                                                                                                
│     ST_XMin(ST_Extent     │                                                                                                                                                                
│  (ST_Centroid(geom))) <=  │                                                                                                                                                                
│      ST_XMax(ST_Extent    │                                                                                                                                                                
│(ST_GeomFromHEXEWKB(geom)))│                                                                                                                                                                
│     ST_XMax(ST_Extent     │                                                                                                                                                                
│  (ST_Centroid(geom))) >=  │                                                                                                                                                                
│      ST_XMin(ST_Extent    ├──────────────┐                                                                                                                                                 
│(ST_GeomFromHEXEWKB(geom)))│              │                                                                                                                                                 
│     ST_YMin(ST_Extent     │              │                                                                                                                                                 
│  (ST_Centroid(geom))) <=  │              │                                                                                                                                                 
│      ST_YMax(ST_Extent    │              │
│(ST_GeomFromHEXEWKB(geom)))│              │
│     ST_YMax(ST_Extent     │              │
│  (ST_Centroid(geom))) >=  │              │
│      ST_YMin(ST_Extent    │              │
│(ST_GeomFromHEXEWKB(geom)))│              │
│                           │              │
│       ~25000000 Rows      │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││       PARQUET_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│       parcel_batch_1      ││         Function:         │
│                           ││        PARQUET_SCAN       │
│        Projections:       ││                           │
│            geom           ││        Projections:       │
│            row_id         ││            geom           │
│                           ││          township         │
│                           ││                           │
│       ~25000000 Rows      ││       ~2379868 Rows       │
└───────────────────────────┘└───────────────────────────┘

Alternate Attempt (no segmentation fault, but query hangs at 50%)

Environment:

 % uname -a
Linux john-XPS-15-9520 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

 % duckdb --version
v1.1.2 f680b7d08f

My Machine:

Maxxen commented 1 week ago

Hi! Happy to hear DuckDB spatial is useful for you and thanks for opening this issue!

R-Tree indexes are not used for spatial joins yet, only when doing spatial filters, so it makes sense that you wouldn't see it used in the plan. Adding support for joins is of course high on the TODO list but I can't provide an exact timeline.

All the functions you use, ST_ContainsProperly, ST_Contains, ST_Within ST_Centroid etc. are all implemented by calling into the GEOS library, which will allocate memory outside of DuckDB's control. This is usually the reason why larger spatial joins/queries fail, but since you seem to be well within the memory bounds and the process doesnt get killed by the oom-killer, there might be a bug somewhere in either our code or GEOS. Is there any way you could share the data? Alternatively try to isolate which geometries caused the crash (e.g. by recursively dividing the datasets in half and still see if the segmentation fault occurs, and if so, in which half)?

If you convert and store the geometries into a table first (so that you don't have to call ST_GeomFromHEXEWKB in the main query), does it still crash?

johngcrowley commented 4 days ago

@Maxxen, thank you for your reply. Unfortunately, I can not share the proprietary data, but the suggestions were right-on:


Yet, in another query, using the same data from the first query, I'm trying to perform a spatial join between a 142M row subset table and a 3M row subset table. Same geometries. Using the same Spatial extension functions as the first, passing query:

select ....
from 152_M_row_table large
join 3_M_row_table small
on st_within(geom_centroid, geom)

and it is using up to 57GiB RAM, but eventually hitting the initial segmentation fault.

I then tried dialing down the precision of my spatial join with a:

select ....
from 152_M_row_table large
join 3_M_row_table small
on st_intersects(geom_centroid, geom_envelope) 

Realizing I only really care how records match to the larger table, I changed the join to a left join, which has now shifted from an IE Join to an NL Blockwise Join, even with set prefer_range_joins=true;. Good news: the query plans total rows are the same, but I'm only using up to 9GiB of RAM, now. Not sure what the total run time will be, now, but I'll update here.

I really appreciate you taking the time to answer my questions, and the work you're doing to improve the Spatial extension.

johngcrowley commented 3 days ago

Looking at it more: