duckdb / duckdb_spatial

ST_CONTAINS in WHERE statement >5x slower than bounding box predicate #364

Closed. marklit closed this issue 4 months ago.

marklit commented 4 months ago

I'm running v0.10.2 1601d94f94 on Ubuntu for Windows.

Each of the four iterations of the loop below takes around two hours to complete.

for RELEASE in 2024-06-13-beta.1 \
               2024-06-13-beta.0 \
               2024-05-16-beta.0 \
               2024-04-16-beta.0; do
    echo $RELEASE
    WORKING_DB="working.`uuidgen`.duckdb"
    echo "COPY (
              SELECT * EXCLUDE (geometry,
                                bbox,
                                names,
                                sources),
                     ST_GEOMFROMWKB(geometry) geom,
                     JSON(bbox)    AS bbox,
                     JSON(names)   AS names,
                     JSON(sources) AS sources
              FROM read_parquet('s3://overturemaps-us-west-2/release/$RELEASE/theme=buildings/type=building/*',
                                filename=true,
                                hive_partitioning=1)
              WHERE bbox.xmin BETWEEN 139.45 AND 140.05
              AND   bbox.ymin BETWEEN  35.11 AND 35.709999999999994
          )
          TO 'tokyo.$RELEASE.gpkg'
              WITH (FORMAT GDAL,
                    DRIVER 'GPKG',
                    LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');" \
            | ~/duckdb -unsigned $WORKING_DB
    rm $WORKING_DB $WORKING_DB.wal || true
done

If the WHERE clause uses the following instead, a single iteration takes over 10 hours (I'm not sure exactly how long because I cancelled the operation).

ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3),
            ST_POINT(bbox.xmin, bbox.ymin))

I know this compares against a buffered radius rather than a box, but I'm still surprised by the size of the performance difference. During the operation I noticed my internet connection maxing out at 500 Mbps, so I'm wondering whether the entire geometry field has to be downloaded for every record rather than just the matches. Hundreds of files are being scanned but only one contains any matching records, so I wouldn't expect bandwidth to be the bottleneck.

Maxxen commented 4 months ago

Hello! Thanks for opening this issue!

This is expected behavior. The reason it's so much faster to query by bounding box is that the Parquet file contains "statistics" for the bbox struct column: Parquet files are divided into "row groups" of roughly 100,000 rows each, and for each row group the min/max value of each component of each column is stored separately. This enables DuckDB to "push down" some predicates and skip reading entire row groups, because a quick check against the row-group statistics before scanning tells it that the predicates (e.g. > and < in this case) can never be true for any row in that group.

Only basic predicates like range and equality comparisons can be pushed down, because the Parquet extension is not aware of geospatial properties. In general it can't figure out that ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3), ST_POINT(bbox.xmin, bbox.ymin)) implies ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3), ST_POINT(st_xmin(geom), st_ymin(geom))) for all geometries in the row group. Even if it could, that implication only holds for point geometries, and the geometry type is not part of the statistics, so we wouldn't be able to detect that without scanning everything anyway.
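For contrast with the sketch above (again a hypothetical query, not from the thread), running the same EXPLAIN with the spatial predicate should show the ST_CONTAINS expression evaluated as an ordinary filter after the scan, which means every row group, including its geometry column, still has to be read:

-- Hypothetical contrast: this predicate cannot be pushed into the scan,
-- so it is only applied after the rows have been fetched.
EXPLAIN
SELECT count(*)
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-06-13-beta.1/theme=buildings/type=building/*')
WHERE ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3),
                  ST_POINT(bbox.xmin, bbox.ymin));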

This means that in your case all the geometries have to be pulled down by DuckDB before it can start applying the filter indeed. In think a potential solution would be to include both the bbox range predicates and the buffer predicate in your query with another AND clause as I think we can push down only the bbox range part into the parquet scan to prune out row groups while still performing the buffer containment check in duckdb.