Closed marklit closed 4 months ago
Hello! Thanks for opening this issue!
This is expected behavior. The reason why its so much faster to query by bounding box is that the parquet file contains "statistics" for the box struct column - basically parquet files are divided into "row groups" every 100000~ rows or so, and for each row group the min/max value of each component of each column within that row group is stored separately. This enables DuckDB to "push down" some predicates and skip reading entire row groups by doing a quick check against the row group statistics before it starts scanning as it knows that the predicates (e.g. > and < in this case) will always be false.
Only basic predicates like range and equality comparisons can be pushed down, as the parquet extension is not aware of geospatial properties. It can't in general figure out that ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3), ST_POINT(bbox.xmin, bbox.ymin))
, implies that ST_CONTAINS(ST_BUFFER(ST_POINT(35.41, 139.75), .3), ST_POINT(st_xmin(geom), st_ymin(geom))
for all geometries in the row group. Even if it could, in this case the implication only holds true for point geometries, but the type of the geometry is not part of the statistics so we wouldn't be able to detect that without scanning everything anyway.
This means that in your case all the geometries have to be pulled down by DuckDB before it can start applying the filter indeed. In think a potential solution would be to include both the bbox range predicates and the buffer predicate in your query with another AND
clause as I think we can push down only the bbox range part into the parquet scan to prune out row groups while still performing the buffer containment check in duckdb.
I'm running
v0.10.2 1601d94f94
on Ubuntu for Windows.Each of the four iterations of the loop below takes around two hours to complete.
If there WHERE clause uses the following then a single iteration will take over 10 hours (I'm not sure exactly how long because I cancelled the operation).
I know it's comparing a radius rather than a box but I'm still surprised to see the perf difference. During the operation, I noticed my internet connection maxing out at 500 Mbps. I'm wondering if the entire geometry field is needing to be downloaded for every record and not just the matches. There are 100s of files being looked at but only one has any records so I wouldn't expect bandwidth to be a bottleneck.