duckdb / duckdb_spatial

MIT License
425 stars 32 forks source link

Feature: Add Spatial Index support #7

Open Maxxen opened 1 year ago

Maxxen commented 1 year ago

Ideally we implement an R*-TREE, but we should research what the current best bulk-loading algorithm is, since we basically only do bulk-loads by indexing vectors at a time.

A first step might be to just use the ART index on a bounding box expression, or use some clever space-filling curve encoding like a hilbert curve or geohash. We could also look into alternative encodings like "quad keys" or "s2/h3 cells". Once we have some spatial predicate functions implemented we should implement optimizer rules to look for these sorts of indexes and compare how useful they are performance-wise. Its probably not gonna beat an R-Tree but will probably be easier to implement to start with.

joshhopkins commented 1 year ago

Would love to see S2/H3 cell support.

igor-suhorukov commented 1 year ago

My publication and OSM converter could be useful for H3 OpenStreetMap data indexing and as real world dataset for DuckDB Spatial extension.

colinjamesstewart commented 8 months ago

@Maxxen Excited to see this on the roadmap -- having spatial indices would make my organization seriously consider using DuckDB extensively. Is there a (very) approximate timeline for adding this feature? Much appreciated, thanks.

Maxxen commented 8 months ago

Hi @colinjamesstewart! Unfortunately not, we might maybe be able to start prototyping non-peristent in-memory only indexes before the end of the year. But would you mind telling me a little bit about the workflows you do that makes you think the lack of spatial indexes would be a blocker right now? We recently landed a pretty big optimization for spatial predicate joins, and we can do the same for within-range/knearest joins too.

colinjamesstewart commented 8 months ago

Thanks, @Maxxen. We have a number of spatial databases with millions or even billions of records across Canada and the US, and we often need to do spatial queries (e.g. ST_Contains, ST_Intersection, etc.) on specific regions, such as cities and metro areas. I'll look into how well DuckDB handles queries on these databases, but in my experience, spatial indexing is essential when working with this amount of data.

Maxxen commented 8 months ago

@colinjamesstewart For sure, I understand that once you reach billions of records, being able to prune out intersection candidates at the storage level becomes more or less mandatory, but at city-region scale we are pretty efficient at brute-forcing it. There's still a lot to figure out when it comes to custom indexes in DuckDB in general but we progressing towards that over at the core project and hopefully we will be able to utilize that work in spatial soon.

colinjamesstewart commented 8 months ago

@Maxxen I ran a quick spatial-query performance test (using ST_Intersects) to compare DuckDB with Sedona, which we'd previously been using. DuckDB was often 30x faster. Performance starts to degrade when going from a table with millions of records to one with billions, but the performance on tables with <50M records is pretty impressive given that there's no spatial indexing.

sedot42 commented 5 months ago

Just another use case here: We would like to do 2D points-in-polygon-filtering on a point cloud (xyz data) with 4B points. Performance decreases roughly linearly with table size. With 700k points a query takes about 0.07s, with 7M points about 0.7s, with 70M points about 7s. With 4B points, analysis is a bit cumbersome. If the data fits in memory, we can do much faster queries with GeoPandas and its spatial index (R-tree afaik).

raphaellaude commented 4 months ago

Does duckdb-spatial utilize flatgeobuf spatial indices if they're available? Since they're optionally part of the file itself supporting spatial indexing for flatgeobuf files might be a nice intermediate step on the way to full spatial index support.

Maxxen commented 4 months ago

Yes, since our flatgeobuf support is based on gdal we do support pruning when reading flatgeobuf files, there is an optional "filter_bbox" argument you can pass to st_read to use it.

cboettig commented 4 months ago

Do you think the h3 extension, https://github.com/isaacbrodsky/h3-duckdb, might become an official submodule of this module at some point? I think it could be very valuable, but unless I'm miss-understanding, presently users need to build that module with duckdb from source?

Maxxen commented 4 months ago

Yes and no. We might add some utility functions in spatial to make it easier to integrate with H3 andI've briefly exchanged a few words with Isaac about DuckDB labs eventually adopting and distributing the H3 extension as a separate extension. But we dont really have a set policy on adopting third party extensions taking into account e.g. security, maintenance and support.

However, we are actively working on enabling support for some form of custom extension repositories that you can opt-into so that you can do e.g. INSTALL <extension> FROM <some repository>, with the goal of us being able to set up an official "community" extension repository for commonly used third-party extensions, where do some basic vetting, set up nightly/weekly builds and host binaries. The H3 extension would be an ideal candidate to include in this repository of community extensions once we get something up.