Toblerity / rtree

Rtree: spatial index for Python GIS
https://rtree.readthedocs.io
MIT License

Add bulk insert/query API #337

Open FreddieWitherden opened 1 week ago

FreddieWitherden commented 1 week ago

One way to greatly reduce overhead is to add bulk insert and query APIs based on NumPy arrays. For example, the most efficient way to create an index at present is to pass in a generator. This generator is converted to a callback, which is then passed to the C API. In the non-object case this is considerably less efficient than passing a NumPy array of (N,) ids and an array of (N, 2*ndim) coordinates. With the latter, only a single foreign function call would be required, taking two pointers and a count.
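
To make the "two pointers and a count" idea concrete, here is a minimal sketch of the memory layout such a call would consume. It uses plain ctypes buffers so that it has no dependencies; in practice the buffers would come straight from contiguous NumPy arrays. The helper `pack_boxes`, the int64 id type, the mins-then-maxs coordinate order, and the `bulk_insert` name in the comment are all assumptions for illustration, not existing rtree or libspatialindex API.

```python
import ctypes

def pack_boxes(items, ndim=2):
    """Pack (id, bounds) pairs into two flat C buffers.

    Returns (ids, coords, n): ids holds n int64 values and coords holds
    n * 2 * ndim doubles, one row of bounds per item, stored row-major.
    """
    n = len(items)
    width = 2 * ndim
    ids = (ctypes.c_int64 * n)()
    coords = (ctypes.c_double * (n * width))()
    for row, (idx, bounds) in enumerate(items):
        if len(bounds) != width:
            raise ValueError(f"expected {width} coordinates, got {len(bounds)}")
        ids[row] = idx
        # Row `row` occupies the slice [row * width, (row + 1) * width).
        coords[row * width:(row + 1) * width] = bounds
    return ids, coords, n

# Assumed ordering within a row: (minx, miny, maxx, maxy).
items = [
    (10, (0.0, 0.0, 1.0, 1.0)),
    (11, (2.0, 2.0, 3.0, 3.0)),
    (12, (0.5, 0.5, 2.5, 2.5)),
]
ids, coords, n = pack_boxes(items)

# A hypothetical bulk entry point would receive exactly these three values:
#   bulk_insert(index_handle, ids_ptr, coords_ptr, n)   # not a real function
ids_ptr = ctypes.cast(ids, ctypes.POINTER(ctypes.c_int64))
coords_ptr = ctypes.cast(coords, ctypes.POINTER(ctypes.c_double))
```

The Python loop in `pack_boxes` exists only to build the demonstration buffers. With data already held in NumPy arrays there would be no per-item Python work at all, which is the point of the proposal.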

Similar benefits could be had on the query side, where one could pass an array of (N, 2*ndim) coordinates to the C API. The resulting ids could then be returned as a pair of arrays: one of size (M,) holding the ids themselves, and another of size (N + 1,) holding displacements into this id array (indicating where the results for each query begin). This would again greatly reduce the time spent in Python/ctypes and would also simplify memory management.
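
As an illustration of how a caller might consume that pair of arrays, here is a small sketch with made-up results for N = 3 queries. Plain lists stand in for the arrays, and the names `ids`, `offsets`, and `hits_for` are hypothetical.

```python
# Made-up results for three queries: query 0 matched ids 10 and 12,
# query 1 matched nothing, and query 2 matched id 11.
ids = [10, 12, 11]       # size M = 3: all matches, concatenated in query order
offsets = [0, 2, 2, 3]   # size N + 1 = 4: offsets[i] is where query i begins

def hits_for(query, ids, offsets):
    """Return the ids matched by one query as a slice of the flat id array."""
    return ids[offsets[query]:offsets[query + 1]]

# The number of matches per query falls out of adjacent displacements.
counts = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
```

The extra final entry (`offsets[-1] == M`) is what lets every query, including the last, be sliced the same way, and an empty result is simply two equal adjacent displacements.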

From an implementation standpoint this would require some minor additions to libspatialindex, but they should be relatively straightforward (as a first pass, simple loops that call the C++ routines would suffice). The result would be a 'fast path' for when one does not need object support and is working with data already in a native array format.
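
The "simple loop" first pass might look roughly like the following, modelled here in Python rather than C++ so that it can be run as-is. `query_one` is a brute-force stand-in for the existing single-box query routine, and `query_bulk` is a hypothetical name; neither is part of libspatialindex.

```python
def intersects(a, b, ndim=2):
    """True if boxes a and b (all mins, then all maxs) overlap or touch."""
    return all(a[d] <= b[d + ndim] and b[d] <= a[d + ndim] for d in range(ndim))

def query_one(stored, box):
    """Stand-in for the existing per-box query: brute force over all items."""
    return [idx for idx, bounds in stored if intersects(bounds, box)]

def query_bulk(stored, boxes):
    """Run every query in one call and return (ids, offsets)."""
    ids, offsets = [], [0]
    for box in boxes:
        ids.extend(query_one(stored, box))
        offsets.append(len(ids))  # end of this query, start of the next
    return ids, offsets

stored = [
    (10, (0.0, 0.0, 1.0, 1.0)),
    (11, (2.0, 2.0, 3.0, 3.0)),
    (12, (0.5, 0.5, 2.5, 2.5)),
]
boxes = [
    (0.0, 0.0, 0.75, 0.75),   # overlaps items 10 and 12
    (5.0, 5.0, 6.0, 6.0),     # overlaps nothing
    (2.75, 2.75, 4.0, 4.0),   # overlaps item 11 only
]
ids, offsets = query_bulk(stored, boxes)
```

The benefit in the real implementation would come from running this loop on the native side, so that the language boundary is crossed once per batch instead of once per query.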

hobu commented 1 week ago

> From an implementation standpoint this would require some minor additions to libspatialindex but they should be relatively straightforward

Yes, please!