georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0
386 stars 67 forks source link

Bulk loading using space-filling curve algorithm(s) #102

Open urschrei opened 2 years ago

urschrei commented 2 years ago

rstar currently supports bulk-loading using the OMT algorithm. However, using a Hilbert curve to order the nodes can result in a tree with better storage utilisation and less overlap. There is an extant implementation for an R tree at https://github.com/flatgeobuf/flatgeobuf/blob/master/src/rust/src/packed_r_tree.rs#L193 that should be relatively easily portable.

A related issue is then the creation of a new bulk-loading API which allows selection of the bulk loading algorithm, depending on the application.

urschrei commented 2 years ago

Some related issues:

  1. OMT is top-down, and uses a heuristic based on the number of entries to determine height and number of items per node (p71). Hilbert (and other space-filling curve algorithms) are bottom-up, and seem to use a default height of 16, which allows us to similarly calculate node sizes. However, it's not clear where the magic "16" number was arrived at, and under what conditions one might wish to tune that.

  2. It would be great to know how much of the existing bulk-loading machinery (https://github.com/georust/rstar/blob/master/rstar/src/algorithm/bulk_load/cluster_group_iterator.rs, https://github.com/georust/rstar/blob/master/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs#L14-L87) could be re-used for the Hilbert bulk load. Is it enough to calculate the leaf data bbox Hilbert score, sort, then recursively insert using the new order? Some pointers from @Stoeoef about the existing bulk-load functionality would be really useful here.

kylebarron commented 2 years ago
  1. it's not clear where the magic "16" number was arrived at, and under what conditions one might wish to tune that

Don't have a reference for this but I thought 16 might have been the maximum precision you can get with 32 bit coordinates

Stoeoef commented 1 year ago

I'm not sure if much of the existing functionality can be adapted to be honest - the top-down approach seems to be "too different" from the bottom up approach.

I don't think that this is a problem though - Hilbert curves are probably easier to implement (?) than the existing implementation anyway. The existing code is somewhat complex as it has to work in arbitrary dimensions. From memory, the current code successively partitions the elements along all axis and then recurses into the final clusters. For example: For two dimensions with a MAX_SIZE of 6, the algorithm would

This results in 2 * 3 = 6 sub-partitions with known AABB that will form a parent node in the rtree. Then, each sub-partition is broken down recursively until the number of items is small enough to be put into a leaf node.

For Hilbert curves, higher dimensions don't seem to be a concern for the rtree - it only matters for calculating the Hilbert score. That's why I can well imagine that this implementation is going to be quite different and simpler.

I think the reason why I've decided against using space filling curves is that I've never really understood how to adapt them well to data that is heavily skewed. From my understanding, the Hilbert curve wouldn't be able to assign different Hilbert scores for elements that are very close together, especially in higher dimensions. I didn't look too deeply into that though - maybe this is more a theoretical concern that doesn't happen in practice?

kylebarron commented 1 year ago

I think maybe it's worthwhile to split this into two issues? 1) a bulk loading API and 2) implementing OMT?

I think bulk loading is very useful for a lot of cases where you don't need the tree to be dynamic. I'm not familiar with the OMT algorithm; from the paper's introduction it seems they claim it's an improvement on STR?

I think the reason why I've decided against using space filling curves is that I've never really understood how to adapt them well to data that is heavily skewed

I've never worked with so heavily skewed data that a hilbert-packed r tree was a bad fit. I have noticed that hilbert-packed r tree implementations seem to first compute the total bounds of the data and constrain the scale of the hilbert values to that space. So if you're working with lon/lat data covering the US you don't need to compute hilbert values over [-180, -90, 180, 90], which helps with precision.

urschrei commented 1 year ago

implementing OMT

For the avoidance of confusion: OMT is the current bulk-loading algorithm – it's just that the API is intertwined with the implementation because in the distant past rstar didn't have a bulk-loading API and I asked for one using OMT, which @Stoeoef duly implemented.

kylebarron commented 1 year ago

Oh whoops I missed that a bulk loading API already existed 🤦‍♂️