georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0
386 stars 67 forks source link

How to use RTree as a set? #82

Open akriegman opened 2 years ago

akriegman commented 2 years ago

I am trying to do the following operations on an RTree so that I can use it as a set:

For the first operation this can be done with existing methods on RTree:

if !tree.contains(&p) { tree.insert(p); }

However, while I don't know the internals of RTree I'm guessing this approach traverses the tree twice when it's only necessary to traverse it once. Am I right that we could do this operation more efficiently if we worked with the internals of RTree, for example adding an insert_unique method? The above approach is probably good enough for my use case anyways.

The second operation is a little harder. Here is my code for merging two trees if I didn't care about duplicates:

let tree = RTree::bulk_load(
    left.iter()
    .chain(right.iter())
    .copied()
    .collect()
);

From there I guess I could iterate over the tree checking whether each element equals it's second nearest neighbor to remove duplicates, but this feels hacky. I'd imagine that if we had some sort of insert_unique method, then we could rewrite bulk_load using insert_unique in place of insert to get a method that makes a tree without duplicates.

Another alternative is I can just accept duplicates in my set. Then in order to pop / remove an element I would have to remove all instances like this:

while let Some(_) = tree.remove(&p) {}

which wouldn't be too costly if there's only two or three of each element at most.

So I think my options for using RTree as a set are:

So what are people's thoughts? Is anyone else interested in having an RTreeSet in rstar? How much worse will my performance be if I continue using the hacky lazy approach?

rmanoka commented 2 years ago

However, while I don't know the internals of RTree I'm guessing this approach traverses the tree twice when it's only necessary to traverse it once. Am I right that we could do this operation more efficiently if we worked with the internals of RTree, for example adding an insert_unique method? The above approach is probably good enough for my use case anyways.

Yeah, insert_unique would indeed be quite useful.

The second operation is a little harder. Here is my code for merging two trees if I didn't care about duplicates:

let tree = RTree::bulk_load(
    left.iter()
    .chain(right.iter())
    .copied()
    .collect()
);

From there I guess I could iterate over the tree checking whether each element equals it's second nearest neighbor to remove duplicates, but this feels hacky. I'd imagine that if we had some sort of insert_unique method, then we could rewrite bulk_load using insert_unique in place of insert to get a method that makes a tree without duplicates.

The use-case for bulk_load is to load a slice faster than iteratively. Using insert_unique, the user can iterate, but I guess the bulk load logic itself would have to be tweaked a bit to handle unique-ness.

* Work with the internals of `RTree` to do these operations more efficiently. I could either structure this as a wrapper struct around `RTree` in my own crate, or contribute this to rstar. I'd imagine I'd have to be the one to contribute this to rstar, since I think this use case is not a very common one.

Pls. do provide a PR if possible for insert_unique! It seems to involve tweaking a few core parts of the algo: RTreeInsertionStrategy and InsertionStrategy. It's not a huge concern, but this may be a breaking change.

urschrei commented 2 years ago

I guess the bulk load logic itself would have to be tweaked a bit to handle unique-ness.

We could provide a bulk_load_unique method which uses e.g. a HashSet to dedupe the provided vec (elements have to be Copy, and it's an O(n) op) and we wouldn't have to tweak the insertion logic otherwise. It's growing the API, but I tend to think of things in terms of OGC geoms where we don't tend to see much duplication, but that's actually a narrow use case, so if ppl are OK with a new method (no dupes potentially produces a higher-quality tree too)…

adamreichold commented 2 years ago

We could provide a bulk_load_unique method which uses e.g. a HashSet to dedupe the provided vec (elements have to be Copy, and it's an O(n) op) and we wouldn't have to tweak the insertion logic otherwise.

In this case, wouldn't it be preferable to keep the deduplication outside of the rstar API so that the calling code could choose the most appropriate way to implement the deduplication, e.g. sorting the vector or hashing only a key type?

Stoeoef commented 2 years ago

In this case, wouldn't it be preferable to keep the deduplication outside of the rstar API so that the calling code could choose the most appropriate way to implement the deduplication, e.g. sorting the vector or hashing only a key type?

I think offering a deduplication method based on sorting would be a nice addition. It should also be fairly straight forward to implement.

I've also looked briefly into the bulk loading code and don't see a good way to do the de-duplication "on the fly". The algorithm really expects that all elements of the source collection will become an element in the final r-tree.

Stoeoef commented 2 years ago

One open question for me: Should we implement insert_unique or rather something like the entry API for HashMaps?

The latter would give the caller a few more options, e.g.

rtree.entry(some_bounding_box).or_insert(some_new_element); // This is the equivalent to insert_unique

rtree.entry(some_bounding_box).or_default().counter += 1; // Assuming we're inserting some objects with a counter field

I would also be fine with an insert_unique method for now and possibly implementing an entry like API later on, deprecating insert_unique.

adamreichold commented 2 years ago

One open question for me: Should we implement insert_unique or rather something like the entry API for HashMaps?

I think an entry-style API is definitely preferable but would possibly also imply more work.

I think there are also a few complications for set-style in contrast to map-style usage, e.g. you write rtree.entry(some_bounding_box) instead of rtree.entry(some_element) which raises the question (maybe also for insert_unique) whether equality should only consider the envelope or any other data possibly attached to the items.

Also, the standard library's and hashbrown's HashSet do not actually expose an entry method, but rather just get_or_insert_with which starts the look-up from any borrowed form of the element type, c.f. https://github.com/rust-lang/rust/pull/60894 and https://github.com/rust-lang/hashbrown/pull/98

akriegman commented 2 years ago

For the record, I went with the simple solution where I just hacked stuff together using the existing API, and I'm getting quite good performance for my use case. For the union we can remove duplicates in O(nlogn) time using the existing tree structure, and the bulk load is O(nlogn) anyways:

  fn union(lval: RTree, mut rval: RTree) -> RTree {
    lval.iter().for_each(|p| {
      rval.remove(p);
    });
    RTree::bulk_load(lval.iter().chain(rval.iter()).copied().collect());
  }

And for inserting I just check if it's already a member as I mentioned above. After looking at the code for insert, it looks fairly complex and I believe it can traverse the tree several times if it reinserts, so in comparison I think contains is a much smaller operation.