georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0

Document in `RTreeObject` that `envelope()` should be cached #108

Closed · kylebarron closed this 1 year ago

kylebarron commented 1 year ago

I've been testing inserting geometric objects into rstar. With some test data (300k MultiPolygons) I saw a ~20x speedup between inserting geometries whose RTreeObject::envelope() is implemented as a direct call to bounding_rect and inserting wrapper structs that hold both the geometry and a pre-computed envelope. That 20x speedup includes the time spent, in the latter case, looping over the geometries to compute each one's envelope.

In retrospect, it's not too surprising that envelope() would need to be called multiple times, but I didn't see a mention of it on the RTreeObject page.

urschrei commented 1 year ago

Could you post the example code? I just want to make sure I understand what's happening with the cached envelope idea.

kylebarron commented 1 year ago

Maybe "pre-computed" is a better word here. Consider MultiPolygon (my geoarrow implementation) and MultiPolygonWithEnvelope which contains the bounds already.

use rstar::{RTreeObject, AABB};

// MultiPolygon and bounding_rect_multipolygon come from my geoarrow implementation.

/// A MultiPolygon with cached bounding box, to be used when constructing an RTree
pub struct MultiPolygonWithEnvelope<'a> {
    pub geom: MultiPolygon<'a>,
    pub bounds: ([f64; 2], [f64; 2]),
}

impl<'a> From<MultiPolygon<'a>> for MultiPolygonWithEnvelope<'a> {
    fn from(value: MultiPolygon<'a>) -> Self {
        let bounds = bounding_rect_multipolygon(&value);
        MultiPolygonWithEnvelope {
            geom: value,
            bounds,
        }
    }
}

impl<'a> RTreeObject for MultiPolygonWithEnvelope<'a> {
    type Envelope = AABB<[f64; 2]>;

    fn envelope(&self) -> Self::Envelope {
        AABB::from_corners(self.bounds.0, self.bounds.1)
    }
}

impl RTreeObject for MultiPolygon<'_> {
    type Envelope = AABB<[f64; 2]>;

    fn envelope(&self) -> Self::Envelope {
        let (lower, upper) = bounding_rect_multipolygon(self);
        AABB::from_corners(lower, upper)
    }
}

The only difference between before and after is whether the RTree is loaded with MultiPolygon or with MultiPolygonWithEnvelope.

pub fn before(arr: &'_ MultiPolygonArray) -> RTree<MultiPolygon<'_>> {
    let items: Vec<MultiPolygon<'_>> = arr.values_iter().collect();
    RTree::bulk_load(items)
}

pub fn after(
    arr: &'_ MultiPolygonArray,
) -> RTree<MultiPolygonWithEnvelope<'_>> {
    let items: Vec<MultiPolygonWithEnvelope<'_>> =
        arr.values_iter().map(|geom| geom.into()).collect();
    RTree::bulk_load(items)
}

On my test dataset of 300k multipolygons, before() took 7s and after() took 360ms, where the only change is pre-computing the bounding rectangles before inserting into the R-tree.

urschrei commented 1 year ago

Ahhh yes, I see what you mean. I'm a bit torn on this, however: the bounds have to be calculated at some point, so it's really up to you whether you want to split that cost across each instance of MultiPolygonWithEnvelope or pay it all at once when you create the tree. After all, it's envisaged that the relatively high cost of creating the tree is offset by the time you save when subsequently querying it. Is the time cost of pre-computing the bounds comparable to the tree creation time difference (6.xx seconds)?

kylebarron commented 1 year ago

Is the time cost of pre-computing the bounds comparable to the tree creation time difference (6.xx seconds)?

No, this is my point: the 360ms in after() includes calculating the bounds for every geometry. The geom.into() call dispatches to the From<MultiPolygon<'a>> impl for MultiPolygonWithEnvelope<'a>, so the bounding box is computed there.

So I guess the real difference is that RTree insertion is O(n log n), so if the bounding box is computed inside fn envelope() of the RTreeObject trait, the bounding-box computation also ends up being performed O(n log n) times. If you pre-compute the bounding boxes instead, that part of the algorithm drops to O(n).
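
To make the repeated calls visible, here's a minimal, hypothetical instrumentation sketch (CountingPoint and ENVELOPE_CALLS are illustrative names, not part of rstar) that counts how often envelope() runs during a bulk load:

use std::sync::atomic::{AtomicUsize, Ordering};

use rstar::{RTree, RTreeObject, AABB};

// Counts how often envelope() is invoked (illustrative only).
static ENVELOPE_CALLS: AtomicUsize = AtomicUsize::new(0);

struct CountingPoint([f64; 2]);

impl RTreeObject for CountingPoint {
    type Envelope = AABB<[f64; 2]>;

    fn envelope(&self) -> Self::Envelope {
        ENVELOPE_CALLS.fetch_add(1, Ordering::Relaxed);
        AABB::from_point(self.0)
    }
}

fn main() {
    let items: Vec<CountingPoint> = (0..100_000)
        .map(|i| CountingPoint([i as f64, ((i * 7) % 1000) as f64]))
        .collect();
    let n = items.len();
    let _tree = RTree::bulk_load(items);
    // If envelope() were evaluated once per object, this would print n calls;
    // the point of this issue is that it's evaluated many more times during
    // bulk loading, so an expensive envelope() dominates the load time.
    println!(
        "objects: {}, envelope() calls: {}",
        n,
        ENVELOPE_CALLS.load(Ordering::Relaxed)
    );
}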

urschrei commented 1 year ago

Oh. Wow. In that case I wonder whether this is an optimisation that's been missed somewhere or that we can move into the tree creation process. Is the extra time because bounding boxes are calculated repeatedly on tree members while they're being inserted? Maybe we could temporarily cache those results somehow. @Stoeoef Any ideas as to what's going on?

kylebarron commented 1 year ago

Is the extra time because bounding boxes are calculated repeatedly on tree members while they're being inserted?

That was my assumption, but I haven't looked into the code yet to confirm.

urschrei commented 1 year ago

Idly thinking about this a bit more: (perhaps obviously) I don't think we can cache the envelope in geo-types geometry structs, because geometries can change. We'd either potentially pay the recalculation cost n times during e.g. coord iteration of a LineString, or massively complicate the way that we mutate geometries to ensure that we only "finalise" the new envelope at the end of a mutation op. Either way, it doesn't seem feasible.

That leaves us with the possibility of caching the envelopes in RStar's loading mechanism. Two questions arise:

  1. How do we cache the points? A fast HashMap keyed on the point coordinates?
  2. Where do we initialise and store the cache, and how do we pass it around? In the bulk_load_recursive function, or in the PartitioningTask or its Iterator impl?

I'm not sure how we might try to experiment here. @rmanoka @adamreichold if you have any ideas feel free to chime in.

adamreichold commented 1 year ago

How do we cache the points? A fast HashMap keyed on the point coordinates?

Since we're basically memoizing the function envelope(&self) -> Self::Envelope, I think something like HashMap<*const Self, Self::Envelope>, i.e. keyed on object identity, would be appropriate.

I think this kind of caching also has some precedent: e.g. BTreeMap also assumes that its keys do not change their ordering (which would only be possible via interior mutability in any case).
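
For illustration, a minimal sketch of what such an identity-keyed memoization could look like (EnvelopeCache is a hypothetical helper, not rstar API):

use std::collections::HashMap;

use rstar::RTreeObject;

// Hypothetical helper: memoizes envelope() results keyed on object identity
// (the object's address), as suggested above.
struct EnvelopeCache<T: RTreeObject> {
    cache: HashMap<*const T, T::Envelope>,
}

impl<T: RTreeObject> EnvelopeCache<T> {
    fn new() -> Self {
        EnvelopeCache { cache: HashMap::new() }
    }

    // Computes the envelope at most once per distinct object address.
    fn envelope(&mut self, object: &T) -> T::Envelope {
        self.cache
            .entry(object as *const T)
            .or_insert_with(|| object.envelope())
            .clone()
    }
}

One caveat with pointer keys is that they are only stable while the objects don't move in memory, which is part of what would make threading such a cache through the loading code invasive.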

rmanoka commented 1 year ago

Agree with @adamreichold in principle. The HashMap is not a bad idea, but it would be a bit code-invasive imho. Looking at the code, the majority of envelope calls are happening via RTreeNode, defined as:

pub enum RTreeNode<T>
where
    T: RTreeObject,
{
    /// A leaf node, only containing the r-tree object
    Leaf(T),
    /// A parent node containing several child nodes
    Parent(ParentNode<T>),
}

Now, ParentNode<T> already seems to memoize its envelope (because it's a collection of r-tree nodes), so we might get most of the benefit by adding the memoization here: Leaf(T) -> Leaf(T, T::Envelope)? Then we'd just have to adjust the RTreeObject impl of RTreeNode, which should propagate the caching through most of the core algorithm, I think.
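
For illustration, a much-simplified sketch of the shape of that change (CachedNode is hypothetical; the real rstar node types differ):

use rstar::RTreeObject;

// Sketch of the Leaf(T, T::Envelope) idea: the leaf stores its envelope next
// to the object, so envelope() becomes a cheap lookup instead of a recomputation.
enum CachedNode<T: RTreeObject> {
    Leaf(T, T::Envelope),
    Parent(Vec<CachedNode<T>>, T::Envelope),
}

impl<T: RTreeObject> CachedNode<T> {
    fn leaf(object: T) -> Self {
        // The envelope is computed exactly once, when the leaf is created.
        let envelope = object.envelope();
        CachedNode::Leaf(object, envelope)
    }

    fn envelope(&self) -> &T::Envelope {
        match self {
            CachedNode::Leaf(_, envelope) | CachedNode::Parent(_, envelope) => envelope,
        }
    }
}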

urschrei commented 1 year ago

@rmanoka I've got a first pass of the modified Leaf field in #116. I'm trying to figure out where else other than

https://github.com/georust/rstar/pull/116/files#diff-31af77a9bc94272358f1b9ba61db572f5054a039947868df24ca42b3e82d5fafR55

and

https://github.com/georust/rstar/pull/116/files#diff-31af77a9bc94272358f1b9ba61db572f5054a039947868df24ca42b3e82d5fafR139-R140

to use the memoized value – so far there's no effect on the bulk-load baseline benchmark.

kylebarron commented 1 year ago

I don't know where my original reproduction code is (shame on me 😅); I would expect it to be straightforward to reproduce by inserting geo-types polygons. I can come up with a repro case if you need one.

rmanoka commented 1 year ago

@urschrei Yes, that's exactly what I had in mind too. The benchmarks seem to be using points, so they might not benefit much from the caching. I guess complex polygons will see the benefit.

@kylebarron could you please provide us with the polygon generation code? Something like this function, but with polygons instead.