Closed kylebarron closed 1 year ago
Could you post the example code? I just want to make sure I understand what's happening with the cached envelope idea.
Maybe "pre-computed" is a better word here. Consider MultiPolygon
(my geoarrow implementation) and MultiPolygonWithEnvelope
which contains the bounds already.
/// A MultiPolygon with cached bounding box, to be used when constructing an RTree
pub struct MultiPolygonWithEnvelope<'a> {
pub geom: MultiPolygon<'a>,
pub bounds: ([f64; 2], [f64; 2]),
}
impl<'a> From<MultiPolygon<'a>> for MultiPolygonWithEnvelope<'a> {
fn from(value: MultiPolygon<'a>) -> Self {
let bounds = bounding_rect_multipolygon(&value);
MultiPolygonWithEnvelope {
geom: value,
bounds,
}
}
}
impl<'a> RTreeObject for MultiPolygonWithEnvelope<'a> {
type Envelope = AABB<[f64; 2]>;
fn envelope(&self) -> Self::Envelope {
AABB::from_corners(self.bounds.0, self.bounds.1)
}
}
impl RTreeObject for MultiPolygon<'_> {
type Envelope = AABB<[f64; 2]>;
fn envelope(&self) -> Self::Envelope {
let (lower, upper) = bounding_rect_multipolygon(self);
AABB::from_corners(lower, upper)
}
}
The before and after is just whether the RTree is loaded using MultiPolygon
or MultiPolygonWithEnvelope
.
pub fn before(arr: &'_ MultiPolygonArray) -> RTree<MultiPolygon<'_>> {
let items: Vec<MultiPolygon<'_>> = arr.values_iter().collect();
RTree::bulk_load(items)
}
pub fn after(
arr: &'_ MultiPolygonArray,
) -> RTree<MultiPolygonWithEnvelope<'_>> {
let items: Vec<MultiPolygonWithEnvelope<'_>> =
arr.values_iter().map(|geom| geom.into()).collect();
RTree::bulk_load(items)
}
On my test dataset of 300k multipolygons, it was 7s with before()
and 360ms with after()
, where the only change is pre-computing the bounding rectangles before inserting to the rtree
Ahhh yes I see what you mean. I'm a bit torn on this, however: the bounds have to be calculated sometime, it's really up to you whether you want to split the cost across each instance of MultiPolygonWithEnvelope
or pay it all at once when you create the tree. After all, it's envisaged that the relatively high cost of creating it is offset by the fact that you save time when subsequently querying it. Is the time cost of pre-computing the bounds comparable to the tree creation time difference (6.xx seconds)?
Is the time cost of pre-computing the bounds comparable to the tree creation time difference (6.xx seconds)?
No this is my point: The 360ms in after()
includes calculating the bounds for every geometry. The geom.into()
passes to the From<MultiPolygon<'a>> for MultiPolygonWithEnvelope<'a>
and so that geom.into()
computes the bounding box there.
So I guess the real difference is that RTree insertion is O(n log(n))
, and so if you compute the bounding box as part of fn envelope()
in the RTreeObject
trait, then the bounding box creation step will also be O(n log(n))
. But if you pre-compute the bounding box then you can make that part of the algorithm O(n)
.
Oh. Wow. In that case I wonder whether this is an optimisation that's been missed somewhere or that we can move into the tree creation process. Is the extra time because bounding boxes are calculated repeatedly on tree members while they're being inserted? Maybe we could temporarily cache those results somehow. @Stoeoef Any ideas as to what's going on?
Is the extra time because bounding boxes are calculated repeatedly on tree members while they're being inserted?
That was my assumption but I haven't looked into the code yet to confirm
Idly thinking about this a bit more, (perhaps obviously) I don't think we can cache the envelope in geo-types
geometry structs because geometries can change, so we'd potentially be paying the recalculation cost n times during e.g. coord iteration of a LineString
, or massively complicating the way that we mutate geometries to ensure that we only "finalise" the new envelope at the end of a mutation op. Either way, it doesn't seem feasible.
That leaves us with the possibility of caching the envelopes in RStar's loading mechanism. Two questions arise:
HashMap
keyed on the point coordinates?bulk_load_recursive
function, or the PartititioningTask
or its Iterator
impl?I'm not sure how we might try to experiment here. @rmanoka @adamreichold if you have any ideas feel free to chime in.
How do we cache the points? A fast HashMap keyed on the point coordinates?
I think since we basically memoize the function envelope(&self) -> Self::Envelope
, that something like HashMap<*const Self, Self::Envelope>
, i.e. keyed on object identity would be appropriate.
I think this kind of caching also has some precedence as e.g. BTreeMap
also assumes that its keys do not change their ordering (which would only be possibly via interior mutability in any case).
Agree with @adamreichold in principle. The HashMap
is not a bad idea, but would be a bit code-invasive imho. Looking at the code, the majority of envelope calls or happening via RTreeNode
defined as:
pub enum RTreeNode<T>
where
T: RTreeObject,
{
/// A leaf node, only containing the r-tree object
Leaf(T),
/// A parent node containing several child nodes
Parent(ParentNode<T>),
}
Now, ParentNode<T>
seems to already be memoizing it's envelope (because it's a collection of rtree-nodes); so we might get most of the benefits by adding the memoization here: Leaf(T)
-> Leaf(T, T::Envelope)
? Then we'll just have to adjust the RTreeObject
impl of the RTreeNode
which should reflect in most of the core algo to cache I think.
@rmanoka I've got a first pass of the modified Leaf
field in #116. I'm trying to figure out where else other than
and
to use the memoized value – so far there's no effect on the bulk-load baseline benchmark.
I don't know where my original reproduction code is (shame on me 😅); I would expect it straightforward to reproduce by inserting geo-types polygons. I can come up with a repro case if you need
@urschrei Yes, that's exactly what I'd in mind too. The benchmarks seems to be using points, so they might not benefit much from the caching. I guess complex polygons will see the benefit.
@kylebarron could you please provide us the polygons generation code? Something like this function, but with polygons instead.
I've been testing inserting geometric objects into rstar. I noticed a ~20x speedup with some test data (300k
MultiPolygon
s) between inserting geometries whereRTreeObject::envelope()
was implemented with a direct call tobounding_rect
to inserting wrapper structs that include both the geometry and a pre-cached envelope. This 20x speedup includes the time in the latter case to loop over the geometries and create each's envelope.In retrospect, it's not too surprising that
envelope()
would need to be called multiple times, but I didn't see a mention of it on theRTreeObject
page.