While this does seem straightforward to implement, doesn't it have the downside that we keep the cached envelopes around instead of throwing them away after building? There is probably no additional memory consumption in many cases if `T` is smaller than `Vec<RTreeNode<T>>`, so that `RTreeNode` does not grow, but if `T` is larger than `Vec<RTreeNode<T>>`, this could lead to higher long-term memory consumption.
(As a general observation unrelated to the question of caching envelopes, I suspect that storing a single `Vec<RTreeNode<T>>` and having `children: Range<usize>` in `ParentNode`, i.e. tree-wide arena allocation of nodes, could have a positive performance impact.)
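A hypothetical sketch of that layout; `ArenaTree` and `ArenaNode` are illustrative, not rstar's actual types:

```rust
use std::ops::Range;

// Tree-wide arena: all nodes live in a single Vec owned by the tree, and a
// parent addresses its children by an index range into that Vec instead of
// owning a Vec<RTreeNode<T>> of its own.
struct ArenaTree<T, E> {
    nodes: Vec<ArenaNode<T, E>>,
}

enum ArenaNode<T, E> {
    Leaf(T),
    Parent { envelope: E, children: Range<usize> },
}
```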
This is going to be super-annoying to test because rstar doesn't know what a linestring is. The more I have to actually work with this crate, the more I find that every aspect of its interaction with the `geo` ecosystem is filled with papercuts at every turn.
We really need to do something here: a fast spatial index is a cornerstone of a massive amount of functionality for `geo`'s algorithms, and every attempt to improve anything is foundering because you first have to re-invent the wheel. I just don't have time to spend two days figuring out how to integrate `wkt` just so I can dump a few arrays into a data structure.
@urschrei Agreed; it's not the best yet. I've added some simple polygon-based tests here. Please merge it if it looks good. The bulk loading was actually not using the `RTreeNode` optimization we've added, and I've fixed that too. It gives a decent 10x boost even on polygons with 64 sides.
```
Bulk load complex geom  time:   [398.98 µs 400.06 µs 401.10 µs]
                        change: [-99.043% -99.040% -99.037%] (p = 0.00 < 0.05)
                        Performance has improved.

Bulk load complex geom  time:   [5.1578 ms 5.1630 ms 5.1721 ms]
                        change: [+1189.8% +1193.3% +1196.6%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
I do agree with @adamreichold's concern about space too. This will blow up the memory usage of the rtree for the "usual" point case by about 3x. I'm for merging this, though.
@rmanoka Amazing, thank you!
Regarding memory use, could we specify three `RTreeNode` variants? That would allow us to choose a memoizing (fast bulk inserts, larger memory use) version or the existing one (slower inserts, lower memory use). It would mean some extra complexity as we match on the leaf variant and then choose a bulk load function. Very much thinking out loud here, though…

(I'm inclined to merge as-is soon, given the improvement, but happy for more discussion.)
> Regarding memory use, could we specify three RTreeNode variants? That would allow us to choose a memoizing (fast bulk inserts, larger memory use) version or the existing one (slower inserts, lower memory use). It would mean some extra complexity as we match on the leaf variant and then choose a bulk load function. Very much thinking out loud here, though…
Enums have the size of their largest variant plus space for the tag, so without additional boxing this would not help. Also, due to the additional complexity, I think we would fare better just doing the work of wiring in an explicit cache like `HashMap<*const T, T::Envelope>` which can be summarily dropped after building is done.
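A quick illustration of the size argument, with throwaway types standing in for rstar's:

```rust
use std::mem::size_of;

// An enum occupies the size of its largest variant plus the discriminant
// (and padding), so adding a memoizing variant makes *every* node pay for
// the largest layout; it cannot shrink the node type.
enum Node<T, E> {
    Leaf(T),
    MemoizedLeaf(T, E),
    Parent(Vec<Node<T, E>>),
}

fn main() {
    assert!(
        size_of::<Node<[f64; 2], [f64; 4]>>()
            >= size_of::<[f64; 2]>() + size_of::<[f64; 4]>()
    );
}
```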
> Enums have the size of their largest variant plus space for the tag, so without additional boxing this would not help

Ugh, completely forgot.
> I think we would fare better just doing the work of wiring in an explicit cache like HashMap<*const T, T::Envelope> which can be summarily dropped after building is done.

Where would it "live" though?
> Where would it "live" though?

My first try would be on the stack at the top of `bulk_load_sequential`, passed as `&mut HashMap<*const T, T::Envelope>` into `bulk_load_recursive` and from there threaded through `PartitioningTask` etc.
If my reading of the code is correct, the main change to use an explicit on-stack cache, besides threading parameters through, would be that `Envelope::partition_envelopes` needs to access the cache.
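For illustration, a lookup helper around such a cache might look like this (`cached_envelope` is a hypothetical name, not an rstar function):

```rust
use std::collections::HashMap;

use rstar::RTreeObject;

// Hypothetical cache lookup: key the envelope by the element's address and
// compute it only on first access; the whole map is dropped after building.
fn cached_envelope<'a, T: RTreeObject>(
    cache: &'a mut HashMap<*const T, T::Envelope>,
    element: &T,
) -> &'a T::Envelope {
    cache
        .entry(element as *const T)
        .or_insert_with(|| element.envelope())
}
```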
Theoretically, we could even pass it `remaining: &mut [(T, T::Envelope)]` and avoid the hash table entirely, at the cost of computing envelopes up front instead of lazily on demand. (I think this would mean that we uselessly compute the envelopes of the leaf nodes. Has anyone any idea how expensive this would be?)
So something like the following diff for eagerly computing the envelopes.
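A minimal sketch of the eager pairing (a stand-in, not the actual diff referenced above; the `with_envelopes` helper is hypothetical):

```rust
use rstar::RTreeObject;

// Hypothetical helper: pair every element with its envelope once, up front,
// so the bulk loader can partition on precomputed envelopes instead of
// recomputing them on every comparison.
fn with_envelopes<T: RTreeObject>(elements: Vec<T>) -> Vec<(T, T::Envelope)> {
    elements
        .into_iter()
        .map(|element| {
            let envelope = element.envelope();
            (element, envelope)
        })
        .collect()
}
```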
Doing it lazily using `once_cell::unsync::OnceCell` also seems reasonable, even though I would want to see benchmarks showing that it is actually worth it compared to computing all the envelopes up front, which most likely has much better cache locality and could also be trivially parallelized:
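A sketch of what that could look like; the `LazyEnvelope` wrapper is illustrative, not part of rstar:

```rust
use once_cell::unsync::OnceCell;
use rstar::RTreeObject;

// Illustrative lazy wrapper: the envelope is computed at most once, and only
// if the partitioning actually asks for it.
struct LazyEnvelope<'a, T: RTreeObject> {
    element: &'a T,
    envelope: OnceCell<T::Envelope>,
}

impl<'a, T: RTreeObject> LazyEnvelope<'a, T> {
    fn new(element: &'a T) -> Self {
        Self {
            element,
            envelope: OnceCell::new(),
        }
    }

    fn envelope(&self) -> &T::Envelope {
        self.envelope.get_or_init(|| self.element.envelope())
    }
}
```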
I guess the call to `envelope_for_children` in `ParentNode::new_parent` should also make use of the cached envelopes if out-of-line storage is used. I think this suggests that the lazy variant would not be useful after all, since this would access exactly those envelopes which might not have been accessed during selection.
> I guess the call to `envelope_for_children` in `ParentNode::new_parent` should also make use of the cached envelopes if out-of-line storage is used. I think this suggests that the lazy variant would not be useful after all, since this would access exactly those envelopes which might not have been accessed during selection.
I've just finished applying your first diff, and the timing vs @rmanoka's changes (baseline is 873.54 µs) is not encouraging so far:

```
Bulk load complex geom  time:   [148.10 ms 148.51 ms 148.91 ms]
                        change: [+15882% +16210% +16515%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
See updates below: this wasn't a valid comparison due to tree size differences.
I've opened a draft PR with your initial suggested changes and the new benchmark: https://github.com/georust/rstar/pull/117
If you check out that branch or PR you can push directly to it – I'm not sure where you'd like the `new_parent` change to go.
> I've just finished applying your first diff, and the timing vs @rmanoka's changes (baseline is 873.54 µs) is not encouraging so far:
Note that the branch you pushed uses the original size of 4096, whereas this branch here uses 64. If I use 4096 in both cases, I get

```
Bulk load complex geom  time:   [27.620 ms 27.653 ms 27.689 ms]
```

for this branch, and

```
Bulk load complex geom  time:   [36.660 ms 36.819 ms 37.012 ms]
```

for the eager out-of-line computation, which is still worse, but more reasonably close.
I was just wondering whether there was an obvious blunder somewhere, given the magnitude of the difference.
> If you check out that branch or PR you can push directly to it – I'm not sure where you'd like the `new_parent` change to go.

I don't think I can push there because I cannot push to this repository at all, but it isn't really necessary, I guess.
The diff for using the pre-computed envelopes when combining the leaf nodes into a parent node is just
```diff
diff --git a/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs b/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs
index 6b9aa63..bc7fbcb 100644
--- a/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs
+++ b/rstar/src/algorithm/bulk_load/bulk_load_sequential.rs
@@ -20,11 +20,20 @@ where
     let m = Params::MAX_SIZE;
     if elements.len() <= m {
         // Reached leaf level
-        let elements: Vec<_> = elements
+        let envelope = elements.iter().fold(
+            T::Envelope::new_empty(),
+            |mut envelope, (_element, envelope1)| {
+                envelope.merge(envelope1);
+                envelope
+            },
+        );
+
+        let children: Vec<_> = elements
             .into_iter()
             .map(|(element, _envelope)| RTreeNode::Leaf(element))
             .collect();
-        return ParentNode::new_parent(elements);
+
+        return ParentNode { children, envelope };
     }
     let number_of_clusters_on_axis =
         calculate_number_of_clusters_on_axis::<T, Params>(elements.len());
```
which for me brings the benchmark down to

```
Bulk load complex geom  time:   [23.106 ms 23.137 ms 23.177 ms]
```

which looks like the best result so far.
EDIT: Forgot to disable frequency boost, see below for the correct results.
(The other call to `ParentNode::new_parent` does not use the pre-computed envelopes, but I don't think this is an issue as it will produce parents-of-parents, and those already internally "cache" their envelope.)
I am sorry, but the numbers above are also incorrect as I only switched the CPU frequency governor but forgot to disable frequency boost. Doing that yields

```
Bulk load complex geom  time:   [28.556 ms 28.605 ms 28.659 ms]
```

using the eager computation for me, which is slightly worse than the approach presented here, but comes without the indefinite increase in memory usage.
With the latest changes in #117 I'm now getting

```
Bulk load complex geom  time:   [121.15 ms 121.52 ms 121.92 ms]
                        change: [-0.2388% +0.2131% +0.6614%] (p = 0.35 > 0.05)
                        No change in performance detected.
```

which is very encouraging.
Note that

```diff
diff --git a/rstar-benches/benches/benchmarks.rs b/rstar-benches/benches/benchmarks.rs
index 7de1601..5a3ec16 100644
--- a/rstar-benches/benches/benchmarks.rs
+++ b/rstar-benches/benches/benchmarks.rs
@@ -58,7 +58,7 @@ fn bulk_load_complex_geom(c: &mut Criterion) {
         let polys: Vec<_> = create_random_polygons(DEFAULT_BENCHMARK_TREE_SIZE, 4096, SEED_1);
 
         b.iter(|| {
-            RTree::<Polygon<f64>, Params>::bulk_load_with_params(polys.clone());
+            RTree::<Polygon<f64>, Params>::bulk_load_with_params(polys.iter().cloned());
         });
     });
 }
```
reduces the runtime further for me to

```
Bulk load complex geom  time:   [24.518 ms 24.567 ms 24.625 ms]
```

by avoiding the creation of the intermediate `Vec`.

But of course, downstream code would need to make the same change to benefit if we go for the `IntoIterator`-based API.
OK, we're currently still neck-and-neck on the new bulk load benchmark, but with lower memory usage in #117.
Here's the full suite: #117 vs master
```
     Running benches/benchmarks.rs (target/release/deps/benchmarks-f2d29554c7e4f7dd)
Bulk load baseline      time:   [172.08 µs 173.89 µs 176.19 µs]
                        change: [+6.0470% +6.9826% +7.8936%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

rstar and spade benchmarks/rstar sequential
                        time:   [1.8192 ms 2.0567 ms 2.3794 ms]
                        change: [+17.971% +27.835% +41.962%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Bulk load complex geom  time:   [119.35 ms 120.23 ms 121.23 ms]
                        change: [-82.387% -82.243% -82.102%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

bulk load quality       time:   [169.08 µs 170.87 µs 173.31 µs]
                        change: [-15.622% -8.0801% -1.2220%] (p = 0.03 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

sequential load quality time:   [219.94 µs 221.89 µs 223.86 µs]
                        change: [-6.5737% -0.3851% +4.7198%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

locate_at_point (successful)
                        time:   [341.55 ns 346.39 ns 353.21 ns]
                        change: [-2.8073% -1.4548% -0.1566%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

locate_at_point (unsuccessful)
                        time:   [450.39 ns 471.72 ns 506.93 ns]
                        change: [+3.1001% +6.1859% +9.5738%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
```
> Here's the full suite: https://github.com/georust/rstar/pull/117 vs master
Not sure if it explains all of the regression, but I think "Bulk load baseline" would also benefit from replacing `points.clone()` by `points.iter().cloned()`. For anything but "Bulk load complex geom", I do not understand how the changes from #117 would improve or regress performance...
I prefer this PR to #117 because this will also ensure memoizing during query and not just bulk insertion. Isn't that a better approach in general for the r-tree? I suppose complex geoms in an r-tree are a fairly typical use-case too. For instance, we do this in the geo crate.
I wonder if we can add associated types to the `RTreeObject` trait to handle the memory problem. So we add a cache type that defaults to the envelope type, but for point types, we could just make the cached type `()`.

Update: Looks like the associated-types approach is not very ergonomic. For instance, assoc. type defaults are still unstable, so it's a large breaking change to do this.
Come to think of it, I also wouldn't mind simply informing the user very clearly that the r-tree is not even looking at anything other than the envelope of the geometry, and it is probably useless to store complex geoms in the r-tree except as just data. They can use `GeomWithData<AABB, GeomType>`, where it is clear that the actual geometry is just a payload, and the r-tree just handles the envelope.
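A sketch of that pattern using rstar's `Rectangle` and `GeomWithData` primitives (the coordinates and label are illustrative):

```rust
use rstar::primitives::{GeomWithData, Rectangle};
use rstar::RTree;

// Index only the bounding box and carry the real geometry (here just a
// label standing in for it) as opaque payload data.
fn main() {
    let bbox = Rectangle::from_corners([0.0, 0.0], [1.0, 1.0]);
    let item = GeomWithData::new(bbox, "complex polygon #1");
    let tree = RTree::bulk_load(vec![item]);
    assert_eq!(tree.size(), 1);
}
```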
> I prefer this PR to https://github.com/georust/rstar/pull/117 because this will also ensure memoizing during query and not just bulk insertion. Isn't that a better approach in general for the r-tree?

Also thought about this, but I think the important envelopes (those of the parents) are already cached during querying in the current design. The problem really is bulk insertion, where they are not available, as we do not know yet which elements will become parents or leaves.
If the user really wants the envelope to be cached all the time, they can already do that inside of their type implementing `RTreeObject`. The bulk insertion performance without caching was just a much larger performance foot gun than querying, due to the above.
> Come to think of it, I also wouldn't mind simply informing the user very clearly that the r-tree is not even looking at anything other than the envelope of the geometry, and it is probably useless to store complex geoms in the r-tree except as just data. They can use GeomWithData<AABB, GeomType> where it is clear that the actual geometry is just a payload, and the r-tree just handles the envelope.

I think this is slightly too simplified: for parents, we really look only at envelopes (as they obviously have no other geometry of their own), and there they are also cached. But for the leaves, some operations do look at the individual geometry, e.g. the nearest neighbour search via the `PointDistance` trait.
If we really want envelopes to be cached all the time, but without any memory overhead where it is not necessary, I think we could also change the `RTreeObject` trait from

```rust
fn envelope(&self) -> Self::Envelope;
```

to

```rust
fn envelope(&self) -> &Self::Envelope;
```

which is basically impossible to implement without caching, but trivial to implement if the geometry type essentially is an `AABB`.
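A sketch of how such a by-reference trait could look; `RTreeObjectByRef` and `Box2D` are hypothetical names for illustration, not rstar API:

```rust
use rstar::{Envelope, AABB};

// Hypothetical by-reference variant of the trait: returning a reference
// forces the implementor to store the envelope somewhere, while a type that
// essentially *is* its envelope can simply hand out a reference to itself.
trait RTreeObjectByRef {
    type Envelope: Envelope;

    fn envelope(&self) -> &Self::Envelope;
}

struct Box2D(AABB<[f64; 2]>);

impl RTreeObjectByRef for Box2D {
    type Envelope = AABB<[f64; 2]>;

    fn envelope(&self) -> &Self::Envelope {
        &self.0
    }
}
```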
But as written above, I find the current trade-off (caching only parent envelopes and leaving leaf envelopes to the implementor) preferable if we fix the bulk insertion performance foot gun, i.e. I would argue for #117.
Note that both PRs add a 5-10% regression to the points case. I'm thinking we should help with ergonomic wrappers like `GeomWithData` that cache explicitly, rather than incur an unnecessary extra regression for the most typical cases.
For a larger redesign, I quite like the return-by-reference idea:

```rust
fn envelope(&self) -> &Self::Envelope;
```

Definitely worth evaluating this more (separately).
So something like

```rust
use rstar::RTreeObject;

struct WithCachedEnvelope<O: RTreeObject> {
    object: O,
    envelope: O::Envelope,
}

impl<O: RTreeObject> WithCachedEnvelope<O> {
    fn new(object: O) -> Self {
        // Compute the envelope once and keep it alongside the object.
        let envelope = object.envelope();
        Self { object, envelope }
    }
}

impl<O: RTreeObject> RTreeObject for WithCachedEnvelope<O>
where
    O::Envelope: Clone,
{
    type Envelope = O::Envelope;

    fn envelope(&self) -> Self::Envelope {
        self.envelope.clone()
    }
}
```

? I think this would be nicely composable and would support such a change as well.
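Hypothetical usage of that wrapper, with plain `[f64; 2]` points just for brevity (a real use would wrap something whose envelope is expensive to compute):

```rust
use rstar::RTree;

fn main() {
    // Cache each envelope once at construction time, then load as usual.
    let objects = vec![[0.0f64, 0.0], [1.0, 1.0], [2.0, 2.0]];
    let wrapped: Vec<_> = objects.into_iter().map(WithCachedEnvelope::new).collect();
    let tree = RTree::bulk_load(wrapped);
    assert_eq!(tree.size(), 3);
}
```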
I am somewhat surprised by the statement

> for the most typical cases.

though. Wouldn't one usually use the simpler k-d trees for point-like data and turn to R* trees only for more complicated geometry?
> I am somewhat surprised by the statement "for the most typical cases." though. Wouldn't one usually use the simpler k-d trees for point-like data and turn to R* trees only for more complicated geometry?
My experience (which may not be universal) is that in GIS / most OGC-adjacent geo applications there isn't much distinction made between types of spatial index; there's just the spatial index library that the ecosystem provides, plus PostGIS (which I believe allows more granular control over indices, but in general isn't customised much).
> My experience (which may not be universal) is that in GIS / most OGC-adjacent geo applications there isn't much distinction made between types of spatial index; there's just the spatial index library that the ecosystem provides, plus PostGIS (which I believe allows more granular control over indices, but in general isn't customised much).
Thank you for the explanation; this does explain my surprise: my initial contact with spatial indexes was in simulations, where I think it is not untypical to use different kinds of indexes for different kinds of geometry and/or different computational tasks. I guess, being part of the georust organisation, providing a robust default choice is indeed a high-priority requirement for this crate.
Yes, but I'm also conscious of the fact that our use case isn't the only one, nor necessarily the most important. We maintain the crate, but I don't want to privilege our use cases over others, so I think we're trying to figure out how to accommodate everyone while balancing the relatively small amount of dev resources we have.
Closing since we went for #118 instead.
@urschrei Could you delete the branch if you do not need it anymore?