Add Shape Support to BKD (extend to an R*/X-Tree data structure) [LUCENE-7110]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.67k stars 1.03k forks source link

Add Shape Support to BKD (extend to an R*/X-Tree data structure) [LUCENE-7110] #8165

Open asfimport opened 8 years ago

asfimport commented 8 years ago

I've been tinkering with this off and on for a while and its showing some promise so I'm going to open an issue to (eventually) add this feature to a future release.

R*/X-Tree is a data structure designed to support Shapes (2D, 3D, nD) where, like the internal node, the key for each leaf node is the Minimum Bounding Range (MBR - sometimes "incorrectly" referred to as Minimum Bounding Rectangle) of the shape. Inserting a shape then boils down to the best way of optimizing the tree structure. This optimization is driven by a set of criteria for choosing the appropriate internal key (e.g., minimizing overlap between siblings, maximizing "squareness", minimizing area, maximizing space usage). Query is then (a bit oversimplified) a two-phase process:

recurse each branch that overlaps with the MBR of the query shape
compute the relation with the leaf node(s) - in higher dimensions (3+) this becomes an increasingly difficult computational geometry problem

The current BKD implementation is a special simplified case of an R*/X tree where, for Point data, it is always guaranteed there will never be overlap between sibling nodes (because you're using the point data as the keys). By exploiting this property the tree algorithms (split, merge, etc) are relatively cheap (hence their performance boost over postings based numerics). By modifying the key data, and extending the tree generation algorithms BKD logic can be extended to support Shape data using the MBR as the Key and modifying split and merge based on the criteria needed for optimizing a shape-based data structure.

The initial implementation (based on limitations of the GeoAPI) will support 2D shapes only. Once the GeoAPI can performantly handle 3D shapes the change is relatively trivial to add the third dimension to the tree generation code.

Like everything else, this feature will be created in sandbox and, once mature, will graduate (likely to lucene-spatial or lucene-spatial-extras depending on the library needs).

Migrated from LUCENE-7110 by Nick Knize (@nknize), updated Jan 23 2018

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

After getting away from this issue for a bit to investigate using Range fields with doc values for the same purpose I wanted to raise attention back here to solicit some quick feedback.

What are the general thoughts on building on/extending the BKD/Points codec logic to create a complementary RTree/Ranges codec designed for fast bounding box index/search? The bones are already there, its just a matter of changing:

the {min/max}PackedValue logic in the reader/writer to represent bounding boxes/ranges
Add logic to handle coordinate system wrapping.

Overly simplified, of course, but this codec would effectively support spatial data only (not just spherical for lat/lon/alt, but also cartesian coordinates for 2/3D gaming). It would also likely be better suited for the existing range fields and would enable ranges to wrap the min/max boundary.

An alternative would be to add something like a rangeType property to IndexableFieldType and modify the existing Point codec to handle range encoding and coordinate system wrapping but I think that's too big of a risky hack.

Any thoughts?

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Nick, how would you compare your proposed approach with what lucene-spatial-extras SpatialPrefixTree does with the leaf node differentiator? I didn't fully understand your proposal above as it requires in-depth knowledge of the Points implementation code. The SpatialPrefixTree trailing "leaf" byte concept could, I imagine in theory, be added to the Points internals and not be very invasive/disruptive.

I think coordinate system wrapping could be handled separately and isn't strictly required.

One way to introduce the feature without fully committing to it (at first) is to have query/index intermediate code cast the underlying PointsFormat to an expected subclass with methods for these shapes. That's how the PointsAPI itself came into being – by casting the DocValuesFormat.

An alternative would be to add something like a rangeType property to IndexableFieldType and modify the existing Point codec to handle range encoding and coordinate system wrapping but I think that's too big of a risky hack.

I agree; this is too advanced/specialized. Perhaps in time as we feel out its impact.

GeoAPI

Is that an ElasticSearch thing?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If I understand correctly, leaf nodes would store the entire shape. One thing that concerns me is that for all leaves whose MBR intersects the query, there would be a lot of per-document work since the shape would need to be deserialized and then compared with the query. I'm afraid this would scale poorly when the number of indexed shapes increases?

Using a grid like the prefix tree strategy from lucene-spatial seems like it could at least run INTERSECTS queries efficiently. Other relations sound more complicated however since they require knowledge of the entire shape, so I'm afraid they can't be implemented without a full scan?

In the end I'm wondering that the R-tree doesn't seem to bring much compared to indexing shapes with a grid approach to have a good approximation of shapes that intersect the query, and potentially storing shapes (or metadata about the shapes) in a doc-value field in order to support other relations that INTERSECTS using a two-phase iterator? I'm mentioning this in order to get feedback about whether I'm missing something, but this per-document cost still worries me and leads me to thinking that it might be saner to only support intersection?

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

If I understand correctly, leaf nodes would store the entire shape.

Not exactly. For shapes I still think using ranges with doc value shapes, or (as suggested) approximating w/ raster cells is the right way to go. These shapes can get awfully large and storing them as the leaf nodes I think is a non-starter. Maybe I should back up a bit. I'm actually just proposing either a new codec (or inheriting from/extending the existing Points codec) that's designed and optimized to index and search ranges/rectangles. I don't want, nor I think its worth it, to get hung up on the geo use-case or complex polygon shapes at the moment. Right now we have the BKD structure. Which is designed for point data only. I then created a franken-encoding for Range fields that basically tricks BKD into thinking dimensional ranges are points. It works, but we can do better on the tree organization for ranges by considering other criteria during split. This is really all I am and should be proposing at the moment. We can save the shape indexing and geo use cases for a later discussion. Right now I think its worthwhile sandboxing an R-Tree based codec or modification to Points. Once that's done, I'll provide some benchmarks to compare ranges encoded using the Points structure vs. encoded using the R-Tree structure. Thoughts?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Then shouldn't we just add Range? It'd like to see it generic like points, e.g. it would support different dimensions (1D, 2D) ranges etc. I think it should be purely byte based and not bake in stuff like coordinate systems or any of that: the Query side can handle that, just like the Point queries handle it for their cases (box, radius, etc).

But I don't think we should conflate/mess with points. They already cover their use case well and we shouldn't make them more complex.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Nick, I get it better now. I'm curious if you know about algorithms to build balanced R-trees in an offline fashion? The algorithms I have been looking at seem to assume random access since they recursively add the MBR to the sub tree that would result in the least enlargement of the MBR.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Nick your last explanation helped clarity things a lot – thanks! So the objective is better performance for indexed rectangles over what LatLonBoundingBox is able to do today with multi-dimensional point tricks? Do you think the performance is bad enough to warrant the maintenance & complexities in what you describe? I have yet to use LatLonBoundingBox but it "seems" like it would be very efficient, even if there may be more efficient ways involving new codecs as you describe.

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

Some really good ideas here. Thank you all for the feedback.

Then shouldn't we just add Range? It'd like to see it generic like points, e.g. it would support different dimensions (1D, 2D) ranges etc. I think it should be purely byte based and not bake in stuff like coordinate systems or any of that: the Query side can handle that, just like the Point queries handle it for their cases (box, radius, etc).

This is exactly what I was thinking. Coordinate system agnostic and optimized for ranges, boxes, cubes, etc. I'm really looking to see what everyone thinks about it being its own companion codec to Points vs. extending Points. I like the cleanliness and safety of it being its own thing and not mucking around with or conflating Points since it already works well for its intended use case. But I didn't want to head down that path if there are reasonable performance or maintenance concerns.

I'm curious if you know about algorithms to build balanced R-trees in an offline fashion?

Oh, I like where you're going with this. There are quite a few actually. Most of the offline R-Tree builders are designed for the case where all of the data is known in advance. I think it would be super efficient to build the sub-optimal random access R-Trees on a per segment basis (since data is rarely provided at once) and in the segment merge phase use the offline approach to optimize the tree. Is that along the lines of what you were thinking?

So the objective is better performance for indexed rectangles over what LatLonBoundingBox is able to do today...

Kind of. The objective is really better performance for all indexed ranges; not just LatLonBoundingBox but sibling types as well (DoubleRange FloatRange etc.)

Do you think the performance is bad enough to warrant the maintenance & complexities in what you describe?

I'd be lying if I said I had numbers to confirm and back up "bad enough". So I'm honestly not entirely sure until I get into some benchmarking. I do know there are evil cases (as always) where we could really optimize the tree to alleviate some of the linear scan issues we occasionally see today. Another big part of this is to really create the right tree/codec for the job; Points for Points and Range for Ranges. There will be a LOT of similarities between the two (since they're both height balanced trees) but it does open us up to really tweaking the tree construction algorithm (split/merge) around better data heuristics.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think it would be super efficient to build the sub-optimal random access R-Trees on a per segment basis (since data is rarely provided at once) and in the segment merge phase use the offline approach to optimize the tree.

Sounds awesome! This would apply to our existing Points codec too?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This is exactly what I was thinking. Coordinate system agnostic and optimized for ranges, boxes, cubes, etc. I'm really looking to see what everyone thinks about it being its own companion codec to Points vs. extending Points.

I don't think it should be either. I am talking about the api only: e.g. Codec.RangeFormat. I'm not particularly interested in any type of trees or anything like that, those are boring, its more about the apis being correct.

Points is for points, the name says it all :) To me its unrelated. So I would rather see a separate Range type to take care of intervals in various dimensions: date/time ranges, IPv4/IPv6 address subnets, bounding boxes, etc.