Open asfimport opened 8 years ago
Karl Wright (@DaddyWri) (migrated from JIRA)
I looked briefly at class LatLonPointDistanceComparator extends FieldComparator<Double> implements LeafFieldComparator. This implementation is clearly not general across search shapes. But I think one could write a geo3d equivalent that would work for any GeoDistance implementation with a provided DistanceStyle. This would also serve to find objects within the implied Membership implementation, because the contract for GeoDistance calls for it to return MAX_VALUE when the point is outside the shape, so it would work beautifully.
The big problem is that there's not much performance gain possible because there's currently no geo3d general means of going from a GeoDistance object and distance measurement to a bounding XYZSolid. That is, there's no equivalent to Rectangle.fromPointDistance(). Solving that is a harder problem and would require extensions to all the shape objects that implement GeoDistance. But I think that would clearly be step 2 anyway.
Geo3DPointDistanceComparator would be constructed, therefore, with a GeoDistance object and a DistanceStyle object, and would call GeoDistance.computeDistance() using these parameters when it needed to find a distance. The decoding/encoding of XYZ values is most of the class.
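The core of the comparator described above can be sketched in plain Java. All names here are hypothetical: a plain function stands in for GeoDistance.computeDistance() with a fixed DistanceStyle, and doc values are reduced to one dimension for brevity.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of the comparator core described above. A distance
// function stands in for GeoDistance.computeDistance(...) with a fixed
// DistanceStyle; it returns Double.MAX_VALUE for points outside the shape,
// so out-of-shape documents naturally sort last.
public class DistanceSortSketch {
  // Sort document ids by the distance of their (1-D, for brevity) value from a shape.
  static Integer[] sortByDistance(double[] docValues, DoubleUnaryOperator distance) {
    Integer[] ids = new Integer[docValues.length];
    for (int i = 0; i < ids.length; i++) ids[i] = i;
    Arrays.sort(ids, Comparator.comparingDouble(id -> distance.applyAsDouble(docValues[id])));
    return ids;
  }

  public static void main(String[] args) {
    // "Shape" = interval [0, 1]; distance = distance to 0.5, MAX_VALUE outside the shape.
    DoubleUnaryOperator d = v -> (v < 0.0 || v > 1.0) ? Double.MAX_VALUE : Math.abs(v - 0.5);
    Integer[] order = sortByDistance(new double[] {0.9, 2.0, 0.4}, d);
    System.out.println(Arrays.toString(order)); // [2, 0, 1]: doc 1 is outside the shape, so it sorts last
  }
}
```

The MAX_VALUE contract is what makes the same comparator double as a membership filter: out-of-shape hits never compete in the priority queue.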
Michael McCandless (@mikemccand) (migrated from JIRA)
Maybe we could start with the equivalent of LatLonPoint.newDistanceSort? This would sort all hits according to their earth-as-sphere-surface distance (haversine) from the provided point.
It's really cool that geo3d can also get you distance from any shape, so maybe it could have two newDistanceSort methods, one taking lat/lon and another taking a GeoShape?
Ideally we wouldn't need to expose things like GeoDistance and DistanceStyle to the user? Can these somehow be under-the-hood implementation details?
Karl Wright (@DaddyWri) (migrated from JIRA)
Ideally we wouldn't need to expose things like GeoDistance and DistanceStyle to the user? Can these somehow be under-the-hood implementation details?
The classes that correspond to LatLonPointDistanceComparator and LatLonPointSortField will need to have a GeoDistance and DistanceStyle passed to them.
The newDistanceSort method would need to be equivalent to the LatLonPoint API, so it would construct a GeoCircle that occupies the whole world with the specified center, and use a standard arc distance DistanceStyle. But the problem is that such a query is completely unconstrained without the ability to go from shape/distance to XYZSolid bound, so it would not perform reasonably in that form. If you introduced an upper bound to the circle radius as an additional argument, then it would work fine without additional geo3d code.
I would in general think that all "newXXXQuery" methods would have a corresponding "newXXXSort" method, if the underlying shape implements GeoDistance. So there would be "newDistanceSort" and "newPathSort", but not "newBoxSort" or "newPolygonSort". The arguments would have to correspond exactly, though, to "newXXXQuery", including distance bounds, for the geo3d API.
Karl Wright (@DaddyWri) (migrated from JIRA)
FWIW, the way I'd add bounds computation to shapes by distance would be as follows:
(1) Add a method to the GeoDistance interface:
void getDistanceBounds(DistanceStyle ds, Bounds xxx, double distanceValue);
The purpose of this method would be to compute the bounds for the current shape where the chosen distance metric is less than the specified distanceValue.
(2) Add this method to the following classes: GeoBaseDistanceShape, GeoDegeneratePoint, GeoStandardCircle, GeoStandardPath.
The base class implementation would check if distanceValue was equal to MAX_VALUE, and if so would simply compute the bounds for the current shape. If less than that value, it would call an abstract method that would do the job. For GeoStandardCircle and GeoStandardPath, the approach taken for doing the actual computation would involve newly constructing a constrained GeoStandardCircle or GeoStandardPath, and then computing the bounds from that. There would likely be new constructors, therefore, which would hopefully repurpose planes etc. if possible from the starting GeoStandardCircle or GeoStandardPath. It may be easier code-wise to just build the new shapes from scratch, though.
Creating a new complex object is expensive to do both mathematically and in terms of object creation. It really cannot be easily avoided, however. That means that when you call getDistanceBounds() you should expect it to be expensive enough that you wouldn't want to do it on every document addition, or even every 10.
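The base-class dispatch described above is essentially the template-method pattern. A minimal sketch, with all names hypothetical and Bounds reduced to a 1-D interval for brevity:

```java
// Hypothetical template-method sketch of the getDistanceBounds() dispatch
// described above, with Bounds reduced to a 1-D interval. The base class
// handles the unbounded (MAX_VALUE) case by returning the shape's full
// bounds; subclasses only implement the distance-constrained case.
public abstract class DistanceShapeSketch {
  static final class Bounds {
    final double min, max;
    Bounds(double min, double max) { this.min = min; this.max = max; }
  }

  abstract Bounds fullBounds();                       // bounds of the whole shape
  abstract Bounds boundsWithin(double distanceValue); // bounds where distance < distanceValue

  final Bounds getDistanceBounds(double distanceValue) {
    if (distanceValue == Double.MAX_VALUE) {
      return fullBounds(); // unconstrained: just compute the shape's own bounds
    }
    return boundsWithin(distanceValue); // constrained: delegate to the subclass
  }

  // A "circle" on a line: center +/- radius, optionally clipped by distance.
  static DistanceShapeSketch circle(double center, double radius) {
    return new DistanceShapeSketch() {
      Bounds fullBounds() { return new Bounds(center - radius, center + radius); }
      Bounds boundsWithin(double d) {
        double r = Math.min(radius, d); // a tighter circle, as with GeoStandardCircle
        return new Bounds(center - r, center + r);
      }
    };
  }

  public static void main(String[] args) {
    DistanceShapeSketch c = circle(0.0, 2.0);
    System.out.println(c.getDistanceBounds(Double.MAX_VALUE).max); // 2.0
    System.out.println(c.getDistanceBounds(0.5).max);              // 0.5
  }
}
```

The real geo3d version would construct a constrained GeoStandardCircle or GeoStandardPath inside boundsWithin(), which is exactly the expensive step the caller is warned about above.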
Robert Muir (@rmuir) (migrated from JIRA)
How expensive are geo3d's distance formulas though?
The 2D distance comparator examines the least-competitive element of the priority queue when it changes, to create a bounding box that can quickly reject hits that won't compete anyway in a cheaper way than invoking haversin distance.
In the worst case, this will happen for every document, so it has a guard in the code to start doing it less often (only every Nth update) if it gets called too many times.
The optimization helps for two reasons in 2D.
For 3D the tradeoffs may be different.
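The back-off guard mentioned above can be sketched in isolation (the thresholds here are hypothetical, not the ones the 2D comparator actually uses):

```java
// Standalone sketch of the "recompute less often" guard described above:
// rebuild the cheap bounding check when the priority queue's worst entry
// changes, but after too many rebuilds, only rebuild on every Nth change.
// The thresholds are hypothetical, not LatLonPointDistanceComparator's.
public class BoundGuardSketch {
  private static final int MAX_EAGER_REBUILDS = 4; // after this many eager rebuilds...
  private static final int REBUILD_EVERY = 8;      // ...back off to every 8th update

  private int rebuilds = 0;
  private int skipped = 0;

  /** Returns true if the bound should be recomputed for this queue update. */
  boolean shouldRebuild() {
    if (rebuilds < MAX_EAGER_REBUILDS) {
      rebuilds++;
      return true; // still cheap: rebuild on every update
    }
    if (++skipped >= REBUILD_EVERY) {
      skipped = 0;
      rebuilds++;
      return true; // backed off: rebuild only on every Nth update
    }
    return false;
  }

  public static void main(String[] args) {
    BoundGuardSketch g = new BoundGuardSketch();
    int recomputed = 0;
    for (int update = 0; update < 100; update++) {
      if (g.shouldRebuild()) recomputed++;
    }
    System.out.println(recomputed); // 4 eager + every 8th of the remaining 96 = 16
  }
}
```

This bounds the worst case: an adversarial sort order can force many queue updates, but the expensive bound recomputation stays amortized.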
Karl Wright (@DaddyWri) (migrated from JIRA)
How expensive are geo3d's distance formulas though?
It varies – that's one of the reasons there are different DistanceStyle implementations. The cheapest (squared linear distance) is quite cheap; the most expensive (arc distance) requires several Math.acos invocations (for a path).
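One reason a cheap DistanceStyle can substitute for an expensive one when only ordering matters: on the unit sphere, squared linear (chord) distance and arc distance are both monotone in the dot product, so they rank points identically. A small illustration (helper names are hypothetical):

```java
// Sketch illustrating why a cheap metric can stand in for an expensive one
// when only ordering matters: for unit vectors a and b, squared chord
// distance is |a-b|^2 = 2 - 2*dot(a,b) (just multiplications), while arc
// distance is acos(dot(a,b)) (a transcendental call). Both are monotone
// decreasing in dot(a,b), so they sort points the same way.
public class DistanceStyleSketch {
  static double dot(double[] a, double[] b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
  }

  static double chordSquared(double[] a, double[] b) {
    return 2.0 - 2.0 * dot(a, b); // cheap: squared linear distance
  }

  static double arc(double[] a, double[] b) {
    // expensive: arc distance; clamp guards against rounding outside [-1, 1]
    return Math.acos(Math.max(-1.0, Math.min(1.0, dot(a, b))));
  }

  public static void main(String[] args) {
    double s = Math.sqrt(0.5);
    double[] origin = {1, 0, 0};
    double[] near = {s, s, 0}; // 45 degrees away
    double[] far = {0, 1, 0};  // 90 degrees away
    // The orderings agree even though the values differ:
    System.out.println(chordSquared(origin, near) < chordSquared(origin, far)); // true
    System.out.println(arc(origin, near) < arc(origin, far));                   // true
  }
}
```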
Robert Muir (@rmuir) (migrated from JIRA)
Also the encoding plays a role: to support sorting at all geo3d needs to figure out how to encode it in docvalues.
Does geo3d also need two-phase query support? If so, this also relates to how to encode in docvalues (and whether it's necessary, or maybe should be optional, or any of that). We don't benchmark this at all, but it seems to be almost necessary for 2D to meet user expectations, so 2D requires it. In other words, a distance or polygon query may be slow, but if users restrict the dataset with other filters then it's less slow, since it defers per-hit computations until they're truly needed. If this is not needed for 3D, it might simplify how to support sorting, as any encoding need only be geared at that.
Karl Wright (@DaddyWri) (migrated from JIRA)
Does geo3d also need two-phase query support?
Probably. The same arguments apply here for 3D vs 2D. But I don't know what it entails.
Robert Muir (@rmuir) (migrated from JIRA)
It varies – that's one of the reasons there are different DistanceStyle implementations. The cheapest (squared linear distance) is quite cheap; the most expensive (arc distance) requires several Math.acos invocations (for a path).
I would start simple as mike suggests and not try to make a generic solution: maybe just support one of them as a start. Sorting can be a real hotspot: look at all the craziness the 2D one does to speed up/avoid haversin computations. The best solution is very much tied to both the formula in question and the underlying encoding.
Karl Wright (@DaddyWri) (migrated from JIRA)
Also the encoding plays a role: to support sorting at all geo3d needs to figure out how to encode it in docvalues.
Right, some six months ago I attempted to code a packing of (X,Y,Z) into a single integer, using knowledge of the IEEE floating point spec to make this fast. I'll have to find or reconstruct this. What characteristics does this packing need to have in order to function properly in a sorted environment? In a two-phase environment?
Robert Muir (@rmuir) (migrated from JIRA)
This is why I asked if we need two-phase. Maybe we need to extend the benchmark first to understand whether it's really necessary?
When I look at what geo3d computes per-doc for a distance query, it seems much less expensive (e.g. just some multiplications), and I think this is also why it's faster at distances. For polygons we need to investigate too (these get slower, linearly, as polygon complexity increases with 2D today).
If geo3d doesn't have these problems, and doesn't really need two-phase, then we simplify the problem: what goes in there need not be 100% consistent with the query (OK, it could cause some confusion if it's not, but it won't break things). This means the encoding could be different, more lossy, etc.
Karl Wright (@DaddyWri) (migrated from JIRA)
This is why I asked if we need two-phase. Maybe we need to extend the benchmark first to understand whether it's really necessary?
+1 to that. I'm operating in the dark at the moment. I presume in order to do that we have to have a naive implementation available?
When I look at what geo3d computes per-doc for a distance query, it seems much less expensive (e.g. just some multiplications), and I think this is also why it's faster at distances. For polygons we need to investigate too (these get slower, linearly, as polygon complexity increases with 2D today). If geo3d doesn't have these problems, and doesn't really need two-phase, then we simplify the problem:
geo3d polygon performance will still degrade linearly with the number of sides, but yes, your assessment of the cheapness of evaluating any one of those sides is accurate, as is the cheapness of evaluating circle membership.
and what goes in there need not be 100% consistent with the query (OK, it could cause some confusion if it's not, but it won't break things). This means the encoding could be different, more lossy, etc.
Ok, you've lost me here, but that's my problem. :-)
Robert Muir (@rmuir) (migrated from JIRA)
+1 to that. I'm operating in the dark at the moment. I presume in order to do that we have to have a naive implementation available?
We just need to issue queries in the benchmark that are e.g. AND'd with a filter that only matches X% of the documents. This is a good thing to benchmark anyway; it's far more realistic to expect there is more to the query than just a spatial component (e.g. "pizza near my house").
Lemme try to dig into this...
Karl Wright (@DaddyWri) (migrated from JIRA)
As for encoding, we basically get 20 bits per x, y, or z value. If proximity is a requirement, then 3D morton encoding can be used:
... but maybe that's not needed. The twenty bits I would extract from the IEEE double after scaling by the planet model's maximum distance in the corresponding dimension. We'd then know that the max value is 1.0 and the min value is -1.0, and thus we can select the most significant mantissa bits. Or maybe it's easier to just multiply and Math.floor. We'll see. But 2^20 ~= 10^6, so precision is not great, but probably enough.
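A sketch of the multiply-and-Math.floor variant (all names are hypothetical, not the committed encoding; it assumes x, y, z are already scaled into [-1, 1] by the planet model's maximum extent). The quantization step is 2/(2^20-1), about 1.9e-6, consistent with the "not great, but probably enough" estimate above.

```java
// Hypothetical sketch of the multiply-and-Math.floor encoding described
// above: each of x, y, z is assumed pre-scaled into [-1, 1], quantized to
// 20 bits, and the three values are packed into the low 60 bits of a long.
public class XYZEncodingSketch {
  private static final int BITS = 20;
  private static final long MASK = (1L << BITS) - 1; // 0xFFFFF
  private static final double SCALE = MASK / 2.0;    // maps [-1, 1] onto [0, 2^20-1]

  static long encodeDim(double v) {
    return (long) Math.floor((v + 1.0) * SCALE); // multiply and floor
  }

  static double decodeDim(long bits) {
    return bits / SCALE - 1.0;
  }

  static long pack(double x, double y, double z) {
    return (encodeDim(x) << (2 * BITS)) | (encodeDim(y) << BITS) | encodeDim(z);
  }

  static double unpackX(long packed) { return decodeDim((packed >>> (2 * BITS)) & MASK); }
  static double unpackY(long packed) { return decodeDim((packed >>> BITS) & MASK); }
  static double unpackZ(long packed) { return decodeDim(packed & MASK); }

  public static void main(String[] args) {
    double step = 2.0 / MASK; // worst-case round-trip error, about 1.9e-6
    long packed = pack(0.3, -0.7, 0.001);
    System.out.println(Math.abs(unpackX(packed) - 0.3) < step);    // true
    System.out.println(Math.abs(unpackY(packed) - (-0.7)) < step); // true
    System.out.println(Math.abs(unpackZ(packed) - 0.001) < step);  // true
  }
}
```

Note this plain per-dimension packing does not preserve spatial proximity in the packed long's numeric order; that is what the Morton interleaving mentioned above would buy, if it were needed.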
Robert Muir (@rmuir) (migrated from JIRA)
I ran some rough benchmarks. It's tough because the filter we use is itself costly :) Also, I use an ancient JVM and hardware, and my numbers are in general very different from what Mike reports.
2D distance with 100% filter: 16.0 QPS
2D distance with 10% filter: 24.1 QPS
2D distance with 1% filter: 42.7 QPS
2D distance with 0.1% filter: 51.1 QPS
2D distance with 0.0001% filter: 55.7 QPS
3D distance with 100% filter: 21.7 QPS
3D distance with 10% filter: 31.9 QPS
3D distance with 1% filter: 36.6 QPS
3D distance with 0.1% filter: 38.2 QPS
3D distance with 0.0001% filter: 42.3 QPS
For distance: I'm not sure geo3d needs two-phase support: I don't think its per-doc checks are so expensive? For 2D we should reinvestigate it too: maybe it has outgrown the need for a "two-phase crutch". Maybe it's really overkill after #8202, since there are no longer tons of per-doc boundary cases being evaluated. And there is truly some cost for "throwing away" the data points that BKD sends us and then suffering docvalues reads later to fetch them from a different storage.
For polygons: it's hard for me to see. Both 2D and 3D get really slow as polygon complexity increases. For 2D the per-doc check is O(#vertices). But 2D also has an expensive up-front cost (grid construction) that you can see bounds the performance of these complex queries regardless of how many hits they have. For 3D I'm not sure where the heavy cost lies with these. It might not be related to per-doc checks at all.
Polygon (n=500)
2D poly with 100% filter: 7.2 QPS
2D poly with 10% filter: 16.1 QPS
2D poly with 1% filter: 25.9 QPS
2D poly with 0.1% filter: 28.6 QPS
2D poly with 0.0001% filter: 29.8 QPS
3D poly with 100% filter: 1.9 QPS
...
3D poly with 0.0001% filter: 2.0 QPS
This is my commit: https://github.com/mikemccand/luceneutil/commit/4e22c6abeedb00bf6de66d4c3fbefbd95cb26b98
Anyway, take the numbers with a grain of salt. It might be worthwhile to investigate what is happening with polygons with geo3d, though; I think something is not right there. Lots more experiments needed...
Karl Wright (@DaddyWri) (migrated from JIRA)
It might be worthwhile to investigate what is happening with polygons with geo3d, though; I think something is not right there.
Something is definitely funny. Did you try this with prebuilt queries, or without? Query construction is likely to be slow, and this is known – it's an O(N^2) algorithm in places.
@mikemccand: any idea why these numbers would be so different from the ones you reported earlier?
Robert Muir (@rmuir) (migrated from JIRA)
Well, I don't think we should do "prebuilt queries". If it's expensive to do up-front stuff, this should be part of the cost. 2D has prebuilding too at the moment; it happens in createWeight, though. But to an end user, it's still latency.
Michael McCandless (@mikemccand) (migrated from JIRA)
Michael McCandless: any idea why these numbers would be so different from the ones you reported earlier?
I think I tested 10-gons in my previous tests? Seems like 500-gons add quite a bit more cost?
Karl Wright (@DaddyWri) (migrated from JIRA)
Well, I don't think we should do "prebuilt queries". If it's expensive to do up-front stuff, this should be part of the cost. 2D has prebuilding too at the moment; it happens in createWeight, though. But to an end user, it's still latency.
It's latency that can be controlled, though - the typical use case for very large polygons is for them to be associated with countries etc., and there are not many of those, and they can be prebuilt. So I think it is worth considering the two costs separately.
I can make polygon building much faster if I know that you are handing me well-formed convex polygons (or concave ones). The time comes from tiling irregular polygons in a way that doesn't violate convexity/concavity constraints. But that changes the API quite a bit then. :-)
Robert Muir (@rmuir) (migrated from JIRA)
No need to change the API. Detecting these things can be done in linear or n log n time. I continue to insist that users do not need to concern themselves with this stuff.
Karl Wright (@DaddyWri) (migrated from JIRA)
I think I tested 10-gons in my previous tests? Seems like 500-gons add quite a bit more cost?
Of course they do. If it were O(N), it would take roughly 50x as much time to build a query; since this is O(N^2), it's up to 2500x worse, approximately.
I still maintain that anyone who is constructing a 500-point polygon for each query in real life needs to have their head examined.
Karl Wright (@DaddyWri) (migrated from JIRA)
Well, go ahead, have a look at that code then. It's all in GeoPolygonFactory. I'm open to suggestions.
Karl Wright (@DaddyWri) (migrated from JIRA)
detecting these things can be done in linear or n log n time.
Ok, this code snippet from GeoConvexPolygon does nothing more than confirm that a polygon is in fact convex in the 3D world. Can you point me at a resource that would describe how to do this in O(n log n) time? The actual checks that are done in GeoPolygonFactory are a good deal more complex than this but one algorithm would likely solve both problems.
In a nutshell, the problem in the 3D planar world is that relationships are not transitive. If you show that a point is within an edge, and you construct a new edge with it, and then you find a subsequent point, there's no guarantee that the newest point will be within the oldest edge.
As a practical matter, this needs to be done only once (in GeoPolygonFactory), and this check can go away from GeoConvexPolygon since it's now package private. But that does not stop the need to perform a check of this kind and break up the original polygon accordingly.
// In order to naively confirm that the polygon is convex, I would need to
// check every edge, and verify that every point (other than the edge endpoints)
// is within the edge's sided plane. This is an order n^2 operation. That's still
// not wrong, though, because everything else about polygons has a similar cost.
for (int edgeIndex = 0; edgeIndex < edges.length; edgeIndex++) {
final SidedPlane edge = edges[edgeIndex];
for (int pointIndex = 0; pointIndex < points.size(); pointIndex++) {
if (pointIndex != edgeIndex && pointIndex != legalIndex(edgeIndex + 1)) {
if (!edge.isWithin(points.get(pointIndex)))
throw new IllegalArgumentException("Polygon is not convex: Point " + points.get(pointIndex) + " Edge " + edge);
}
}
}
Karl Wright (@DaddyWri) (migrated from JIRA)
I've moved the polygon construction cost discussion to #8271.
Robert Muir (@rmuir) (migrated from JIRA)
This is the one i was investigating for 2D: https://en.wikibooks.org/wiki/Algorithm_Implementation/Geometry/Convex_hull/Monotone_chain
But there are some steps/cleanups before this can be done/used in a reasonable fashion
Karl Wright (@DaddyWri) (migrated from JIRA)
For 2D, there's quite a bit to choose from, e.g. the O(N) cross-product solution described here: http://stackoverflow.com/questions/471962/how-do-determine-if-a-polygon-is-complex-convex-nonconvex
For 3D it's more challenging. I'll keep thinking about it, of course...
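The O(N) cross-product test from that link, sketched for the 2D case (it assumes a simple, non-self-intersecting polygon, as the answers there note):

```java
// Sketch of the O(N) cross-product convexity test linked above: for each
// pair of consecutive edges, compute the z-component of their cross
// product; the polygon is convex iff every nonzero z has the same sign.
// Assumes a simple, non-self-intersecting polygon.
public class ConvexityCheckSketch {
  static boolean isConvex(double[][] pts) {
    int n = pts.length;
    int sign = 0; // sign of the first nonzero turn
    for (int i = 0; i < n; i++) {
      double[] a = pts[i], b = pts[(i + 1) % n], c = pts[(i + 2) % n];
      // z-component of (b - a) x (c - b): positive = left turn, negative = right turn
      double z = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0]);
      if (z != 0) {
        int s = z > 0 ? 1 : -1;
        if (sign == 0) sign = s;
        else if (s != sign) return false; // turn direction flipped: concave
      }
    }
    return true;
  }

  public static void main(String[] args) {
    double[][] square = {{0, 0}, {1, 0}, {1, 1}, {0, 1}};
    double[][] dart = {{0, 0}, {2, 0}, {2, 2}, {1, 0.5}}; // reflex vertex at (1, 0.5)
    System.out.println(isConvex(square)); // true
    System.out.println(isConvex(dart));   // false
  }
}
```

This is exactly the transitivity the 3D planar world lacks: here each turn depends only on three consecutive vertices, while a 3D sided-plane check can be violated by a point arbitrarily far along the edge list.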
Robert Muir (@rmuir) (migrated from JIRA)
Yeah, for this issue, it would be great to just confirm the cost is really an up-front issue for polygons. As long as we are confident in that (and it looks that way to me), then we can continue here unblocked.
I have the feeling thats all it is, and that Geo3D doesn't really need two-phase iteration. (Hopefully 2D won't need it either, if it can clean up its polygon act to not be trappy).
So I think this issue can just focus on docvalues for distance?
As far as LatLonPoint goes, if we can remove the need for it to use docvalues in polygon queries, I think its better to have a separate *Field for going into docvalues.
It means that the docvalues become optional, just like they are for the other fields in Lucene. You add the docvalues field to your doc if you want sorting, faceting, expressions, etc., but you don't need it if you are just doing searches; and vice versa, you don't need the multi-dimensional points index if you are just doing sorting/faceting/expressions. So the DocValues field really is a completely unrelated thing. We might want to do it the same way for 3D.
Karl Wright (@DaddyWri) (migrated from JIRA)
So I think this issue can just focus on docvalues for distance?
+1
We might want to do it the same way for 3D.
I'm happy to follow your lead. I was going to model the 3D Comparator/Sort code on LatLon's. If you want to make structural changes first, let me know and I'll hold off until you're ready.
Robert Muir (@rmuir) (migrated from JIRA)
I don't think the SortField/Comparator would change at all. It's just that if we can make polygon queries not need a "two-phase crutch", then docvalues are no longer needed for queries. So we'd add something like LatLonDocValuesField (ugh, will mull on the name) and move the newSortField() method to it, and turn off the docvalues in LatLonPoint: they're no longer needed for queries.
newSortField() and any other docvalues-oriented methods (e.g. getting a #6389 "stream" of latitudes or longitudes for expressions integration) would be on that LatLonDocValuesField: it just concerns itself with docvalues, and the LatLonPoint just concerns itself with the points stuff.
Just like the separation of DoublePoint vs DoubleDocValuesField in core/.
Karl Wright (@DaddyWri) (migrated from JIRA)
I'm still holding off on doing anything here for two reasons: (1) I need to finish figuring out the math of producing a bound that is constrained by distance for geo3d, and (2) I want to wait until the docvalues field is pulled out of the corresponding LatLon classes before introducing geo3d classes.
If I'm not very much mistaken, (1) is crucial for this functionality to be a performance win, so I need to make sure the math works first. I whacked away at it for a good chunk of my non-meeting hours yesterday before getting distracted by general large-polygon performance concerns, but I ran into trouble with expression complexity related to the ellipsoid nature of WGS84. I probably need to transform the problem in some way to make it tractable; I tried one way already that looked like it would work, but it still wound up producing too messy an expression to deal with. Working more on this over the weekend...
Robert Muir (@rmuir) (migrated from JIRA)
OK, but #2 should take some time. I sorta see it as really important "sandiness" to fix, though :). In the meantime, if you want to proceed with 3D and come up with e.g. a 64-bit encoding, I can help get the docvalues field together or whatever you need.
As for whether #1 is crucial, I think you have to do some experiments to find out? In 2D we know the bounding box check is a big win. This optimization is a very tricky balance, because computing the bound is itself expensive, and users do crazy things (sort with enormous top Ns, or the data could somehow be in an adversarial sort order, causing huge numbers of priority queue updates). The situation for 3D may be different.
Karl Wright (@DaddyWri) (migrated from JIRA)
Ok, I have the math worked out for one metric: arc distance. But now additional unrelated problems have landed on my plate... looking at those first. If I get done with them before you're ready with #2, I'll get started.
ASF subversion and git services (migrated from JIRA)
Commit 23422908165f62581c271524955af2ab0e6e069f in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2342290
LUCENE-7212: Structural changes necessary to support distance-limited bounds.
ASF subversion and git services (migrated from JIRA)
Commit 4dd7c0c9191b0314e6de464f3998ae09ed859f54 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4dd7c0c
LUCENE-7212: Structural changes necessary to support distance-limited bounds.
Michael McCandless (@mikemccand) (migrated from JIRA)
So the 2D world has now separated out doc values (LatLonDocValuesField), used if you need distance sorting, from points (LatLonPoint), used if you need filtering by shape.
It sounds like 3D can likely do the same? I.e., we don't need two-phase support for geo3d shapes since the per-hit check is fast?
In which case ... we can do a more lossy (x/y/z -> 64 bits) encoding, solely for distance sorting? Will the geo3d distance-checking APIs be upset about the heavy quantization (21 bits per value, instead of the 32 bits per value the points index is using)? I know we've had issues with "the point is not on the earth's surface" already with 32 bits per x/y/z.
Alternatively, we could maybe use BINARY doc values and keep 32 bits per component ... but this gets messy with multi-valued fields. 2D (LatLonDocValuesField) does support multi-valued fields, but I wonder how necessary that really is ...
Karl Wright (@DaddyWri) (migrated from JIRA)
@mikemccand, the resolution is quite important for determining in-set/out-of-set status for shapes like small circles. However, for distance measurement/ordering, I can't see any need to be that precise, so if the docvalues are only used for distance computation, then 64 bits should be OK.
Karl Wright (@DaddyWri) (migrated from JIRA)
Here's a geo3d doc values implementation.
ASF subversion and git services (migrated from JIRA)
Commit 351878223ddbb31722ee00882a214bf252378c65 in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3518782
LUCENE-7212: Add geo3d doc values field
ASF subversion and git services (migrated from JIRA)
Commit 0e9ec477d76da308e2a2d16e4aab5e42ae5ad820 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0e9ec47
LUCENE-7212: Add geo3d doc values field
Karl Wright (@DaddyWri) (migrated from JIRA)
@mikemccand: Here's a further patch reorganizing Geo3D a bit and adding sort fields for paths and circles.
ASF subversion and git services (migrated from JIRA)
Commit 07af00d8e7bc4ce2820973e2ab511bfe536654c6 in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=07af00d
LUCENE-7212: Add Geo3D sorted document fields.
ASF subversion and git services (migrated from JIRA)
Commit 8a407f0399c6575d6f4bb087f1d9fdc7d112e5d2 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8a407f0
LUCENE-7212: Add Geo3D sorted document fields.
ASF subversion and git services (migrated from JIRA)
Commit 417c37279e91954c105f6d70e0f863cd28bf0682 in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=417c372
LUCENE-7212: Add tests for encoding/decoding.
ASF subversion and git services (migrated from JIRA)
Commit ace032ea2645e3e25f38aa3631c6e2ef0422e801 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ace032e
LUCENE-7212: Add tests for encoding/decoding.
ASF subversion and git services (migrated from JIRA)
Commit 2a810938bad25438b7c3474e1ad2c5d3500cdb31 in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2a81093
LUCENE-7212: Rename the public method for path sort fields.
ASF subversion and git services (migrated from JIRA)
Commit a13428e0082912b27ecf0642c30e8d441b2ea8f6 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a13428e
LUCENE-7212: Rename the public method for path sort fields.
ASF subversion and git services (migrated from JIRA)
Commit cc263413457c826914c3f3c1c26f885330a931aa in lucene-solr's branch refs/heads/master from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc26341
LUCENE-7212: Add outside distance classes and methods.
ASF subversion and git services (migrated from JIRA)
Commit aa339a2f0756a487293c271a673e60e12722d661 in lucene-solr's branch refs/heads/branch_6x from @DaddyWri https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aa339a2
LUCENE-7212: Add outside distance classes and methods.
Karl Wright (@DaddyWri) (migrated from JIRA)
What's still needed for this ticket:
Geo3D has a number of distance measurements and a generic way of computing interior distance. It would be great to take advantage of that for queries that return results ordered by interior distance.
Migrated from LUCENE-7212 by Karl Wright (@DaddyWri), updated May 18 2016 Attachments: LUCENE-7212.patch (versions: 2)