Closed asfimport closed 9 years ago
Karl Wright (@DaddyWri) (migrated from JIRA)
Here's the library itself. Integration with spatial4j will be attached shortly.
Karl Wright (@DaddyWri) (migrated from JIRA)
This is the spatial4j integration.
Karl Wright (@DaddyWri) (migrated from JIRA)
Oops, needed apache license...
David Smiley (@dsmiley) (migrated from JIRA)
Cool; I'll have to dig into this in the upcoming week. If I'm not mistaken, you are already an ASF committer for ManifoldCF and thus you already have a contributor license agreement on-file?
I took a quick peek at ShapeImpl.java. What would it take to implement the bounding-box? It's not strictly necessary but it helps performance a lot.
Karl Wright (@DaddyWri) (migrated from JIRA)
I can contribute unit tests as well. Just let me know.
Karl Wright (@DaddyWri) (migrated from JIRA)
Hi David,
I have an ICLA on file, yes.
I'll look into implementing bounding-box. It makes more sense for some shapes than for others. For some shapes it is darned near useless. It did not seem to get used at all in geohash generation.
Nick Knize (@nknize) (migrated from JIRA)
I'm interested in checking this out as well. We've got some WIP for indexing both 3&4D space-time using a hilbert-curve I'd be interested in comparing notes between the two.
As a side, David's using the bounding box to determine detailLevel for speeding up traversal of the RPT. As an FYI, distance might be a little misleading in that logic as its in units of decimal degrees.
Karl Wright (@DaddyWri) (migrated from JIRA)
If the bounding box is used for determining detail level, what I've done in the past is to set bounds for the number of geohash tokens, and descend until either you hit the maximum depth, or you hit the minimum number of desired tokens. That's of course not terribly refined, but yields a superset of the documents actually being described. By filtering at scoring time using the BoostSource technique (namely creating a FunctionQuery with a BoostSource that returns score values that will be rejected by the collector when a document is outside the shape), it's quite workable.
David Smiley (@dsmiley) (migrated from JIRA)
If the bounding box is used for determining detail level, what I've done in the past is to set bounds for the number of geohash tokens, and descend until either you hit the maximum depth, or you hit the minimum number of desired tokens.
That makes sense. It hasn't been an issue thus far because all of the existing shapes have a bounding box. Right now the traversal done in TreeCellIterator (used with CellTokenStream, via PrefixTreeStrategy) is depth-first, so it doesn't know how many cells there are going to be as it walks (it's given a target/max depth level a-priori). If it was breadth-first, then it could instead be given a minimum number of cells and then it'll go to the level just beyond if it reaches this threshold mid-way through tokenizing a level. The rationale behind it being depth-first is that I intend to reuse this on the TreeCellIterator on the search side in a refactor of AbstractVisitingPrefixTreeFilter (#6807) which underpins Intersects, Within, and now recently for heatmap & time faceting, which needs to consume the tokens in sorted order to leap-frog with TermsEnum, and so it must be depth-first.
The concept of using the cells for fast and approximate filtering and then lookup the vector geometry in, say, DocValues totally makes sense. As of last summer the spatial module has had SerializedDVStrategy to generalize the later check. It would further be awesome to have a derivative of RPT that is able to detect which cells are within the query geometry and so don't double-check those documents, and likewise for leaf cells containing the query geometry need not be double-checked either. It's a wish-list feature #6641. Ideally the leaf cells would be differentiated as edge-approximated or within but it's not essential. The optimization I'm talking about here wouldn't be appropriate when you want incorporate the distance from a mid-point of a point-radius shape in the ranking since you might as well filter at the collector as you describe.
We've got some WIP for indexing both 3&4D space-time using a hilbert-curve
Sweet! 2015 is shaping up to be an awesome year for Lucene spatial. I wish I had the ability to contribute at the levels I was supported at a couple years ago.
Karl Wright (@DaddyWri) (migrated from JIRA)
Right now the traversal done in TreeCellIterator (used with CellTokenStream, via PrefixTreeStrategy) is depth-first, so it doesn't know how many cells there are going to be as it walks (it's given a target/max depth level a-priori). If it was breadth-first, then it could instead be given a minimum number of cells and then it'll go to the level just beyond if it reaches this threshold mid-way through tokenizing a level.
That sounds correct to me. All you would need to do is to introduce a new method invocation that accepted a max depth and a min number of tokens, and hook it up to a different iterator that did what was needed.
Karl Wright (@DaddyWri) (migrated from JIRA)
As of last summer the spatial module has had SerializedDVStrategy to generalize the later check. It would further be awesome to have a derivative of RPT that is able to detect which cells are within the query geometry and so don't double-check those documents, and likewise for leaf cells containing the query geometry need not be double-checked either. It's a wish-list feature #6641. Ideally the leaf cells would be differentiated as edge-approximated or within but it's not essential.
With geo3d, the cost of determining membership in the shape is so low that I think you do better by just doing it, rather than trying to avoid doing it. In a way that was the whole point behind the development of geo3d.
For example, have a look at the GeoCircle code. To determine whether a point is within the circle, the evaluation that is needed is only the following:
`@Override`
public boolean isWithin(GeoPoint point)
{
// Fastest way of determining membership
return circlePlane.onCorrectSide(point);
}
... which involves three multiplications, three additions, and a comparison. That's cheap by any standard. The most expensive thing here is constructing the GeoPoint from (x,y,z) coordinates, because it does an allocation. Further optimization of the library can avoid even that by having an alternate form of the method: isWithin(double x, double y, double z) .
David Smiley (@dsmiley) (migrated from JIRA)
With geo3d, the cost of determining membership in the shape is so low that I think you do better by just doing it, rather than trying to avoid doing it. In a way that was the whole point behind the development of geo3d.
It's awesome the code your contributing does this so fast! But even if it were theoretically 0-cost, what isn't 0-cost is getting the points to test in the first place, wether it be from DocValues or turning the terms scanned in-progress into points. There's no telling how many points lie within the shape in the general case because the circle (or whatever shape) could be large and there could be a ton of data within it. Granted for your specific application which may have constraints on how big a shape is permitted to be and knowledge of your data density, it could very well be a non-issue for you.
Karl Wright (@DaddyWri) (migrated from JIRA)
But even if it were theoretically 0-cost, what isn't 0-cost is getting the points to test in the first place, wether it be from DocValues or turning the terms scanned in-progress into points.
Agreed, although one could argue that the solution to that problem is proper query design. Any query that results in the scoring of millions or hundreds of millions of items hasn't been well thought through. I don't think it makes sense for us to try to solve that problem in Lucene Spatial.
David Smiley (@dsmiley) (migrated from JIRA)
I can tell that you've been focused on distance boosting/ranking applications – and in that context I see where you're coming from. But it shouldn't at all be a for-lorn conclusion that the application is going to score/rank the results by distance. The query might be for analytics to compare a count with multiple other filters (e.g. time) or it might be a spatial-rich data set like "tracks" generating from GPS and other sensors matching thousands of points and once you get into the thousands, I think let alone millions, there would be benefit to avoiding thousands of DocValiues lookups (random-IO) versus reading postings for a couple terms or so known to be within the shape.
Any way; thanks again for releasing geo3d here.
Karl Wright (@DaddyWri) (migrated from JIRA)
Updating interface structure to have a distinction between the more-generic "Membership" and the more specific "Distance" capabilities.
Karl Wright (@DaddyWri) (migrated from JIRA)
Make the notion of membership be even more generic, and able to be used for relationship determination.
Karl Wright (@DaddyWri) (migrated from JIRA)
More cleanup for the generic case. Also, handle bounding boxes with greater than 180 degrees width properly.
Karl Wright (@DaddyWri) (migrated from JIRA)
Ok, the changes I made should make the geo3d library significantly more general. What I did was basically twofold: rearranged the interfaces to increase their overall ease of applicability, and second, added a missing capability: bounding boxes with longitude ranges greater than 180 degrees. In order to enforce proper usage, I also added argument range checks for all the major shape objects.
Karl Wright (@DaddyWri) (migrated from JIRA)
As a side, David's using the bounding box to determine detailLevel for speeding up traversal of the RPT. As an FYI, distance might be a little misleading in that logic as its in units of decimal degrees.
The problem with most arbitrary shapes and "bounding boxes" is twofold:
(1) Bounding boxes, if you actually look at the earth, represent bounds on the top and bottom that are not great circles. So it is very possible to have a shape that is within a bounding box, according to that box's four corners, but which curves outside of the box on the top or bottom. Therefore, computing the actual bounds of a shape is fairly expensive, compared with everything else we do in geo3d. I'll try to work out the planar math to find out exactly how expensive. (2) The stated reason to use the bounding boxes is to determine resolution. This is fine for regular shapes like circles and rectangles, but it is very wasteful for shapes that are long and thin or irregular.
For this reason, I'd much prefer that a different QuadTree getCells() variant be introduced, which (as I stated) has a cell count bound, and does a breadth-first descent. If that's not possible, then one alternative is to create a new interface in the geo3d package, which computes the lat/lon bounds for a shape – but then maybe not all shapes would implement it. That would imply two distinct spatial4j implementations as well – one for the kind of shape that does bounding boxes, one for the kind that doesn't. Stay tuned.
Nick Knize (@nknize) (migrated from JIRA)
I don't know that its necessary to determine the expense of computing bounding boxes for irregular shapes (if for only satisfying curiosity). Its a known problem with doing all computation in equirectangular lat/lon (or any single projection, for that matter) a choice is being made to sacrifice accuracy at some level. Its why there exists so many different projections for various parts of the earths surface. Spherical (or ellipsoidal) mathematics using an approximation of the earth is just plain inaccurate since there is no perfect model of the earth's shape. To achieve the best accuracy for large irregular polygon's one will either reproject the data into a more accurate projection, slice the polygon into smaller (more accurate) polygons, introduce something like a bloom filter for catching inaccurate boundary conditions, or.... insert other options here.... All of which are going to be just as, or more, expensive.
I'm all for spending time working up the fastest, most accurate logic (its a responsibility for any mature mapping or GIS utility). But it needs to be well documented and at the choice of the end user. For 95% of private sector users, great-circle is OK. For public-sector applications on the other hand.
David Smiley (@dsmiley) (migrated from JIRA)
A breadth-first feature would be great but I don't have the time right now. I will eventually, I want to do it, but not now. Computing the bounding box may be expensive from your point of view but it is at least a one-time event and could be cached. This is what CircleImpl does.
David Smiley (@dsmiley) (migrated from JIRA)
The other part missing aside from implementing Shape, to include getBoundingBox, is a SpatialContext subclass that implements the factory methods creating Geo3d variants. That is very easy. At that point; Geo3d becomes very usable with Lucene spatial, and it would be the first release of geo3d into Lucene. This is a first step; at some point work could be done to get shapes for which there are no factory methods (e.g. polygons), which would involve a little refactoring over in Spatial4j. Then we'd have WKT parsing hooked in for other shapes (polygons, etc.).
Karl Wright (@DaddyWri) (migrated from JIRA)
I've been trying to work the math; so far I haven't come up with a good general way of calculating bounds for planes intersecting the unit sphere. I will keep at it, but so far I do not have an acceptable solution. I'd be happy to attach a separate breadth-first method patch; that's much easier actually – it's just code.
Karl Wright (@DaddyWri) (migrated from JIRA)
General polygons are trivial to implement in this framework, although I haven't yet done so. Would you prefer that happen first, or finding a solution to the bounding box problem?
Karl Wright (@DaddyWri) (migrated from JIRA)
Attention math whizzes: So in order to generally solve the problem of bounding box for a shape, this is what we need.
We have a plane with equation Ax + By + Cz + D = 0. We have the unit circle, with equation x^2 + y^2 + z^2 = 1. We want to find the points (x,y,z) satisfying both equations with the maximum z and minimum z values. There should be zero, one, or two of these.
I've tried a number of things; the technique that looks like it has the best chance of success at the moment is the technique of Lagrangian multipliers. See http://en.wikipedia.org/wiki/Lagrange_multiplier . I think this would basically apply with:
g(x,y) = x^2 + y^2 + [-(A/C)x - (B/C)y - (D/C)]^2 - 1
= [(A^2/C^2)x^2 + (B^2/C^2)y^2 + (D^2/C^2) + (2AB/C^2)xy + (2AD/C^2)x + (2BD/C^2)y] + x^2 + y^2 - 1
= [1+A^2/C^2] x^2 + [1+B^2/C^2] y^2 + [(D^2/C^2)-1] + (2AB/C^2)xy + (2AD/C^2)x + (2BD/C^2)y
f(x,y) = -(A/C)x - (B/C)y - (D/C)
But it's been quite a while since my multivariable calculus days. Anyone want a nice math challenge? E.g. Mike McCandless? ;-)
Karl Wright (@DaddyWri) (migrated from JIRA)
I found a non-calculus way to do it, yielding:
z = DC +/- sqrt(D^2*C^2 + 1 - C^2 - D^2)
This should be enough to allow me to solve the problem completely.
Nick Knize (@nknize) (migrated from JIRA)
You'll want the oblate spheroid not the sphere. So don't forget the semi-axis coefficients (normalized if you're using unit normal vectors) e.g.,
(x^2+y^2)/a^2 + z^2/c^2 - 1
Karl Wright (@DaddyWri) (migrated from JIRA)
Hi Nicholas, The rest of the geo3d package uses the unit sphere. Oblate spheres would require quite a number of changes, so I'm not going there for the time being.
Nick Knize (@nknize) (migrated from JIRA)
OK. Cool. So this initial package is for a spherical system not a geodesic system then. I assume follow-on work (before lucene-spatial integration) includes correcting for error introduced by an incorrect datum.
David Smiley (@dsmiley) (migrated from JIRA)
Thus far, Lucene spatial supports a Euclidean model (for "projected" data or non-geospatial uses), and a spherical one. Supporting ellipsoidal or other models is definitely out of scope of this issue.
Nick Knize (@nknize) (migrated from JIRA)
So for a GEO context (I'm assuming "geo" implies geospatial) "spherical" operations should use the oblate spheroid (WGS84 is fine for keeping it simple). Otherwise its just a normal spherical coordinate system and geospatial operations will be erroneous and misleading to users.
Nick Knize (@nknize) (migrated from JIRA)
I note that I"m preaching to the choir... chalk it up as an enhancement. :)
Karl Wright (@DaddyWri) (migrated from JIRA)
The full point formulas for (x,y,z) for min/max z are as follows:
z = DC +/- sqrt(D^2*C^2 + 1 - C^2 - D^2)
y = -B[D+Cz] / [A^2 + B^2]
x = -A[D+Cz] / [A^2 + B^2]
For rectangles, and any shape that uses only planes going through the origin, this is all that I need. But for shapes using other planes, I need a similar formula for min/max longitude. Working on that now.
David Smiley (@dsmiley) (migrated from JIRA)
LOL Absolutely.
Karl Wright (@DaddyWri) (migrated from JIRA)
Unless I made a math error, I get the following for longitude bounds:
ADx0 + BDy0 - C^2 = 0
... combined with either:
[C^2 * D^2 * (B^2 + A^2)] * y0^2
+ [-2*C^4*BD] * y0
+ [C^6 + A^2 * C^4 + A^2 * D^2 * C^2] = 0
x0 = (-BDy0 + C^2) / AD
... or its counterpart, solving for x0:
[C^2 * D^2 * (B^2 + A^2)] * x0^2
+ [-2*C^4*BD] * x0
+ [C^6 + B^2 * C^4 + B^2 * D^2 * C^2] = 0
y0 = (-ADx0 + C^2) / BD
Either way, the minimum or maximum longitude, provided there are any solutions (x0,y0), is given by:
longitude = atan(y0/x0)
So as soon as I confirm this math, I can implement a general bounds solution for an arbitrary plane intersecting the unit sphere, which should allow me to provide bounding boxes for all shapes built on 3d bounding planes.
Karl Wright (@DaddyWri) (migrated from JIRA)
I'm completely out of time to work on this this week, and probably next week as well. Where it is left:
Picking this up again sometime mid February...
Karl Wright (@DaddyWri) (migrated from JIRA)
Had some time today, and finished the implementation. Started writing tests but not done with that yet.
Karl Wright (@DaddyWri) (migrated from JIRA)
Found some problems with the Bounds computation. I won't have time to look at these until the weekend though.
Karl Wright (@DaddyWri) (migrated from JIRA)
Problems have been fixed, and tests are reasonably comprehensive.
Karl Wright (@DaddyWri) (migrated from JIRA)
Added polygon support. Is there anything else you need before committing this code?
Michael McCandless (@mikemccand) (migrated from JIRA)
Rather than continue to iterate this to perfection (though I do love seeing all this math!), I think it would make sense to commit it and open further issues to improve it? It seems like an awesome contribution already?
Progress not perfection...
David Smiley (@dsmiley) (migrated from JIRA)
Polygons – awesome! And of course tests are essential. I'll take another look in a couple days or so. What I plan to do is enhance an RPT test (like RandomSpatialOpFuzzyPrefixTreeTest) to actually use these shapes and see if it works from that side of things. As long as these shapes can compute their relationships with a lat-lon bounding box correctly, then we should be in business.
There's one thing I want to confirm with you Karl: so these shares are "accurate" to the spherical model (the unit sphere)? That is, if I have say a line string, then is each line segment a great-circle path between its start & end? If not then can you please explain?
Nick Knize (@nknize) (migrated from JIRA)
So I went ahead and took a quick look. I do have a question worth thinking about. Is there anything in here I'm missing that's necessitating this be a part of the Lucene spatial code base? Since its essentially a computational geometry package for 3D space using a unit sphere (and not enhancing or changing the core indexing and search capabilities for geo) why not instead add it to spatial4j as is? That way it can benefit from enhancements from spatial4j (e.g., crs projections) and remain decoupled from Lucene capabilities of handling the evolving spatial indexing strategies?
David Smiley (@dsmiley) (migrated from JIRA)
I completely agree with Nick but was waiting to make my Spatial4j pitch :-) Computational geometry has nothing to do with Lucene, but Lucene has a use for tapping into computational geometry libs. I know on the dev list you mentioned you only want to contribute this to Lucene but you didn't say the reason. I'm guessing it's so that you needn't re-ask permission from your employer, whom you may have agreements for a limited number of specific open-source projects. If that's the case, you've already done your part and complied – the code is attached to this JIRA and has an ASL license. The license permits it to be incorporated into any other ASL project (e.g. Spatial4j) with ease. I didn't want to hinder you getting to this point.
Karl Wright (@DaddyWri) (migrated from JIRA)
There's one thing I want to confirm with you Karl: so these shares are "accurate" to the spherical model (the unit sphere)? That is, if I have say a line string, then is each line segment a great-circle path between its start & end? If not then can you please explain?
In part this depends on the shape. For GeoConvexPolygons, the surface paths are all great circles. For GeoCircles, it's a circle not a great circle. For GeoPaths, the boundary of the shape consists of a zone of a specific width on either side of a great circle, so the boundary is not (if you think about it) actually a great circle. For GeoBBox shapes (e.g. GeoRectangles), they are great circles in longitude, but horizontal slices in latitude. This matches the standard "rectangle" metaphor that quad trees are built on.
Karl Wright (@DaddyWri) (migrated from JIRA)
I know on the dev list you mentioned you only want to contribute this to Lucene but you didn't say the reason. I'm guessing it's so that you needn't re-ask permission from your employer, whom you may have agreements for a limited number of specific open-source projects.
That is the case.
The license permits it to be incorporated into any other ASL project (e.g. Spatial4j) with ease.
Yes, I know that. Spatial4j, however, seemed at its core to have a much different set of capabilities in mind. Spatial4j is totally wired into a latitude/longitude bounding box geometry, and most of the methods required by Shape objects have no implementation in a geo3d world. The geo3d world has exactly what is needed for lucene searching, and no more. So they are not great fits to one another, IMHO.
Nick Knize (@nknize) (migrated from JIRA)
Spatial4j is totally wired into a latitude/longitude bounding box geometry, and most of the methods required by Shape objects have no implementation in a geo3d world.
David can elaborate, but at the moment that's only a temporary delimitation of spatial4j, not a limitation. Spatial4J is intended to expand on JTS (but w/ an ASF license) for geospatial applications, not intended to remain restricted to a 2D (lat/lon) set of capabilities. That's why integrating this within S4J is an attractive approach.
The geo3d world has exactly what is needed for lucene searching, and no more.
It looks to me that any index/search approach based on a space filling curve would benefit from geo3d. In fact, what you have here can just as easily be applied to an R/R*-Tree algorithm. Where in the design is it restricted to lucene?
Karl Wright (@DaddyWri) (migrated from JIRA)
Where in the design is it restricted to lucene?
I did not claim it was restricted to Lucene. My claim is that the functionality is expressly designed to solve the geographic search problem, which to me consists of three parts: geohash construction, highly-performant result filtering, and highly-performant distance scoring functionality. A general package, in my view, is distinct in the following ways:
(1) It tends to try to solve a broader set of geographic problems, i.e. computing a shape's area, intersecting shapes, etc. (2) There is much less emphasis on the highly-performant computational requirements mentioned above; general packages by and large don't have the "expensive construction/dirt cheap individual evaluation" requirement that search engines like Lucene would have.
Having said that, I have no objection if you want to use this code in spatial4j. I just cannot contribute to spatial4j at the moment. And I do think that there is a close-enough relationship between the search problem and geo3d that it isn't unreasonable to include geo3d in Lucene.
Nick Knize (@nknize) (migrated from JIRA)
Intent and design of general utility libraries certainly vary. One thing is for sure, they all start somewhere. IMHO this is a good start to the 3D computational geospatial problem space and incubating it within Spatial4J (rc-0.4.2 or rc-0.5) would give it broader exposure / potential for greater contribution from the geospatial community.
That being said, you're the original author. I suppose its up to the community to decide where the contribution best fits? Taking into consideration, of course, the inability for you to continue contributing to one of the candidate packages.
Karl Wright (@DaddyWri) (migrated from JIRA)
Nicholas,
In the long run I doubt I will continue to have the same restrictions in place as to what I can contribute to. But for now, they are what they are, and I have to live within them.
My recommendation: Keep geo3d as a distinct library, within Lucene for the moment, but with the option of eventually spinning it off or integrating with Spatial4j. To that end, it would be ideal if it were packaged in its own jar, etc.
I would like to explore contributing a geo3d package to Lucene. This can be used in conjunction with Lucene search, both for generating geohashes (via spatial4j) for complex geographic shapes, as well as limiting results resulting from those queries to those results within the exact shape in highly performant ways.
The package uses 3d planar geometry to do its magic, which basically limits computation necessary to determine membership (once a shape has been initialized, of course) to only multiplications and additions, which makes it feasible to construct a performant BoostSource-based filter for geographic shapes. The math is somewhat more involved when generating geohashes, but is still more than fast enough to do a good job.
Migrated from LUCENE-6196 by Karl Wright (@DaddyWri), 2 votes, resolved May 06 2015 Attachments: geo3d.zip, geo3d-tests.zip, LUCENE-6196_Geo3d.patch, LUCENE-6196-additions.patch, LUCENE-6196-fixes.patch, ShapeImpl.java