apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Add GeoPointDistanceQuery for GeoPointField type [LUCENE-6532] #7590

Open asfimport opened 9 years ago

asfimport commented 9 years ago

[LUCENE-6481 | https://issues.apache.org/jira/browse/LUCENE-6481] adds GeoPointField w/ GeoPointInBBox and GeoPointInPolygon queries. This feature adds GeoPointDistanceQuery to support point radius queries.


Migrated from LUCENE-6532 by Nick Knize (@nknize), updated Feb 01 2016 Attachments: LUCENE-6532.patch (versions: 2)

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Initial work in progress. This first cut is super super slow, but gets the math foundation in the open for review.

Notable additions:

Separately I've started adding these computations to BKDTree. Point radius queries on BKDTree should be far faster than simply using the Terms Dictionary and Postings list as the KD-Tree structure is naturally organized by location. This should enable us to stop traversing the tree once we've found an inner node that matches the distance query. I'll file a separate issue for this feature and work it in tandem.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

60% performance improvement over previous patch. Notable changes include:

Point-radius circles that cross the dateline is currently a limitation. Will add logic to support this capability on a future iteration (most likely a separate issue that will add dateline crossing support to PointInBBox, PointInPoly, and PointDistance query classes).

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @nknize this is some heavy math :)

I wonder if we should provide a Haversine impl (default?) that's faster but a bit off (22 meters from your comment)?

Do you have a simple benchmark somewhere (you mentioned a 60% speedup)?

In the 2nd patch, do you skip the polygon approximation entirely? It's curious that made things so much faster; it seems like that should have avoided a large number of per-hit distance calculations.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

...(22 meters from your comment)?

Oh is it 22 km? That's different I guess :)

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

I wonder if we should provide a Haversine impl (default?) that's faster but a bit off (22 meters from your comment)?

This is an improvement area probably worth iterating over time. There are quite a few distance formulae we can add (and optimize) that provide different speed/accuracy trade-offs. At the moment I've only added vincenty (without the faster helmert expansion parameter) intended for small search areas. The simple Haversine/law of cosine implementation was already available in SloppyMath and for the most common distance search use case is quite fast and gives reasonable results (usually about 0.3% error). On the other hand, for highly accurate distance calculations, Vincenty (when it converges to E-12) accuracy is excellent at \~0.06mm - about 10mm more accurate than the morton encoding being used to index the points. For small search areas (county level) the convergence performance is comparable to haversine, but for large areas can take an unreasonably long time, or even fail on nearly antipodal points. This is why the patch is currently using Haversine. As a future improvement it might be worthwhile adding the Lambert distance formula for large search areas. As a comparison point, Lambert gives \~10m accuracy at 20,000 km distances where haversine error can be as much as 2km.

Do you have a simple benchmark somewhere (you mentioned a 60% speedup)?

Not presently in the patch. I used System time calculations in a few contrived tests. I actually didn't need to go that far as the performance gain was seconds better using ant test in verbose mode.

In the 2nd patch, do you skip the polygon approximation entirely?

Yes. I skipped the process of converting the point-radius to an approximated polygon and simplified everything by adding rectCrossesCircle and rectWithinCircle. This made everything faster because instead of computing rectCrossesPolygon (which requires 4*nEdges iterations) the rectCrossesCircle method, worst case, requires 4 quadratic computations to determine whether a cell crosses the circle. (I'm really glad you asked this because I just realized I have some unnecessary haversine computations in that method. Removing those and will add a new patch).