apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.73k stars 1.05k forks source link

Improve GeoPointDistanceQuery performance [LUCENE-7663] #8714

Open asfimport opened 7 years ago

asfimport commented 7 years ago

GeoPoint queries currently use only the bounding box for filtering. But the query circle is only roughly 80% of the bounding box, so we could be roughly 20% faster. Furthermore, the current approach requires splitting the box if it crosses the date line.

> Schubert, E., Zimek, A., & Kriegel, H. P. (2013, August). Geodetic distance queries on r-trees for indexing geographic data. In International Symposium on Spatial and Temporal Databases (pp. 146-164).

The minimum spherical distance of a point to a rectangle is given ("Algorithm 2: Optimized Minimum Distance Point to MBR"). Rectangles whose minimum distance is larger than the query radius can be skipped. The authors used the R-tree, but it will work with any bounding box, so it can be used in CellComparator#relate. It's not very complex - a few case distinctions, and then either Haversine distance, or cross-track-distance. So the cost ist comparable to Haversine. This could be added as GeoRelationUtils.pointToRectMinimumDistance, for example.

This approach can also be used to priorize rectangles, for top-k search.


Migrated from LUCENE-7663 by Erich Schubert, updated Jan 28 2017

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This sounds promising!

But, we are moving away from GeoPointDistanceQuery in favor of the block KD tree (dimensional points) implemenation, LatLonPoint.newDistanceQuery: the latter is quite a bit faster as measured in our nightly geo benchmarks, and it recently just got even faster with #8707.

Do you think the ideas from these papers would also apply to LatLonPoint.newDistanceQuery?

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This approach can also be used to priorize rectangles, for top-k search.

Oh we also have a nearest neighbor implementation for points: LatLonPoint.nearest. Seems like these papers could help that too?

asfimport commented 7 years ago

Erich Schubert (migrated from JIRA)

Should apply to LatLonPointDistanceQuery as well, to reduce the number of false positives in #compare.

In NearestNeighbor, it should improve #approxBestDistance (which currently computes haversine to all four corners of the bounding box, which is incorrect for a large bounding box, and a point close to the middle point of an edge & about 4x as expensive). In fact, it would be trueBestDistance as desired in the TODOs.