Open dmunch opened 7 years ago
For those following this feature, there is a first successful proof-of-concept implementation available in the following branches:
https://github.com/dmunch/clouseau/tree/dmunch-lucene6-latlonpoint https://github.com/dmunch/dreyfus/tree/dmunch-lucene6-latlonpoint
Motivation
Starting from version 6.0 Lucene features effecient geospatial indexing with k-d index trees.
https://www.elastic.co/blog/lucene-points-6.0 https://www.elastic.co/blog/apache-lucene-numeric-filters
While Cloudant already offers a geospatial indexing module its implementation has one major drawback: It is not possible to combine geospatial indexing with fulltext indexing, i.e. either you query by geometry or you query by another criteria. Querying for both criteria at the same time is not possible.
Furthermore, the geospatial module is not open-sourced (yet?), i.e. it can only be used in the hosted Cloudant offer but not on premisses. While the abandoned branch 62936-geo-with-query indicates that there as been initial work to open-source the module its status is not clear.
Goal
The goal is to implement a minimal viable implementation featuring only the following basic functions:
LatLonPoint
Implementation
After initial work as described in issue #4 in porting clouseau to Java 8 and Lucene 6.3.0 we now have all the necessary conditions to implement efficient spatial search using Lucene's
LatLonPoint
and the corresponding functionsnewBoxQuery()
newDistanceQuery()
newPolygonQuery()
nearest()
Two stages have to be distinguished: indexing and querying.
Indexing
Currently clouseau does not allow indexing of complex types like objects or arrays. A necessary step would be to allow arrays to be sent from dreyfus to clouseau. With some luck and thanks to
scalang
this involves little or tiny work inClouseauTypeDecoder
.Once clouseau receives an array a very first implementation could only handle the case of an array with only two elements and index it as
LatLonPoint
, the first element being the latitude, the later the longitude.Note that an array might also be indexed as a multi-dimensional
DoublePoint
. Eventually this might become the default case withLatLonPoint
being a special case and being forced by an entry in the options map or vice versa.In a third step complex types might be enabled as well, which would allow for GeoJSON to be send from dreyfus to clouseau. This however would involve further work and is certainly out of scope for this minimal proof of concept.
Note that most certainly most of this functionality can be implemented solely by modifications to clouseau.
Querying
In order to be able to query fields indexed as
LatLonPoint
another field has to be introduced in the querying endpoint of dreyfus. For the sake of simplicity I would call this fieldqdsl
and let its syntax be inspired from the ElasticSearch Geo Queries syntax.This reason to call it
qdsl
is that it might support the more generic Elasticsearch Query DSL at some point in the future.As a simple mashup this could look like the following JSON.
Dreyfus would recognize the
qdsl
field and pass it on to clouseau. Clouseau would do the heavy lifting, parse the JSON, construct the corresponding LuceneQuery
and combine it by means of aBooleanQuery
MUST
with the query obtained fromq
.Changes to dreyfus would be minimal while additions to clouseau are still moderate.
Conclusion
The necessary codechanges involved in bringing Lucene geospatial filtering to clouseau are moderate and rather straight forward. Different roads can be taken on how to index point fields and query the geospatial index, one is outlined in text above. Care was token to chose an approach with an expected minimal implementation effort while still providing most of the additional value introduced with
LatLonPoint
in Lucene 6.This proposition should currently be seen as a draft and is open for comments and suggestions.