cloudant-labs / clouseau

Expose Lucene features as an erlang-like node
Apache License 2.0
58 stars 32 forks source link

[Feature] Geospatial filtering based on Lucene 6 #9

Open dmunch opened 7 years ago

dmunch commented 7 years ago

Motivation

Starting from version 6.0 Lucene features effecient geospatial indexing with k-d index trees.

https://www.elastic.co/blog/lucene-points-6.0 https://www.elastic.co/blog/apache-lucene-numeric-filters

While Cloudant already offers a geospatial indexing module its implementation has one major drawback: It is not possible to combine geospatial indexing with fulltext indexing, i.e. either you query by geometry or you query by another criteria. Querying for both criteria at the same time is not possible.

Furthermore, the geospatial module is not open-sourced (yet?), i.e. it can only be used in the hosted Cloudant offer but not on premisses. While the abandoned branch 62936-geo-with-query indicates that there as been initial work to open-source the module its status is not clear.

Goal

The goal is to implement a minimal viable implementation featuring only the following basic functions:

Implementation

After initial work as described in issue #4 in porting clouseau to Java 8 and Lucene 6.3.0 we now have all the necessary conditions to implement efficient spatial search using Lucene's LatLonPoint and the corresponding functions

Two stages have to be distinguished: indexing and querying.

Indexing

Currently clouseau does not allow indexing of complex types like objects or arrays. A necessary step would be to allow arrays to be sent from dreyfus to clouseau. With some luck and thanks to scalang this involves little or tiny work in ClouseauTypeDecoder.

Once clouseau receives an array a very first implementation could only handle the case of an array with only two elements and index it as LatLonPoint, the first element being the latitude, the later the longitude.

Note that an array might also be indexed as a multi-dimensional DoublePoint. Eventually this might become the default case with LatLonPoint being a special case and being forced by an entry in the options map or vice versa.

In a third step complex types might be enabled as well, which would allow for GeoJSON to be send from dreyfus to clouseau. This however would involve further work and is certainly out of scope for this minimal proof of concept.

Note that most certainly most of this functionality can be implemented solely by modifications to clouseau.

Querying

In order to be able to query fields indexed as LatLonPoint another field has to be introduced in the querying endpoint of dreyfus. For the sake of simplicity I would call this field qdsl and let its syntax be inspired from the ElasticSearch Geo Queries syntax.

This reason to call it qdsl is that it might support the more generic Elasticsearch Query DSL at some point in the future.

As a simple mashup this could look like the following JSON.

{
  "q": "index:my query",
  "sort": "foo",
  "limit": 3,
  "qdsl": {
    "geo_distance" : {
        "distance" : "200km",
        "fieldname" : {
            "lat" : 40,
            "lon" : -70
        }
    }
  }
} 

Dreyfus would recognize the qdsl field and pass it on to clouseau. Clouseau would do the heavy lifting, parse the JSON, construct the corresponding Lucene Query and combine it by means of a BooleanQuery MUST with the query obtained from q.

Changes to dreyfus would be minimal while additions to clouseau are still moderate.

Conclusion

The necessary codechanges involved in bringing Lucene geospatial filtering to clouseau are moderate and rather straight forward. Different roads can be taken on how to index point fields and query the geospatial index, one is outlined in text above. Care was token to chose an approach with an expected minimal implementation effort while still providing most of the additional value introduced with LatLonPoint in Lucene 6.

This proposition should currently be seen as a draft and is open for comments and suggestions.

dmunch commented 7 years ago

For those following this feature, there is a first successful proof-of-concept implementation available in the following branches:

https://github.com/dmunch/clouseau/tree/dmunch-lucene6-latlonpoint https://github.com/dmunch/dreyfus/tree/dmunch-lucene6-latlonpoint