apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Add Points Based Shape Indexing [LUCENE-8396] #9443

Closed asfimport closed 5 years ago

asfimport commented 6 years ago

I've been tinkering with this for a while and would like to solicit some feedback. I'd like to introduce a new shape field based on the BKD/Points codec to bring much of the Points based performance improvements to the shape indexing and search usecase. Much like the existing shape indexing in spatial-extras the shape will be decomposed into smaller parts, but instead of decomposing into quad cells (which have the drawback of precision accuracy and sheer volume of terms) I'd like to explore decomposing the shapes into a triangular mesh; similar to gaming and computer graphics. Not only does this approach reduce the number of terms, but it has the added benefit of better accuracy (precision is based on the index encoding technique instead of the spatial resolution of the quad cell).

For better clarity, consider the following illustrations (of a polygon in a 1 degree x 1 degree spatial area). The first is using the quad tree technique applied in the existing inverted index. The second is using a triangular mesh decomposition as used by popular OpenGL and javascript rendering systems (such as those used by mapbox).

polyWHole.png

Decomposing this shape using a quad tree results in 1,105,889 quad terms at 3 meter spatial resolution.

tessellatedPoly.png

Decomposing using a triangular mesh results in 8 triangles at the same resolution as encodeLat/Lon.

The decomposed triangles can then be encoded as a 6 dimensional POINT and queries are implemented using the computed relations against these triangles (similar to how its done with the inverted index today).


Migrated from LUCENE-8396 by Nick Knize (@nknize), 2 votes, resolved Nov 01 2018 Attachments: LUCENE-8396.patch (versions: 2), polyWHole.png, tessellatedPoly.png

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

I've attached a WIP patch for initial feedback.

Patch summary:

Javadocs are included. There are many todos, such as generalizing the bounding box query into a generic shape query that will query indexed shapes with provided query shapes. I think this can be done rather easily by adding a relateTriangle utility method to Polygon2D for computing relations of indexed triangles with the target shape. That, however, will be left as a future feature.

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
master Compile Tests
+1 compile 0m 57s master passed
Patch Compile Tests
-1 compile 0m 8s sandbox in the patch failed.
-1 javac 0m 7s sandbox in the patch failed.
+1 Release audit (RAT) 0m 32s Release audit (RAT) rat-sources passed
-1 Release audit (RAT) 0m 4s sandbox in the patch failed.
+1 Release audit (RAT) 0m 7s Release audit (RAT) rat-sources passed
-1 Check forbidden APIs 0m 32s Check forbidden APIs check-forbidden-apis failed
-1 Validate source patterns 0m 32s Check forbidden APIs check-forbidden-apis failed
Other Tests
+1 unit 21m 12s core in the patch passed.
-1 unit 0m 7s sandbox in the patch failed.
+1 unit 3m 26s test-framework in the patch passed.
27m 51s
Subsystem Report/Notes
JIRA Issue LUCENE-8396
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12931367/LUCENE-8396.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 8997d41
ant version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014
Default Java 1.8.0_172
compile https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-compile-lucene_sandbox.txt
javac https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-compile-lucene_sandbox.txt
Release audit (RAT) https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-compile-lucene_sandbox.txt
Check forbidden APIs https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-check-forbidden-apis-lucene.txt
Validate source patterns https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-check-forbidden-apis-lucene.txt
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/46/artifact/out/patch-unit-lucene_sandbox.txt
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/46/testReport/
modules C: lucene/core lucene/sandbox lucene/test-framework U: lucene
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/46/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This looks awesome. Indexing two additional dimensions sounds worth it to me if it helps index way fewer fields. It's also exciting we're exercising high numbers of dimensions with the points API. Since this is targeting sandbox and the patch is a bit large, I think it's easier to get it in soon and iterate from there? It looks pretty clean to me already. Let's maybe just make windingOrder final on the Polygon class first.

Some other comments I had while skimming through the patch:

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

do quantization first [...] and then make tessellation work directly in the quantized space

Now that I think about it, it's probably easier to do it the other way around, as the polygon could easily become invalid through quantization? Having triangles hold doubles rather than quantized ints might also help if we want to further split these triangles in the future, eg. to prevent flat triangles from increasing the area of the intersection of MBRs of sibling nodes in the tree.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This looks really cool!  It is great to see many dimensions being used with points.  +1 to push to sandbox and iterate there.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

+1 super cool Nick!  I like how the use of a tessellation technique can represent polygons with a number of triangles on the order of the number of edge vertices, and you wind up with perfect accuracy & scalability.  There are also some off the shelf computational geometry libraries that can simplify polygons (JTS can), and some users may want to do that when indexing shapes.

In your example you show a quad grid index of today decomposing a shape into a million terms.  Why would someone do this?  If someone wants to index shapes today with spatial-extras, they ought to use SerializedDVStrategy for accuracy combined with RecursivePrefixTreeStrategy for a grid index (perhaps 20% distErrPct) and CompositeSpatialStrategy wrapping both.  The number of terms for any shape is effectively capped and controlled indirectly via distErrPct – perhaps 100 terms for distErrPct=0.2?  (not sure without trying).

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

Thanks @jpountz, @mikemccand, and @dsmiley some great feedback here:

Great suggestions @jpountz. I'll make these changes, clean up the code, and push to sandbox for iterating.

@dsmiley

There are also some off the shelf computational geometry libraries that can simplify polygons (JTS can), and some users may want to do that when indexing shapes.

I took a look at several of the OTS simplification a while ago. I'm sure there are many more out there than what I canvassed so this isn't intended to be a catch all comment. And we could certainly look at other tessellation libraries in the future for iterative performance improvements. But I did look at the JTS simplification tools. The problem with those varied: from tolerance value errors creating invalid polygons to general performance overhead. Each were great as a stand alone utility, but none were reliable nor performant enough for the scale desired.

they ought to use SerializedDVStrategy for accuracy combined with RecursivePrefixTreeStrategy for a grid index

Indeed this is an optimization allowing one to index "larger" quad cells (to reduce the number of terms in the inverted index) and subsequently use the WKB BinaryDocValues for accurate relations. But not only is the use of BinaryDocValues with WKB limiting (with respect to shapes with large number of vertices and the marshalling/unmarshalling overhead) but all of the calls to .relate (with either JTS, S2, or other third party implementations) incur their own performance penalties.. not just for relating each quad cell to the query shape but also with having to unmarshall "leaf shapes" and compute the DE9IM to relate those to the query shape for accuracy.

combined with...and...wrapping both

This was the other motivator... Like LatLonPoint its nice to have a simple API (hopefully intended for core) that can solve the general Shape indexing and search use case without having to decide which PrefixTree, Grid, and Shape Relation libraries to use. Then for the more expert use cases users can always move over to spatial-extras for all of the additional choices.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I definitely understand that this new shape indexing scheme is better than serialized geometry + a grid, and why it is – no need to sell me :).  The benchmarks will show an amazing improvement, I'm sure – especially for shapes with a high number of vertexes.

I want to list some use-cases in which spatial-extras is (still) useful.  This list might go into documentation for spatial-extras.  A new LatLonShape here reduced the list somewhat – certainly for a common case (albeit indexing anything other than points simply isn't common among search/Lucene users).  Perhaps in time LatLonShape will mature in capability enough for us to remove spatial-extras if there's not much worth saving for the complexity.  It'd be helpful if you could validate your understanding against mine below.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I took a peek at the patch.  One bit surprised me.  There is an optimization with a comment as follows:

// If all docs have exactly one value and the cost is greater
// than half the leaf size then maybe we can make things faster
// by computing the set of documents that do NOT match the query

But isn't that impossible to detect when the indexed data is comprised of multiple triangles per document? It uses PointValues.size() to detect this but that value isn't useful here, right? I'm guessing you copy-pasted this logic for LatLonPoint code where it does apply.

 

 

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

Updated patch incorporating feedback from first round review.

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

But isn't that impossible to detect

Extremely unlikely to next to impossible? Yes, I think so. So I'm pretty confident this logic can be removed.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit b5ef13330f179ac6a2d98b9dfa4803811190e773 in lucene-solr's branch refs/heads/master from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b5ef133

LUCENE-8396: Add Points Based Shape Indexing and Search that decomposes shapes into a triangular mesh and indexes individual triangles as a 6 dimension point

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3b1714e737049ec562c8dce66fc5fe8cdad74097 in lucene-solr's branch refs/heads/master from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b1714e

LUCENE-8396: silence random large poly test - for now

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit bc29580378d78f707b5cf26da8321980e907573c in lucene-solr's branch refs/heads/branch_7x from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc29580

LUCENE-8396: Add Points Based Shape Indexing and Search that decomposes shapes into a triangular mesh and indexes individual triangles as a 6 dimension point

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 5s LUCENE-8396 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.
Subsystem Report/Notes
JIRA Issue LUCENE-8396
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12931617/LUCENE-8396.patch
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/47/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Nick Knize (@nknize) (migrated from JIRA)

Test Failure....

 [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestLatLonShapeQueries -Dtests.method=testRandomMedium -Dtests.seed=70F032E1B960036E -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th -Dtests.timezone=US/Eastern -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 24.3s J1 | TestLatLonShapeQueries.testRandomMedium <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
   [junit4]    > FAIL: id=7542 should match but did not
   [junit4]    >   query=LatLonShapeBoundingBoxQuery: field=shape:Rectangle(lat=0.0 TO 0.0 lon=8.381903171539307E-8 TO 90.07557132281363) docID=9514
   [junit4]    >   polygon=[-4.190951585769653E-8, 0.0] [89.99999995809048, 0.0] [89.99999995809048, 179.99999991618097] [-4.190951585769653E-8, 0.0] 
   [junit4]    >   deleted?=false  rect=Rectangle(0.0 TO 0.0 lon=8.381903171539307E-8 TO 90.07557132281363)
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([70F032E1B960036E:CD2E0549F8056008]:0)
   [junit4]    >        at org.apache.lucene.document.TestLatLonShapeQueries.verifyRandomBBoxes(TestLatLonShapeQueries.java:263)
   [junit4]    >        at org.apache.lucene.document.TestLatLonShapeQueries.verify(TestLatLonShapeQueries.java:134)
   [junit4]    >        at org.apache.lucene.document.TestLatLonShapeQueries.doTestRandom(TestLatLonShapeQueries.java:130)
   [junit4]    >        at org.apache.lucene.document.TestLatLonShapeQueries.testRandomMedium(TestLatLonShapeQueries.java:102)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: leaving temporary files on disk at: /home/jenkins/workspace/Lucene-Solr-BadApples-master-Linux/lucene/build/sandbox/
asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Test failure, from https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Windows/726/:

Checking out Revision a9f129190f9065c8775a628df181fb53248db488 (refs/remotes/origin/branch_7x)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestLatLonShapeQueries -Dtests.method=testRandomTiny -Dtests.seed=1BAECFBE374F07D6 -Dtests.slow=true -Dtests.locale=en-PN -Dtests.timezone=America/Edmonton -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.03s J1 | TestLatLonShapeQueries.testRandomTiny <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
   [junit4]    > FAIL: id=11 should not match but did
   [junit4]    >   query=LatLonShapeBoundingBoxQuery: field=shape:Rectangle(lat=-52.35433959402144 TO 58.83083060849458 lon=-79.63909948244691 TO 179.99999991618097) docID=11
   [junit4]    >   polygon=[-4.190951585769653E-8, -152.01172342523932] [89.99999995809048, -152.01172342523932] [89.99999995809048, -49.80329998768866] [-4.190951585769653E-8, -152.01172342523932] 
   [junit4]    >   deleted?=false  rect=Rectangle(-52.35433959402144 TO 58.83083060849458 lon=-79.63909948244691 TO 179.99999991618097)
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([1BAECFBE374F07D6:52E911F8696E3F7A]:0)
   [junit4]    >    at org.apache.lucene.document.TestLatLonShapeQueries.verifyRandomBBoxes(TestLatLonShapeQueries.java:262)
   [junit4]    >    at org.apache.lucene.document.TestLatLonShapeQueries.verify(TestLatLonShapeQueries.java:133)
   [junit4]    >    at org.apache.lucene.document.TestLatLonShapeQueries.doTestRandom(TestLatLonShapeQueries.java:129)
   [junit4]    >    at org.apache.lucene.document.TestLatLonShapeQueries.testRandomTiny(TestLatLonShapeQueries.java:97)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   [junit4]    >    at java.base/java.lang.Thread.run(Thread.java:844)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), sim=RandomSimilarity(queryNorm=true): {}, locale=en-PN, timezone=America/Edmonton