apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

TestGeoUtils.testGeoRelations is buggy with irregular rectangles [LUCENE-6908] #7966

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

The .testGeoRelations method doesn't exactly test the behavior of GeoPoint*Query as its using the BKD split technique (instead of quad cell division) to divide the space on each pass. For "large" distance queries this can create a lot of irregular rectangles producing large radial distortion error when using the cartesian approximation methods provided by GeoUtils. This issue improves the accuracy of GeoUtils cartesian approximation methods on irregular rectangles without having to cut over to an expensive oblate geometry approach.


Migrated from LUCENE-6908 by Nick Knize (@nknize), resolved Jan 18 2016 Attachments: LUCENE-6908.patch (versions: 5)

asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Patch includes:

asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Updated patch includes the following:

Next step is to test w/ BKD integration.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi @nknize, I think you forgot to add GeoRelationUtils.java before making the patch? And also, can you make the patch from higher up in the source tree? It seems to be missing e.g. changes to SloppyMath.java.

asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Thanks @mikemccand! Fixed the patch to include GeoRelationUtils and changes to SloppyMath.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @nknize! Lots of geo math that I don't understand :)

It hit a failure after some beasting:

   [junit4] Started J0 PID(28075@localhost).
   [junit4] Suite: org.apache.lucene.util.TestGeoUtils
   [junit4]   1> doc=2304 matched but should not
   [junit4]   1>   lon=-52.68950667232275 lat=-89.24937685020268 distanceMeters=267809.96722662856 vs radiusMeters=263432.23741533695
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeoUtils -Dtests.method=testGeoRelations -Dtests.seed=3B46325723FB86A9 -Dtests.multiplier=5 -Dtests.slow=true -Dtests.linedocsfile=/lucenedata/hudson.enwiki.random.lines.txt.fixed -Dtests.locale=vi_VN -Dtests.timezone=America/Halifax -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 2.61s | TestGeoUtils.testGeoRelations <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: 1 incorrect hits (see above)
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([3B46325723FB86A9:F96526E2570CF017]:0)
   [junit4]    >    at org.apache.lucene.util.TestGeoUtils.testGeoRelations(TestGeoUtils.java:533)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=HighCompressionCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=9, maxDocsPerChunk=323, blockSize=5), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=9, blockSize=5)), sim=RandomSimilarityProvider(queryNorm=true,coord=yes): {}, locale=vi_VN, timezone=America/Halifax
   [junit4]   2> NOTE: Linux 3.13.0-61-generic amd64/Oracle Corporation 1.8.0_60 (64-bit)/cpus=8,threads=1,free=396588944,total=504889344
   [junit4]   2> NOTE: All tests run in this JVM: [TestGeoUtils]
   [junit4] Completed [1/1] in 2.77s, 1 test, 1 failure <<< FAILURES!

Also, this comment seems stale now? (We seem to be using this method?):

// NOTE: not used for 2d at the moment. used for 3d w/ altitude (we can keep or add back)
asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Thanks @mikemccand. In isClosestPointOnRectWithinRange it was still using SloppyMath.haversin where it should have been using the Sinnott haversine. Looks like it was caused by a premature diff from my benchmark workspace.

A quick interesting benchmark re: distance methods. Using 2M iterations I benchmarked average computation time, average spatial error, and maximum spatial error between 5 different geospatial distance methods:

Distance Method Avg Computation (ns) Avg Error (%) Max Error (%)
Vincenty 1286.5529689998828 8.4561794778802E-11 7.2565793775802E-10
Karney 31770.479206999946 8.4561794778802E-11 7.2565793775802E-10
HaverECF 717.1139850000152 0.18409042301798453 0.6681179192384695
SloppyMath.haverin 159.3095249999995 0.22594450803222424 0.6539314586426048
Sinnott Haversine 146.81236699999738 0.18158435835918563 0.4931242857748373

I need to run some better descriptive statistics before drawing any conclusions. At the moment, the discrepancy between Sloppymath.haversin and Sinnott's implementation is consistent with the error described in the original publication (which is where the tolerance of 0.5% originated).

More to come....

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Those benchmark results are nice @nknize! Is the source for this bench checked in somewhere? From this, it seems like we should switch to Sinnot Haversine? It's fastest and lowest error?

I beasted for a while and no failures!

+1 to commit! Hopefully this means we can add DimensionalDistanceQuery and it just works!

asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Thanks @mikemccand! I plan to add the benchmark to luceneutil unless you think there's a better place for it? Going to commit shortly.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1718481 from @nknize in branch 'dev/trunk' https://svn.apache.org/r1718481

LUCENE-6908: Fix TestGeoUtils.testGeoRelations to handle irregular rectangles. Refactors relational methods to new GeoRelationUtils class.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1718486 from @nknize in branch 'dev/branches/branch_5x' https://svn.apache.org/r1718486

LUCENE-6908: Fix TestGeoUtils.testGeoRelations to handle irregular rectangles. Refactors relational methods to new GeoRelationUtils class.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

My Jenkins found a reproducible failure in TestGeoUtils.testGeoRelations:

   [junit4] Suite: org.apache.lucene.util.TestGeoUtils
   [junit4]   1> doc=1921 matched but should not on iteration 229
   [junit4]   1>   lon=87.14082470163703 lat=-89.39206877723336 distanceMeters=205662.45440744862 vs radiusMeters=203580.37384777897
   [junit4]   1> doc=2077 matched but should not on iteration 229
   [junit4]   1>   lon=63.26208980754018 lat=-89.36728684231639 distanceMeters=204170.67218267516 vs radiusMeters=203580.37384777897
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeoUtils -Dtests.method=testGeoRelations -Dtests.seed=4513B1942DE0E2D3 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=lv -Dtests.timezone=America/St_Vincent -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 2.24s | TestGeoUtils.testGeoRelations <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: 2 incorrect hits (see above)
asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

See also Policeman Jenkins failures at http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/14886/ and http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/15198/

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Another reproducible failure from my Jenkins (on branch_5x):

   [junit4] Suite: org.apache.lucene.util.TestGeoUtils
   [junit4]   1> doc=1357 matched but should not on iteration 2
   [junit4]   1>   lon=-24.478504993021488 lat=88.08790524490178 distanceMeters=303542.5030168316 vs radiusMeters=295327.75445295114
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeoUtils -Dtests.method=testGeoRelations -Dtests.seed=BAD657FC92E56596 -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=lv -Dtests.timezone=America/Knox_IN -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.33s J0 | TestGeoUtils.testGeoRelations <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: 1 incorrect hits (see above)
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([BAD657FC92E56596:78F54349E6121328]:0)
   [junit4]    >    at org.apache.lucene.util.TestGeoUtils.testGeoRelations(TestGeoUtils.java:533)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Lucene54, sim=RandomSimilarityProvider(queryNorm=false,coord=crazy): {}, locale=lv, timezone=America/Knox_IN
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=16,threads=1,free=434060912,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestGeoUtils]
   [junit4] Completed [1/20 (1!)] on J0 in 1.14s, 8 tests, 1 failure <<< FAILURES!
asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

One more reproducible failure from my Jenkins on branch_5x (I'll stop adding more after this one):

   [junit4] Suite: org.apache.lucene.util.TestGeoUtils
   [junit4]   1> doc=1431 matched but should not on iteration 50
   [junit4]   1>   lon=14.089814610779285 lat=88.21761829778552 distanceMeters=310627.1321615869 vs radiusMeters=308762.06620344025
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestGeoUtils -Dtests.method=testGeoRelations -Dtests.seed=70AF7A8C5D104698 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=es_CR -Dtests.timezone=Africa/Maputo -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 1.28s J3 | TestGeoUtils.testGeoRelations <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: 1 incorrect hits (see above)
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([70AF7A8C5D104698:B28C6E3929E73026]:0)
   [junit4]    >    at org.apache.lucene.util.TestGeoUtils.testGeoRelations(TestGeoUtils.java:533)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene54): {}, docValues:{}, sim=DefaultSimilarity, locale=es_CR, timezone=Africa/Maputo
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=16,threads=1,free=426047992,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestGeoUtils]
   [junit4] Completed [3/20 (1!)] on J3 in 1.74s, 8 tests, 1 failure <<< FAILURES!
asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

++ Thanks @sarowe There's a geometric approximation in the within that's a bit too lenient for BKD. These seeds have been super helpful in regression testing a fix. Should have a patch shortly. Sorry for the noise!

asfimport commented 8 years ago

Nick Knize (@nknize) (migrated from JIRA)

Updated patch includes:

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1722448 from @nknize in branch 'dev/trunk' https://svn.apache.org/r1722448

LUCENE-6908: Add space segmentation for handling irregular rectangle accuracy at the poles.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1722449 from @nknize in branch 'dev/branches/branch_5x' https://svn.apache.org/r1722449

LUCENE-6908: Add space segmentation for handling irregular rectangle accuracy at the poles.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@nknize can this be resolved?