apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

Indexing non-point shapes close to the poles doesn't scale [LUCENE-5056] #6120

Open asfimport opened 11 years ago

asfimport commented 11 years ago

From: Hal Deadman We are seeing an issue where certain shapes are causing Solr to use up all available heap space when a record with one of those shapes is indexed. We were indexing polygons where we had the points going clockwise instead of counter-clockwise and the shape would be so large that we would run out of memory. We fixed those shapes but we are seeing this circle eat up about 700MB of memory before we get an OutOfMemory error (heap space) with a 1GB JVM heap. Circle(3.0 90 d=0.0499542757922153) Google Earth can't plot that circle either, maybe it is invalid or too close to the north pole due to the latitude of 90, but it would be nice if there was a way for shapes to be validated before they cause an OOM error. The objects (4.5 million) are all GeohashPrefixTree$GhCell objects in an ArrayList owned by PrefixTreeStrategy$CellTokenStream. Is there anyway to have a max number of cells in a shape before it is considered too large and is not indexed? Is there a geo library that could validate the shape as being reasonably sized and bounded before it is processed? We are currently using Solr 4.1. <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees" />

indexed circle close to the pole.png


Migrated from LUCENE-5056 by Hal Deadman, updated May 09 2016 Attachments: indexed circle close to the pole.png Linked issues:

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

The attached pic shows the circle of the same size (some \~5km radius) at 88 degrees latitude. It generated a whopping 7888 cells; and it gets worse closer to the pole. Technically this approach works but it clearly doesn't scale at the poles.

I'm gonna have to think about this one for a bit. I think a fix requires a new SpatialPrefixTree encoding that divides the world differently at the poles. Solving this is arguably a new requirement for #5987.

asfimport commented 11 years ago

Hal Deadman (migrated from JIRA)

This might not be the same issue, but we have a small rectangle that uses a really large amount of memory:

POLYGON((1.025 1.025, 1.025 1.101, 1.101 1.101, 1.101 1.025, 1.025 1.025))

If we change it just a little we don't get out of memory errors:

POLYGON((1.025001 1.025, 1.025 1.101, 1.101 1.101, 1.101 1.025, 1.025001 1.025))

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

The WKT spec says counter-clockwise order for the outer shell, and Spatial4j demands that for rectangles expressed as polygons. A lot of software (OpenLayers, JTS, PostGIS) doesn't care and lets you do it however you want, even though technically the shape is ambiguous (which part of the ring is the inside vs the outside?). This is in the FAQ on Solr's wiki. In the next version of Spatial4j I'll make it support both.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

FWIW I'm going to try hard to get a fix in by Lucene 4.7: https://github.com/spatial4j/spatial4j/issues/52

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I was just thinking about this bug/shortcoming or whatever one might call it. There's an easy solution to this – modify the algorithm that determines how many "levels" to go down so it's based on a Euclidean computation, not geodesic. It means that the shape is going to be a lot more "blocky" (approximated) than the same same on the equator. But I feel that doesn't matter, or at least it won't matter soon once RecursivePrefixTree & SerializedDVStrategy get linked up such that indexed non-point shapes will be validated for precision against the actual vector geometry. Once that happens, it will matter very little how few grid cells represent a shape since you'll always have absolute precision as far as the shape geometry can calculate it.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.