elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.09k stars 24.84k forks source link

Cannot index some geo_shape geometries (before and after Geo Refactoring changes) #3909

Closed tommcintyre closed 9 years ago

tommcintyre commented 11 years ago

Probably FAO @chilling .

I have a mapping that includes some geo_shape fields. My test data contains GeoJSON fields that specify points, but do so in the form of polygons with the same lon/lat repeated 4 (or sometimes 3) times. This causes a validation exception to be thrown up from within Spatial4J when it generates the polygon.

I am not sure if this is technically invalid GeoJSON or not - however, this is a form that the Twitter API generates frequently (around 1 in 1500 tweets in my dataset), and other libraries I have used can parse it OK. It would be good if ES can attempt to do what it can with data like this, rather than failing (i.e. treat it as another appropriate type like a point, or relax the verification).

This is tested against the branch that includes the Geo-Refactoring improvements at chilling/elasticsearch@0369983afdcc7c6e8db7703b20889f69ac14221a (it does also happen on master, if you can get that far!).

The stack trace:

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [place]
    at org.elasticsearch.index.mapper.geo.GeoShapeFieldMapper.parse(GeoShapeFieldMapper.java:232)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:508)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:452)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:341)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:533)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:418)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    ... 1 more
Caused by: com.spatial4j.core.exception.InvalidShapeException: Too few distinct points in geometry component at or near point (-81.872495, 36.163117, NaN)
    at com.spatial4j.core.shape.jts.JtsGeometry.(JtsGeometry.java:90)
    at org.elasticsearch.common.geo.builders.BasePolygonBuilder.build(BasePolygonBuilder.java:153)
    at org.elasticsearch.index.mapper.geo.GeoShapeFieldMapper.parse(GeoShapeFieldMapper.java:219)
    ... 15 more

To reproduce:

curl -XPUT http://server:9200/dummyindex -d '
{
    "mappings": {
        "dummytype": {
            "properties": {
                "place": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "10m"
                }
            }
        }
    }
}'

curl -XPOST http://server:9200/dummyindex/dummytype/1 -d '
{
    "place": {
        "coordinates": [[[-81.872495, 36.163117], [-81.872495, 36.163117], [-81.872495, 36.163117], [-81.872495, 36.163117]]],
        "type": "Polygon"
    }
}'

I wrote a nasty, hacky workaround to fix this particular case - see below. This is obviously not an acceptable general purpose solution, as it doesn't address the issue for any other shapes, or other cases like only having 2 or 3 distinct points. I am not really familiar enough with the libraries, but I guess a real fix might involve simplifying or normalizing each geometry somehow before it gets passed to Spatial4J?

Alternatively it could be deemed that this is not ES's job, and the GeoJSON needs to be more well-formed - but I think this is likely to be a common problem due to the source of this data, and because of the different expectations of different GeoJSON libraries.

ShapeBuilder.java:

-        protected static PolygonBuilder parsePolygon(CoordinateNode coordinates) {
+        protected static ShapeBuilder parsePolygon(CoordinateNode coordinates) {
             LineStringBuilder shell = parseLineString(coordinates.children.get(0));
+            
+            if (new HashSet<Coordinate>(shell.points).size() == 1)
+                return newPoint(shell.points.get(0));
+            
             PolygonBuilder polygon = new PolygonBuilder(shell.points);
             for (int i = 1; i < coordinates.children.size(); i++) {
                 polygon.hole(parseLineString(coordinates.children.get(i)));

...

         protected static MultiPolygonBuilder parseMultiPolygon(CoordinateNode coordinates) {
             MultiPolygonBuilder polygons = newMultiPolygon();
             for (CoordinateNode node : coordinates.children) {
-                polygons.polygon(parsePolygon(node));
+              polygons.polygon((PolygonBuilder)parsePolygon(node));
             }
             return polygons;
         }
tommcintyre commented 11 years ago

Some chat from IRC with opinions about whether this belongs in ES or not... :)

<Tim> GeoJSON spec is a little vague. A polygon must have a LinearRing, defined as:
<Tim> A LinearRing is closed LineString with 4 or more positions. The first and last positions are equivalent (they represent equivalent points). Though a LinearRing is not explicitly represented as a GeoJSON geometry type, it is referred to in the Polygon geometry type definition.
<Tim> not really clear that they must be distinct or non-self-intersecting
<Tom> yeah - i think it's just not well defined
<Tom> the new geo-refactored ES code does seem to attempt to make the best possible use of "borderline" valid geometries
<Tom> for example, twitter emits open linestrings quite often, as well these polygons with repeating points
<Tom> and the new geo-refactored ES code does now accept open linestrings
<Tom> (the old code didn't)
chilling commented 11 years ago

Hi @mvjars,

I think the GeoJson is technically valid. The problem is forced by the underlying libraries that we use to handle geo data inside ES. These libraries make some assumptions to the data which are checked before the data gets indexed. In my opinion this assumptions must not necessarily be checked if the GeoJson is indexed. But currently we use this libraries to index geo data also. I'm working on separating the GeoJson format specification from the logic layer and move this kind of exceptions to the logic layer.

lababidi commented 9 years ago

Is there any update on this issue?

clintongormley commented 9 years ago

@nknize any ideas here?

nknize commented 9 years ago

@lababidi Out of curiosity why make this a polygon type? Why not change it to a multipoint?

lababidi commented 9 years ago

@nknize Thanks for checking in. I'm not sure I understand what your question is. Let me see if I can reframe the issue, succinctly.

Twitter provides the following object. It is rejected by elasticsearch for only having 4 elements and not 5. This strict requirement is not consistent with GeoJSON. Please allow elasticsearch to accept this Object, out of the box, because technically it is legal GeoJSON. Currently I must write a middle-man to add a 5th-element (cue Bruce Willis jokes here) to this array of coordinates. This is a bit kludgy in terms of workflow:

"bounding_box": {
        "type": "Polygon",
        "coordinates": [
        [
        [
        -9.0915413,
        38.6713816
        ],
        [
        -9.0915413,
        38.9313732
        ],
        [
        -8.8102479,
        38.9313732
        ],
        [
        -8.8102479,
        38.6713816
        ]
        ]
        ]
        },
nknize commented 9 years ago

@lababidi Thanks for the update. I mistook the problem you're describing for the original issue (repeated points in a polygon). I agree we should accept Polygons like this and close them for you.

lababidi commented 9 years ago

This is excellent news. I'd be happy to contribute any code if your plate is full. Just point me to where the preferred location that this logic belongs in and I will pull request.

lababidi commented 9 years ago

@nknize If you prefer to close this issue, you may. I realized my statement was a separate issue from this issue and create it at #11131

nknize commented 9 years ago

@lababidi Excellent! If you'd like to get involved we'd love to have you contribute. You'll need to sign the CLA if you haven't already. I went ahead and opened a PR to address #11131, have a look and try it out.