apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.68k stars 1.04k forks source link

Does the method #cureLocalIntersections in the Tessellator make any sense? #11767

Open iverase opened 2 years ago

iverase commented 2 years ago

Description

I have always been confused by this method which claims to fix self-intersections of the polygon but we require in the Tessellator java docs that polygons should not have self-intersections. More over, we have no test that needs this method to pass so it is currently untested / unnecessary.

I am bringing this up because I have an example of a polygon that it seems to spend half of the time on that method. It is a huge polygon that with this method takes around 560 seconds to tessellate but 260 seconds if we don't go through this method.

My proposal is to remove the method completely or at least not call this method if the Tessellator has been called with the flag checkSelfIntersections set to true.

@nknize introduced this method on the first version of the Tessellator, he might have more background about the need of this method. what do you think?

llermaly commented 2 years ago

Hi @iverase would be nice if you could go to https://github.com/apache/lucene/issues/11777 and test with those polygons as well. We are having Elastic Cloud timeouts because of the time it takes to triangulate.

nknize commented 2 years ago

My proposal is to remove the method completely or at least not call this method if the Tessellator has been called with the flag checkSelfIntersections set to true.

@nknize introduced this method on the first version of the Tessellator, he might have more background about the need of this method. what do you think?

I was actually looking into this before turning my attention to the shape doc values and I was just getting ready to come back to it. The method was originally introduced to postpone self intersection removal unless absolutely necessary (e.g., tessellation failed). Essentially it was a lazy cleaning approach. I believe, though this needs thorough evaluation, some of the improvements made to filter points rendered much of this logic obsolete. My last test (random and explicit) never actually exercised this logic, even with self intersections. Furthermore, the follow up SPLIT logic was not exercised either so I was exploring the possibility of removing both of these phases.

if you could go to https://github.com/apache/lucene/issues/11777 and test with those polygons as well. We are having Elastic Cloud timeouts because of the time it takes to triangulate.

These adversarial cases are important to capture in our tests. Our randomized polygon generator doesn't inject any self intersections so we really have a gap in testing the logic coverage.

llermaly commented 2 years ago

Here I have some valid polygons being rejected for self intersecting, in case are useful for you to test:

https://github.com/apache/lucene/issues/11776

iverase commented 2 years ago

The method was originally introduced to postpone self intersection removal

I don't understand this. We re claiming in the java docs that polygons should not be self-intersecting and we do not introduce self-intersections in our code, why we want to remove them?

 * <ul>
 *   <li>Requires valid polygons:
 *       <ul>
 *         <li>No self intersections
 *         <li>Holes may only touch at one vertex
 *         <li>Polygon must have an area (e.g., no "line" boxes)
 *         <li>sensitive to overflow (e.g, subatomic values such as E-200 can cause unexpected
 *             behavior)
 *       </ul>
 * </ul>

Looking at the original code which the tessellator is inspired on, the method was introduced to handle some OSM polygons that contain self-intersections, hence not valid: https://github.com/mapbox/earcut/issues/8 As we claim we only support valid polygons, I think it is safe to remove the method entirely.

@llermaly I found this issue by looking into one of your polygons so we should expect nice performance improvements.

nknize commented 2 years ago

I don't understand this. We re claiming in the java docs that polygons should not be self-intersecting and we do not introduce self-intersections in our code, why we want to remove them?

Because real life Geo data doesn't care what our javadocs say. Small self intersections are a reality that rears its head every now and then in real data and the performance hit to "best effort" detect and clean in the tessellator's cure phase at index time was worth more than directing user's to a third party cleaning utility before indexing.

Our blind polygon class does nothing to enforce those javadocs so maybe before completely removing we might consider flagging that phase as an optional validation step disabled by default? (Think ignore_malformed)

iverase commented 2 years ago

Then let's just skip that step if the tessellator is called with the flag checkSelfIntersections set to true. In that case it will fail before getting to this place.