apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.95k stars 692 forks source link

Intersected Geometry causes Exception on #1612

Open sehHeiden opened 1 week ago

sehHeiden commented 1 week ago

Expected behaviour

After the intersection. I expect that in sql ST_ISValid(geom) = false will list all problematic geometry, or that ST_MakeValid would fix them.

Actual behaviour

Instead, the filtering/fixing an exception is caught.

Steps to reproduce the problem

You can also see the error and the attempts for fixing at stackoverflow

Data Source: https://data.geobasis-bb.de/geobasis/daten/alkis/Vektordaten/shape/

I download data for all counties, unpacked them, loaded them. Did the intersection.

I could neither, count the data, nor make them ST_MakeValid, filter them with ST_IsValid and show.

from sedona.spark import *

config = SedonaContext.builder() .\
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.6.1,'
           'org.datasyslab:geotools-wrapper:1.6.1-28.2'). \
    getOrCreate()

sedona = SedonaContext.create(config)

buildings_gdf = sedona.read.format("shapefile").option("recursiveFileLookup", "true").load("../ALKIS/*/GebauedeBauwerk.shp").dropDuplicates(["oid"])
buildings_gdf.createOrReplaceTempView("buildings")
nutzung_gdf = sedona.read.format("shapefile").option("recursiveFileLookup", "true").load("../ALKIS/*/NutzungFlurstueck.shp").dropDuplicates(["oid"])
nutzung_gdf.createOrReplaceTempView("usage")

result_gdf = sedona.sql("""SELECT ST_Intersection(b.geometry, u.geometry) AS geometry, b.oid AS building_oid, b.gebnutzbez, b.gfkzshh, b.name, b.anzahlgs, b.gmdschl, b.lagebeztxt, b.funktion, u.oid AS flur_oid, u.nutzart, u.bez, u.flstkennz
                       FROM buildings b, usage u 
                       WHERE ST_Intersects(b.geometry, u.geometry)"""
                   )
result_gdf.createOrReplaceTempView("result")

result2_gdf = sedona.sql("""
                        SELECT building_oid
                        FROM result
                        WHERE ST_IsValid(geometry, 1) = false;
                         """)
result2_gdf.show()

The show causes the error:

org.apache.spark.sql.sedona_sql.expressions.InferredExpressionException: Exception occurred while evaluating expression ST_Intersection - inputs: [POLYGON ((354735.34 5758459.87, 354741.52 5758459.29, 354741.47 5758458.76, 354742.003 5758458.661, 354742.51 5758458.467, 354742.974 5758458.186, 354743.379 5758457.825, 354743.713 5758457.397, 354743.964 5758456.917, 354744.125 5758456.399, 354744.19 5758455.86, 354744.839 5758455.842, 354744.839 5758454.833, 354744.11 5758454.92, 354743.905 5758454.454, 354743.625 5758454.029, 354743.276 5758453.658, 354742.87 5758453.351, 354742.418 5758453.117, 354741.933 5758452.963, 354741.429 5758452.894, 354740.92 5758452.91, 354740.87 5758452.33, 354734.69 5758452.91, 354734.47 5758450.53, 354719.262 5758452.06, 354719.872 5758458.208, 354718.444 5758458.366, 354719.11 5758464.36, 354735.59 5758462.51, 354735.34 5758459.87)), MULTIPOLYGON (((354742.48361941514 5758458.483279419, 354742.5175583665 5758458.466906712, 354742.5512882294 5758458.450107478, 354742.5848036437 5758458.432884384, 354742.5599997879 5758458.445518499, 354742.53506431903 5758458.457890828, 354742.51 5758458.47, 354742.50122503354 5758458.474463283, 354742.49243145384 5758458.478889782, 354742.48361941514 5758458.483279419)), ((354744.19 5758455.86, 354744.18363257905 5758455.978846373, 354744.173464127 5758456.097428019, 354744.15950506565 5758456.215623404, 354744.1743896288 5758456.097518557, 354744.18455998343 5758455.978914724, 354744.19 5758455.86)), ((354742.76916934864 5758458.3281861525, 354742.856506923 5758458.272293335, 354742.9418533804 5758458.213404956, 354743.02510602196 5758458.151591878, 354743.006871307 5758458.164586073, 354742.988501966 5758458.177389245, 354742.97 5758458.19, 354742.90448312287 5758458.238112903, 354742.83751848264 5758458.284189465, 354742.76916934864 5758458.3281861525)), ((354742.4807933091 5758453.143215798, 354742.36904797406 5758453.096683949, 354742.255514648 5758453.054702802, 354742.1403785658 5758453.01734085, 354742.23469866544 5758453.048476773, 354742.32793967956 5758453.082708925, 354742.42 5758453.12, 354742.44034595153 5758453.127524007, 354742.4606111444 5758453.135262895, 354742.4807933091 5758453.143215798)), ((354743.56067013886 5758453.949935026, 354743.47363280016 5758453.851548408, 354743.382367441 5758453.757070585, 354743.28704942006 5758453.666683088, 354743.380944933 5758453.758478318, 354743.4721767185 5758453.852921386, 354743.56067013886 5758453.949935026)), ((354741.50910225126 5758452.898420274, 354741.6329594108 5758452.913596253, 354741.75649912434 5758452.931170253, 354741.87967496796 5758452.95113567, 354741.7569187535 5758452.928421285, 354741.633323018 5758452.9108393155, 354741.50910225126 5758452.898420274)), ((354742.87 5758453.35, 354743.00843700604 5758453.446922913, 354743.143314634 5758453.548740796, 354743.2744598884 5758453.655323056, 354743.1454582713 5758453.545990117, 354743.01049506286 5758453.444107629, 354742.87 5758453.35), (354741.47 5758458.76, 354741.69640811626 5758458.730853419, 354741.9199317863 5758458.6845141575, 354742.1392674122 5758458.621252467, 354742.09311297646 5758458.635119098, 354742.04668400256 5758458.648036752, 354742 5758458.66, 354741.91386881284 5758458.682274899, 354741.8271822026 5758458.702279502, 354741.74 5758458.72, 354741.6503692072 5758458.735750736, 354741.5603475359 5758458.74908728, 354741.47 5758458.76)), ((354743.91 5758454.45, 354743.82046237413 5758454.307324911, 354743.7278918259 5758454.166904364, 354743.63 5758454.03, 354743.676666975 5758454.1000003815, 354743.7233338356 5758454.170000792, 354743.77 5758454.24, 354743.8166666031 5758454.309999824, 354743.8633327484 5758454.37999928, 354743.91 5758454.45), (354743.5687541657 5758457.608473395, 354743.7185425014 5758457.395727332, 354743.849262841 5758457.170761499, 354743.959924779 5758456.935280356, 354744.04968988255 5758456.691068032, 354744.0221770005 5758456.768298324, 354743.9922706891 5758456.844633698, 354743.96 5758456.92, 354743.9258946364 5758457.001157474, 354743.8892155034 5758457.081184673, 354743.85 5758457.16, 354743.8060313074 5758457.241553628, 354743.75934706524 5758457.321583757, 354743.71 5758457.4, 354743.6652958208 5758457.4710824955, 354743.6181956441 5758457.540600536, 354743.5687541657 5758457.608473395)))], cause: found non-noded intersection between LINESTRING ( 354741.47 5758458.76, 354741.69640811626 5758458.730853419 ) and LINESTRING ( 354741.6503692072 5758458.735750736, 354741.5603475359 5758458.74908728 ) [ (354741.5973395641, 5758458.743606979, NaN) ]

Question. Why does intersection cause this error at the first place? Why can't I just filter it all. Reads as the intersection is a point.

  1. Why does that throw an error?
  2. I would like to keep only geometries of the original (Multi-)Polygon Type.

Settings

Using the latest Docker container.

Sedona version = 1.6.1

Apache Spark version = 3.4.1

Python version = 3.10.12

Environment = Docker on local machine

github-actions[bot] commented 1 week ago

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

atiannicelli commented 1 week ago

Here is another way to get the same issue (This comes from code that is attempted to match polygons)

select 
    ST_ISVALID(geometry2),
    ST_ISVALID(geometry1),
    ST_INTERSECTION(geometry2, geometry1) AS intersection
from (
    VALUES (
        ST_GEOMETRYFROMTEXT('MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))'),
        ST_GEOMETRYFROMTEXT('POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))'))
) as t( geometry1 , geometry2 )

I see the same error on ST_INTERSECTION and ST_UNION. Also - I saw that if I swap geometry 1 and 2 then it works. This is a high priority issue as there is no way to detect that this will fail until it has failed. Also, this exact query succeeds on Trino.

james-willis commented 1 week ago
  1. Why does that throw an error?

Polygons containing non-noded intersectionsare not valid

I think the real question is here is why does is_valid return true when a (multi)polygon contains a self intersection. I will dig into this and reach out to JTS maintainers if need be.

atiannicelli commented 1 week ago

By the way, I already fixed the osm building. (Just in case you go to look at and and wonder)

james-willis commented 1 week ago

I am able to repro this in JTS:

import org.locationtech.jts.geom.{Geometry, GeometryFactory}
import org.locationtech.jts.io.WKTReader

def loadWKT(wktString: String): Geometry = {
    val reader = new WKTReader(new GeometryFactory())
    reader.read(wktString)
}

val geomWkt = "MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))"
val polyWkt = "POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))"

val mp = loadWKT(geomWkt)
val p = loadWKT(polyWkt)

mp.isSimple() && mp.isValid() && p.isSimple() && p.isValid() // true

mp.intersection(p) // runs fine
p.intersection(mp) // throws similar error as above

exact error:

found non-noded intersection between LINESTRING ( -5.9855813 54.9374457, -5.985647446386959 54.93747161301757 ) and LINESTRING ( -5.98564744638696 54.93747161301757, -5.985647446386958 54.93747161301757 ) [ (-5.985647446386959, 54.93747161301757, NaN) ]

To me it seems there is some failure in JTS to identify invalid polygons. I see the same behavior in shapely (ie geos).

going to keep digging

ETA: this is JTS version 1.19.0 ETA2: exists in JTS version 1.20.0

james-willis commented 1 week ago

Edit: There was a bug in the code below so the extrapolations made from it are wrong.

I wrote a function to test for self intersections in the exterior ring of Polygons and tested the geometries from the issue. The geometry has a self intersection that JTS seems to not be detecting. I will talk to Jia today and consider writing a issue against JTS.

Function:

import org.locationtech.jts.geom.{Geometry, GeometryFactory}
import org.locationtech.jts.io.WKTReader

def loadWKT(wktString: String): Geometry = {
    val reader = new WKTReader(new GeometryFactory())
    reader.read(wktString)
}

def hasSelfIntersections(geom: Geometry) = {
    // I only wrote this to consider Polygons and only their external ring
    // Probably works for Linestrings too
    val geometryFactory = new GeometryFactory()

    val coords = geom.getCoordinates()
    val segments = (0 until coords.length - 1).map { i =>
      val startPoint = coords(i)
      val endPoint = coords(i + 1)
      geometryFactory.createLineString(Array(startPoint, endPoint))
    }

    val intersections = for {
      i <- segments.indices
      j <- i + 1 until segments.size // Avoid checking a segment against itself
      if (segments(i).intersects(segments(j)))
    } yield (segments(i).intersection(segments(j)))

    // All intersections are points
    val intersectionsArePoints = intersections.filter(x => x.getGeometryType != "Point").length == 0
    // All of those points are the coordinates in the geometry (ie NOT non-noded self-intersections)
    val intersectionsAreGeomCoords = intersections.map(x => x.getCoordinate).toSet.diff(coords.toSet).size == 0

    !(intersectionsArePoints && intersectionsAreGeomCoords)
}

Testing:

val mpWkt = "MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))"
val pWkt= "POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))"
val mp = loadWKT(mpWkt)
val p = loadWKT(pWkt)

// multipolygon's polygons have no self intersections
(0 until mp.getNumGeometries).map(mp.getGeometryN).toArray.map(geom => hasSelfIntersections(geom)).forall(_ == false) // true

// polygon has self intersection
hasSelfIntersections(p) // true
james-willis commented 1 week ago

Edit: The code below does what it claims to do but does not catch the case in Alex's example, which is due to some behavior of JTS, I will describe a bit in a subsequent comment

~Here is a python function that can be used as a UDF in sedona to filter out these cases. As decribed in the docstring, not all cases are handled, someone could augment the function to handle those cases.~

from shapely.geometry import Polygon, LineString, Multipolygon

def has_self_intersections(geom):
  """Checks some cases of a self intersecting geometry.

  This function checks only the following cases:
  * A LineString self-intersecting
  * The exterior ring of a polygon self-intersecting. It does not consider holes.
  * The exterior ring of each polygon in a multipolygon is self-intersecting. It does not check if they intersect each other.
  """
  if isinstance(geom, (Polygon, LineString)):
      coords = list(geom.exterior.coords) if isinstance(geom, Polygon) else list(geom.coords)
      segments = []
      for i in range(len(coords) - 1):
        segments.append(LineString([coords[i], coords[i + 1]]))

      intersections = []
      for i in range(len(segments)):
        for j in range(i + 1, len(segments)):  # Avoid checking a segment against itself
          if segments[i].intersects(segments[j]):
            intersections.append(segments[i].intersection(segments[j]))

      # All intersections are points
      intersections_are_points = all(i.geom_type == 'Point' for i in intersections)
      # All of those points are the coordinates in the geometry (ie NOT non-noded self-intersections)
      intersection_coords = set([i.coords[0] for i in intersections])
      intersections_are_geom_coords = intersection_coords - set(coords)

      return not (intersections_are_points and intersections_are_geom_coords)
  elif isinstance(geom, MultiPolygon):
      return any(has_self_intersections(geom) for geom in mp.geoms)
  else:
    raise ValueError("Geometry must be a Polygon or LineString")
Kontinuation commented 1 week ago

Another approach to resolve this problem is enabling the OverlayNG feature of JTS by setting the system property jts.overlay=ng. You can configure the JTS overlay algorithm by specifying the following spark configuration options when launching the Spark session:

spark.driver.extraJavaOptions -Djts.overlay=ng
spark.executor.extraJavaOptions -Djts.overlay=ng

I've tried all the test cases mentioned in this issue and they all work with OverlayNG.

james-willis commented 1 week ago

As Kristin said, enabling OverlayNG will allow you to run intersections on the invalid geometries (Sebastian's MultiPolygon and Alex's Polygon). Its unclear to me what the meaning of an intersection on an invalid geometry might mean.

@sehHeiden, your invalid multipolygon returns false when I call the IsValid method on it in raw JTS. In your stack overflow you call IsValid or MakeValid AFTER the intersection. You need to do it before. Otherwise you're still calling  the intersection with a bad input. Be careful with MakeValid, it can give unexpected results. In a building conflation use case I would consider throwing out invalid geometries.

@atiannicelli, the polygon case seems like a bonafide bug in jts ~ IsValid functionality~. I have filed a bug with them: https://github.com/locationtech/jts/issues/1085

Edit: Further investigation seems to find that this is a lack of robustness in the legacy overlay implementation of JTS, which is why enabling JTS' OverlayNG resolves. Will write up a new comment detailing findings

atiannicelli commented 1 week ago

@james-willis thanks so much! This looks promising!

Here is what I tried, and I am seeing good results

config = (
    SedonaContext.builder()
    .config("spark.driver.extraJavaOptions", "-Djts.overlay=ng")
    .config("spark.executor.extraJavaOptions", "-Djts.overlay=ng")
    .getOrCreate()
)

spark = SedonaContext.create(config)
james-willis commented 1 week ago

I performed some more testing and revised the Scala code above. There was a bug in my code. I found that both of Alex's geometries seem to be valid, and that JTS agrees they are valid. The bug seems to be in the overlays implementation that underlies the intersection method, which in turn underlies the Sedona ST_Intersection function.

When jts.overlay is left as the default value, SnapIfNeededOverlayOp is used. When it is equal to ng, OverlayNGRobust is used. There are 3 more implementations of overlay I am aware of, OverlayOp, OverlayNG and SnapOverlayOp. These are not supported for use via the intersection method of the Geometry class. Of these 5 implementations, I found that only OverlayNGRobust works on both orderings of arguments for the geometries for intersection. The others fail on either one or both orderings of geometries

ETA: JTS issue: https://github.com/locationtech/jts/issues/1086