Open sehHeiden opened 1 month ago
Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.
Here is another way to get the same issue (This comes from code that is attempted to match polygons)
select
ST_ISVALID(geometry2),
ST_ISVALID(geometry1),
ST_INTERSECTION(geometry2, geometry1) AS intersection
from (
VALUES (
ST_GEOMETRYFROMTEXT('MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))'),
ST_GEOMETRYFROMTEXT('POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))'))
) as t( geometry1 , geometry2 )
I see the same error on ST_INTERSECTION and ST_UNION. Also - I saw that if I swap geometry 1 and 2 then it works. This is a high priority issue as there is no way to detect that this will fail until it has failed. Also, this exact query succeeds on Trino.
- Why does that throw an error?
Polygons containing non-noded intersectionsare not valid
I think the real question is here is why does is_valid return true when a (multi)polygon contains a self intersection. I will dig into this and reach out to JTS maintainers if need be.
By the way, I already fixed the osm building. (Just in case you go to look at and and wonder)
I am able to repro this in JTS:
import org.locationtech.jts.geom.{Geometry, GeometryFactory}
import org.locationtech.jts.io.WKTReader
def loadWKT(wktString: String): Geometry = {
val reader = new WKTReader(new GeometryFactory())
reader.read(wktString)
}
val geomWkt = "MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))"
val polyWkt = "POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))"
val mp = loadWKT(geomWkt)
val p = loadWKT(polyWkt)
mp.isSimple() && mp.isValid() && p.isSimple() && p.isValid() // true
mp.intersection(p) // runs fine
p.intersection(mp) // throws similar error as above
exact error:
found non-noded intersection between LINESTRING ( -5.9855813 54.9374457, -5.985647446386959 54.93747161301757 ) and LINESTRING ( -5.98564744638696 54.93747161301757, -5.985647446386958 54.93747161301757 ) [ (-5.985647446386959, 54.93747161301757, NaN) ]
To me it seems there is some failure in JTS to identify invalid polygons. I see the same behavior in shapely (ie geos).
going to keep digging
ETA: this is JTS version 1.19.0 ETA2: exists in JTS version 1.20.0
Edit: There was a bug in the code below so the extrapolations made from it are wrong.
I wrote a function to test for self intersections in the exterior ring of Polygons and tested the geometries from the issue. The geometry has a self intersection that JTS seems to not be detecting. I will talk to Jia today and consider writing a issue against JTS.
Function:
import org.locationtech.jts.geom.{Geometry, GeometryFactory}
import org.locationtech.jts.io.WKTReader
def loadWKT(wktString: String): Geometry = {
val reader = new WKTReader(new GeometryFactory())
reader.read(wktString)
}
def hasSelfIntersections(geom: Geometry) = {
// I only wrote this to consider Polygons and only their external ring
// Probably works for Linestrings too
val geometryFactory = new GeometryFactory()
val coords = geom.getCoordinates()
val segments = (0 until coords.length - 1).map { i =>
val startPoint = coords(i)
val endPoint = coords(i + 1)
geometryFactory.createLineString(Array(startPoint, endPoint))
}
val intersections = for {
i <- segments.indices
j <- i + 1 until segments.size // Avoid checking a segment against itself
if (segments(i).intersects(segments(j)))
} yield (segments(i).intersection(segments(j)))
// All intersections are points
val intersectionsArePoints = intersections.filter(x => x.getGeometryType != "Point").length == 0
// All of those points are the coordinates in the geometry (ie NOT non-noded self-intersections)
val intersectionsAreGeomCoords = intersections.map(x => x.getCoordinate).toSet.diff(coords.toSet).size == 0
!(intersectionsArePoints && intersectionsAreGeomCoords)
}
Testing:
val mpWkt = "MULTIPOLYGON (((-5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.985647446386958 54.93747161301757, -5.9857103 54.9374386)), ((-5.9857103 54.9374386, -5.985647446386961 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974, -5.9857103 54.9374386)), ((-5.9856469 54.9374719, -5.985647446386961 54.93747161301757, -5.985647446386958 54.93747161301757, -5.9856469 54.9374719)))"
val pWkt= "POLYGON ((-5.985979 54.9372974, -5.9857103 54.9374386, -5.9856866 54.9374228, -5.985751 54.9373909, -5.9856392 54.9373165, -5.9855676 54.937352, -5.9856475 54.9374051, -5.9855813 54.9374457, -5.9854978 54.9374129, -5.9854129 54.9374842, -5.9854322 54.9375123, -5.9856833 54.9376109, -5.9857849 54.9375256, -5.9857909 54.9375201, -5.9857555 54.9374875, -5.9859626 54.9373794, -5.9859873 54.9373949, -5.9860307 54.9373721, -5.9860036 54.9373551, -5.9860402 54.9373359, -5.985979 54.9372974), (-5.9856469 54.9374719, -5.9855813 54.9374457, -5.98564744638696 54.93747161301757, -5.9856492 54.9374723, -5.9857264 54.9375026, -5.9856469 54.9374719))"
val mp = loadWKT(mpWkt)
val p = loadWKT(pWkt)
// multipolygon's polygons have no self intersections
(0 until mp.getNumGeometries).map(mp.getGeometryN).toArray.map(geom => hasSelfIntersections(geom)).forall(_ == false) // true
// polygon has self intersection
hasSelfIntersections(p) // true
Edit: The code below does what it claims to do but does not catch the case in Alex's example, which is due to some behavior of JTS, I will describe a bit in a subsequent comment
~Here is a python function that can be used as a UDF in sedona to filter out these cases. As decribed in the docstring, not all cases are handled, someone could augment the function to handle those cases.~
from shapely.geometry import Polygon, LineString, Multipolygon
def has_self_intersections(geom):
"""Checks some cases of a self intersecting geometry.
This function checks only the following cases:
* A LineString self-intersecting
* The exterior ring of a polygon self-intersecting. It does not consider holes.
* The exterior ring of each polygon in a multipolygon is self-intersecting. It does not check if they intersect each other.
"""
if isinstance(geom, (Polygon, LineString)):
coords = list(geom.exterior.coords) if isinstance(geom, Polygon) else list(geom.coords)
segments = []
for i in range(len(coords) - 1):
segments.append(LineString([coords[i], coords[i + 1]]))
intersections = []
for i in range(len(segments)):
for j in range(i + 1, len(segments)): # Avoid checking a segment against itself
if segments[i].intersects(segments[j]):
intersections.append(segments[i].intersection(segments[j]))
# All intersections are points
intersections_are_points = all(i.geom_type == 'Point' for i in intersections)
# All of those points are the coordinates in the geometry (ie NOT non-noded self-intersections)
intersection_coords = set([i.coords[0] for i in intersections])
intersections_are_geom_coords = intersection_coords - set(coords)
return not (intersections_are_points and intersections_are_geom_coords)
elif isinstance(geom, MultiPolygon):
return any(has_self_intersections(geom) for geom in mp.geoms)
else:
raise ValueError("Geometry must be a Polygon or LineString")
Another approach to resolve this problem is enabling the OverlayNG feature of JTS by setting the system property jts.overlay=ng
. You can configure the JTS overlay algorithm by specifying the following spark configuration options when launching the Spark session:
spark.driver.extraJavaOptions -Djts.overlay=ng
spark.executor.extraJavaOptions -Djts.overlay=ng
I've tried all the test cases mentioned in this issue and they all work with OverlayNG.
As Kristin said, enabling OverlayNG will allow you to run intersections on the invalid geometries (Sebastian's MultiPolygon and Alex's Polygon). Its unclear to me what the meaning of an intersection on an invalid geometry might mean.
@sehHeiden, your invalid multipolygon returns false when I call the IsValid method on it in raw JTS. In your stack overflow you call IsValid or MakeValid AFTER the intersection. You need to do it before. Otherwise you're still calling the intersection with a bad input. Be careful with MakeValid, it can give unexpected results. In a building conflation use case I would consider throwing out invalid geometries.
@atiannicelli, the polygon case seems like a bonafide bug in jts ~ IsValid functionality~. I have filed a bug with them: https://github.com/locationtech/jts/issues/1085
Edit: Further investigation seems to find that this is a lack of robustness in the legacy overlay implementation of JTS, which is why enabling JTS' OverlayNG resolves. Will write up a new comment detailing findings
@james-willis thanks so much! This looks promising!
Here is what I tried, and I am seeing good results
config = (
SedonaContext.builder()
.config("spark.driver.extraJavaOptions", "-Djts.overlay=ng")
.config("spark.executor.extraJavaOptions", "-Djts.overlay=ng")
.getOrCreate()
)
spark = SedonaContext.create(config)
I performed some more testing and revised the Scala code above. There was a bug in my code. I found that both of Alex's geometries seem to be valid, and that JTS agrees they are valid. The bug seems to be in the overlays implementation that underlies the intersection method, which in turn underlies the Sedona ST_Intersection function.
When jts.overlay is left as the default value, SnapIfNeededOverlayOp
is used. When it is equal to ng, OverlayNGRobust is used. There are 3 more implementations of overlay I am aware of, OverlayOp, OverlayNG and SnapOverlayOp. These are not supported for use via the intersection method of the Geometry class. Of these 5 implementations, I found that only OverlayNGRobust works on both orderings of arguments for the geometries for intersection. The others fail on either one or both orderings of geometries
ETA: JTS issue: https://github.com/locationtech/jts/issues/1086
@jiayuasu should we consider adding some note to the documentation about this and close?
@james-willis We should. Add it to the API doc of the corresponding SQL function?
Expected behaviour
After the intersection. I expect that in sql ST_ISValid(geom) = false will list all problematic geometry, or that ST_MakeValid would fix them.
Actual behaviour
Instead, the filtering/fixing an exception is caught.
Steps to reproduce the problem
You can also see the error and the attempts for fixing at stackoverflow
Data Source: https://data.geobasis-bb.de/geobasis/daten/alkis/Vektordaten/shape/
I download data for all counties, unpacked them, loaded them. Did the intersection.
I could neither, count the data, nor make them ST_MakeValid, filter them with ST_IsValid and show.
The show causes the error:
Question. Why does intersection cause this error at the first place? Why can't I just filter it all. Reads as the intersection is a point.
Settings
Using the latest Docker container.
Sedona version = 1.6.1
Apache Spark version = 3.4.1
Python version = 3.10.12
Environment = Docker on local machine