apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

ST_Difference function crashes while working on some geometries #1264

Closed ruanqizhen closed 6 months ago

ruanqizhen commented 8 months ago

Behavior

I'm processing some OSM geometries and encountered an exception in both the ST_Difference and ST_SymDifference functions.

Steps to reproduce the problem

Run this query:

spark.sql(
    """
    SELECT
       ST_Difference(
           ST_GeomFromText('MULTIPOLYGON(((-71.0978928 42.3462476,-71.097854 42.34660099999999,-71.0977237 42.34672280000001,-71.0977251 42.34703010000001,-71.0970225 42.34703010000001,-71.0966997 42.346889799999985,-71.0964544 42.3464166,-71.0964974 42.3463562,-71.0966927 42.346306,-71.0969533 42.34629069999997,-71.0973551 42.346184100000016,-71.0977681 42.34614830000001,-71.0978041 42.34615439999999,-71.0978373 42.3461657,-71.0978595 42.34617900000001,-71.0978775 42.34619950000001,-71.0978858 42.3462231,-71.0978928 42.3462476),(-71.0977082 42.3462332,-71.0974766 42.34623529999999,-71.0974664 42.34622680000001,-71.0974562 42.3462193,-71.0974452 42.34621599999997,-71.0974288 42.3462136,-71.0974068 42.346215400000006,-71.0973766 42.34621959999998,-71.0973448 42.34622680000001,-71.0973269 42.34623500000001,-71.0972661 42.3462356,-71.0972661 42.346246699999995,-71.0972995 42.346247300000016,-71.0972881 42.3462639,-71.0972816 42.346278100000006,-71.0972726 42.34629860000004,-71.0972661 42.34632870000004,-71.0972628 42.34635320000001,-71.0972624 42.3463797,-71.0972657 42.346407699999986,-71.0972787 42.34644,-71.0972896 42.3464596,-71.0973031 42.346484000000004,-71.0973218 42.34650779999998,-71.0973426 42.346527400000014,-71.0973644 42.34654550000002,-71.0973946 42.34656509999999,-71.0974268 42.346582299999994,-71.0974619 42.34659640000001,-71.097497 42.34660820000002,-71.0975349 42.346615999999955,-71.0975623 42.3466181,-71.0976006 42.34661959999997,-71.097641 42.346619000000004,-71.0976614 42.3466181,-71.0976981 42.3466124,-71.0977307 42.3466052,-71.0977621 42.346595500000035,-71.0977707 42.346592799999996,-71.0977735 42.34661840000001,-71.0977899 42.3466181,-71.0977878 42.34656899999999,-71.0978009 42.34655360000002,-71.0978074 42.34653829999999,-71.0978139 42.34651109999996,-71.0978151 42.3464912,-71.0978033 42.3464692,-71.0977878 42.34645810000001,-71.0977833 42.346288199999975,-71.0977968 42.34628480000001,-71.0978094 42.346279100000004,-71.0978213 42.34626920000002,-71.0978298 42.34625320000001,-71.0978311 42.34623989999997,-71.097827 42.346225799999985,-71.0978192 42.34621580000001,-71.097807 42.3462059,-71.0977911 42.34619950000001,-71.0977748 42.34619620000001,-71.0977564 42.3461968,-71.0977417 42.34620010000003,-71.0977283 42.34620619999998,-71.0977197 42.34621280000002,-71.0977111 42.346221499999984,-71.0977082 42.3462332),(-71.0978448 42.346496900000005,-71.0978422 42.34644080000001,-71.0978257 42.34644119999999,-71.0978283 42.34649730000001,-71.0978448 42.346496900000005),(-71.0975012 42.3462035,-71.0975011 42.34619219999999,-71.0974257 42.34619270000002,-71.0974259 42.34620399999997,-71.0975012 42.3462035)),((-71.0956691 42.346659200000005,-71.095776 42.34663069999999,-71.0957691 42.3466133,-71.095754 42.3465789,-71.0957458 42.34656380000001,-71.0956338 42.3465922,-71.0954915 42.34633199999999,-71.0956146 42.34621469999999,-71.095645 42.3461858,-71.0958218 42.346544799999975,-71.0958794 42.3465291,-71.0959138 42.34651969999999,-71.0958505 42.34639350000003,-71.0958936 42.34632269999997,-71.0959829 42.34617589999996,-71.096042 42.34607869999999,-71.098156 42.3455084,-71.0981646 42.345526500000034,-71.0981776 42.34555209999999,-71.0982018 42.3455979,-71.0982132 42.34562059999999,-71.0982492 42.345692299999996,-71.0982652 42.345724099999984,-71.0985685 42.346323499999954,-71.0988082 42.34680320000001,-71.0987779 42.346853399999986,-71.0985196 42.34709509999999,-71.0983085 42.3470973,-71.0980514 42.34709960000001,-71.0980515 42.3471198,-71.0978489 42.34712179999997,-71.0978267 
42.34712059999998,-71.097826 42.34714120000001,-71.0978255 42.3471557,-71.0977857 42.347155799999996,-71.0968676 42.34715829999999,-71.096866 42.3471251,-71.0967996 42.34712490000001,-71.0964846 42.347123899999985,-71.0964634 42.34712339999999,-71.0964618 42.34710089999999,-71.0960882 42.3470978,-71.0958558 42.34661220000001,-71.0956691 42.346659200000005),(-71.0965049 42.346867299999985,-71.0964854 42.34689119999999,-71.0969005 42.3470643,-71.0969397 42.347100100000006,-71.0977894 42.347098500000016,-71.0977862 42.34674050000001,-71.0979186 42.346618500000005,-71.0979231 42.34657709999999,-71.0979389 42.346431800000005,-71.0979593 42.346230899999995,-71.0979322 42.34617269999998,-71.0978613 42.346124300000014,-71.0977835 42.3461083,-71.0975181 42.34612390000001,-71.0973236 42.34614139999999,-71.0971148 42.34620189999998,-71.0971038 42.34619079999999,-71.09691 42.34624529999999,-71.0966576 42.346251300000006,-71.096472 42.34630279999999,-71.0964351 42.34631490000004,-71.0964176 42.3463247,-71.0964021 42.346337199999965,-71.0963898 42.346351999999996,-71.0963813 42.34636710000004,-71.0963767 42.346381399999984,-71.0963751 42.34640559999997,-71.0963833 42.34643109999999,-71.0963165 42.34645019999999,-71.0963025 42.346454100000045,-71.096334 42.346508099999994,-71.0965049 42.346867299999985),(-71.0978244 42.345882399999994,-71.0978319 42.34582399999999,-71.0978439 42.3457607,-71.0971974 42.3458153,-71.0971806 42.34578429999999,-71.0968599 42.34587120000003,-71.0968806 42.345911,-71.0969008 42.34594959999998,-71.0969132 42.34597339999999,-71.0970628 42.3459335,-71.0972002 42.34591080000001,-71.0976577 42.3458746,-71.0978244 42.345882399999994),(-71.0985254 42.3462691,-71.0984631 42.346283400000004,-71.0984151 42.34665789999997,-71.0985285 42.34666390000004,-71.0987011 42.346617099999975,-71.0985254 42.3462691)),((-71.0964031 42.3464687,-71.0964111 42.346484000000004,-71.0964611 42.34658490000004,-71.0963922 42.3466037,-71.0963346 42.3464869,-71.0964031 42.3464687)),((-71.0957869 42.3461475,-71.095645 42.3461858,-71.0956805 42.3461762,-71.0957869 42.3461475)))'),

           ST_GeomFromText('MULTIPOLYGON(((-71.0958218 42.346544799999975,-71.095645 42.3461858,-71.0956805 42.3461762,-71.096042 42.34607869999999,-71.0959829 42.34617589999996,-71.0958936 42.34632269999997,-71.0958505 42.34639350000003,-71.0959138 42.34651969999999,-71.0958794 42.3465291,-71.0958218 42.346544799999975)),((-71.0958218 42.346544799999975,-71.0957458 42.34656380000001,-71.0956338 42.3465922,-71.0954915 42.34633199999999,-71.0956146 42.34621469999999,-71.095645 42.3461858,-71.0958218 42.346544799999975)))')
       ) AS geom
    """
).show(truncate = False)

Stacktrace:

File "<stdin>", line 1, in <module>
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 423, in show
    print(self._jdf.showString(n, int_truncate, vertical))
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o55.showString.
: org.locationtech.jts.geom.TopologyException: Directed Edge visited twice during ring-building at (-71.0958218, 42.346544799999975, NaN)
    at org.locationtech.jts.geomgraph.EdgeRing.computePoints(EdgeRing.java:126)
    at org.locationtech.jts.geomgraph.EdgeRing.<init>(EdgeRing.java:51)
    at org.locationtech.jts.operation.overlay.MaximalEdgeRing.<init>(MaximalEdgeRing.java:46)
    at org.locationtech.jts.operation.overlay.PolygonBuilder.buildMaximalEdgeRings(PolygonBuilder.java:93)
    at org.locationtech.jts.operation.overlay.PolygonBuilder.add(PolygonBuilder.java:67)
    at org.locationtech.jts.operation.overlay.PolygonBuilder.add(PolygonBuilder.java:56)
    at org.locationtech.jts.operation.overlay.OverlayOp.computeOverlay(OverlayOp.java:248)
    at org.locationtech.jts.operation.overlay.OverlayOp.getResultGeometry(OverlayOp.java:181)
    at org.locationtech.jts.operation.overlay.OverlayOp.overlayOp(OverlayOp.java:84)
    at org.locationtech.jts.operation.overlay.snap.SnapIfNeededOverlayOp.getResultGeometry(SnapIfNeededOverlayOp.java:75)
    at org.locationtech.jts.operation.overlay.snap.SnapIfNeededOverlayOp.overlayOp(SnapIfNeededOverlayOp.java:37)
    at org.locationtech.jts.geom.GeometryOverlay.overlay(GeometryOverlay.java:76)
    at org.locationtech.jts.geom.GeometryOverlay.difference(GeometryOverlay.java:89)
    at org.locationtech.jts.geom.Geometry.difference(Geometry.java:1391)
    at org.apache.sedona.common.Functions.difference(Functions.java:551)
    at org.apache.spark.sql.sedona_sql.expressions.ST_Difference$$anonfun$$lessinit$greater$60.apply(Functions.scala:727)
    at org.apache.spark.sql.sedona_sql.expressions.ST_Difference$$anonfun$$lessinit$greater$60.apply(Functions.scala:727)
    at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction2$2(InferrableFunctionConverter.scala:53)
    at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.eval(InferredExpression.scala:70)
    at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:483)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:69)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:57)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:524)
    at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1150)
    at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1149)
    at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:473)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:524)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.immutable.List.map(List.scala:305)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:362)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1.applyOrElse(expressions.scala:57)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1.applyOrElse(expressions.scala:55)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformWithPruning(TreeNode.scala:485)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.apply(expressions.scala:55)
    at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.apply(expressions.scala:45)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:215)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:212)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:284)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:327)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:284)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:274)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:274)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:188)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:146)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:224)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:224)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:142)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:138)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:306)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:606)
    at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:306)
    at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:323)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:277)
    at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:256)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:102)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2769)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2976)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:289)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:328)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Settings

Sedona version = 1.5.0

Apache Spark version = 3.2.1

Apache Flink version = ?

API type = Python

Scala version =

JRE version =

Python version = 3.10

Environment = AWS Athena

jiayuasu commented 8 months ago

@ruanqizhen The polygon is invalid for the Difference operation because it has nested holes.

(attached image)
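
For reference, the same class of invalidity can be checked with a small self-contained query; the toy polygon below (not the OSM geometry from the report) has its second hole nested inside the first, which ST_IsValid should flag:

spark.sql(
    """
    SELECT ST_IsValid(
        ST_GeomFromText(
            'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),
                     (2 2, 8 2, 8 8, 2 8, 2 2),
                     (4 4, 6 4, 6 6, 4 6, 4 4))'
        )
    ) AS is_valid  -- expected false: the third ring lies inside the second (nested holes)
    """
).show()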

You can use ST_ExteriorRing (https://sedona.apache.org/1.5.1/api/sql/Function/#st_exteriorring) to work with only the exterior ring.

A similar issue has been reported in the JTS repo: https://github.com/locationtech/jts/issues/303
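
To sketch what that workaround might look like in the same style as the query above (toy geometries again, and assuming ST_GeometryN with 0-based indexing, ST_ExteriorRing, and ST_MakePolygon are all available in the Sedona version at hand): rebuild a hole-free shell from the exterior ring before taking the difference.

spark.sql(
    """
    SELECT
       ST_Difference(
           -- rebuild the first polygon of the multipolygon from its exterior ring only,
           -- dropping every hole before the overlay runs
           ST_MakePolygon(
               ST_ExteriorRing(
                   ST_GeometryN(
                       ST_GeomFromText('MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0),
                                                      (2 2, 8 2, 8 8, 2 8, 2 2),
                                                      (4 4, 6 4, 6 6, 4 6, 4 4)))'),
                       0))),
           ST_GeomFromText('POLYGON((5 -5, 15 -5, 15 5, 5 5, 5 -5))')
       ) AS geom
    """
).show(truncate=False)

Whether dropping the holes is acceptable is, of course, use-case dependent.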

ruanqizhen commented 8 months ago

Hi Jia, thank you for the information.
I need to keep the nested holes. Is there a way to perform the "difference" operation while preserving them?

jiayuasu commented 8 months ago

@ruanqizhen I don't think it is possible to run the difference operation on a geometry like that. You have to either use the exterior ring or take those nested holes out as individual polygons.
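
One way to read "take those nested holes out as individual polygons", sketched once more on the toy nested-holes polygon rather than the OSM data (ST_InteriorRingN indexing is assumed to be 0-based, and availability of ST_MakePolygon / ST_InteriorRingN in the installed Sedona version is an assumption): rebuild the shell without holes, turn each interior ring into its own polygon, and run the difference on each piece separately.

spark.sql(
    """
    WITH src AS (
        SELECT
            ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),
                                     (2 2, 8 2, 8 8, 2 8, 2 2),
                                     (4 4, 6 4, 6 6, 4 6, 4 4))') AS p,
            ST_GeomFromText('POLYGON((5 -5, 15 -5, 15 5, 5 5, 5 -5))') AS b
    )
    SELECT
        ST_Difference(ST_MakePolygon(ST_ExteriorRing(p)), b)     AS shell_diff,      -- shell with holes dropped
        ST_Difference(ST_MakePolygon(ST_InteriorRingN(p, 0)), b) AS outer_hole_diff, -- first hole as its own polygon
        ST_Difference(ST_MakePolygon(ST_InteriorRingN(p, 1)), b) AS inner_hole_diff  -- nested hole as its own polygon
    FROM src
    """
).show(truncate=False)

The resulting pieces can then be recombined downstream however the application needs them.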