databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

unexpected behaviour of `st_dump` when MultiPolygon parts contain holes #294

Closed wcjochem closed 1 year ago

wcjochem commented 1 year ago

Describe the bug

I'm trying to explode a MultiPolygon feature (originally a GeoJSON string) into its constituent polygons using `st_dump`. The unexpected behaviour occurs when a MultiPolygon has a part containing a hole. In such a case, the hole is not associated with the correct exploded feature. The exception to this behaviour (which highlights the issue) is that the expected result does occur when the final part of the MultiPolygon is the feature containing a hole. If all parts of the MultiPolygon contain holes, then they all become associated with the final exploded feature.

To Reproduce

Steps to reproduce the behavior:

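# Import and enable mosaic as mos (`spark` and `dbutils` come from the Databricks notebook).
import mosaic as mos
mos.enable_mosaic(spark, dbutils)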
from pyspark.sql import functions as F
import json

# Construct a simple 3-part multipolygon.
# Note that the second polygon (starting at [100, 200]) has a hole.
geojson_dict = {"type":"MultiPolygon",
    "coordinates":[
        [[[5, 5], [0, 0], [10, 0], [5, 5]]],
        [[[100, 200], [100, 100], [200, 100], [200, 200], [100, 200]], [[175, 125], [125, 125], [125, 175], [175, 175], [175, 125]]],
        [[[25, 25], [20, 20], [30, 20], [25, 25]]]
    ]
}

# NOTE: if the second and third features in geojson_dict are swapped, the expected result occurs.

# Create a DataFrame
df = spark.createDataFrame([{'json': json.dumps(geojson_dict)}])

# Convert the GeoJSON string to a geometry, then explode it into its parts.
df = (df
        # Convert from GeoJSON.
        .withColumn('geomGJ', mos.st_geomfromgeojson(F.col('json')))
        # Explode the parts.
        .withColumn('geomDump', mos.st_dump(F.col('geomGJ'))))

# Visually compare the array of `boundary` objects and `holes` in 'geomGJ' with 'geomDump'
# The hole is now associated with the 3rd polygon feature, but this is incorrect. It should be the second.
display(df)

Expected behavior

The resulting data frame (`df` in the above example) should contain 3 polygon features. The second polygon should contain a hole.
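For reference, the expected split can be checked without Mosaic or Spark. The following is only a minimal sketch in plain Python, assuming the GeoJSON convention that the first ring of each polygon is its exterior and any further rings are holes. The helper names `split_multipolygon` and `hole_count` are hypothetical and are not part of the Mosaic API.

```python
import json


def split_multipolygon(geojson_str):
    """Split a GeoJSON MultiPolygon string into a list of Polygon dicts.

    Hypothetical helper for illustration only; not part of Mosaic.
    """
    geom = json.loads(geojson_str)
    if geom.get("type") != "MultiPolygon":
        raise ValueError("expected a MultiPolygon geometry")
    # Each entry of `coordinates` is one polygon: [exterior, hole, hole, ...].
    return [{"type": "Polygon", "coordinates": rings}
            for rings in geom["coordinates"]]


def hole_count(polygon):
    """Number of interior rings (every ring after the first)."""
    return len(polygon["coordinates"]) - 1


# Same 3-part MultiPolygon as in the reproduction steps.
geojson_dict = {
    "type": "MultiPolygon",
    "coordinates": [
        [[[5, 5], [0, 0], [10, 0], [5, 5]]],
        [[[100, 200], [100, 100], [200, 100], [200, 200], [100, 200]],
         [[175, 125], [125, 125], [125, 175], [175, 175], [175, 125]]],
        [[[25, 25], [20, 20], [30, 20], [25, 25]]],
    ],
}

parts = split_multipolygon(json.dumps(geojson_dict))
hole_counts = [hole_count(p) for p in parts]
print(hole_counts)  # [0, 1, 0]: only the second part has a hole
```

Comparing these per-part hole counts with the `holes` arrays in 'geomDump' makes the mismatch easy to spot: the hole should stay with the second part, not move to the third.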

Additional context

Using: Mosaic version 0.3.7, Databricks Runtime 11.3 LTS, Apache Spark 3.3.0, Scala 2.12

edurdevic commented 1 year ago

Thank you for reporting this, @wcjochem. This is indeed a bug when using the ESRI geometry bindings. When constructing a multipolygon with ESRI, we lose the order of the holes.

Please use the JTS binding instead of the default ESRI one:

import mosaic as mos
spark.conf.set("spark.databricks.labs.mosaic.geometry.api", "JTS")
mos.enable_mosaic(spark, dbutils)

wcjochem commented 1 year ago

Thank you for following up and confirming, @edurdevic! Specifying JTS bindings does produce the expected behaviour.