databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
278 stars 66 forks source link

Chips generated in JTS mode are incorrect when input geometry aligns with a grid boundary #309

Closed tompeterken-os closed 1 year ago

tompeterken-os commented 1 year ago

Describe the bug When a Polygon is being chipped by grid_tessellate or grid_tessellateexplode and the input geometry

  1. has at least one segment which aligns with a grid cell boundary, AND
  2. should return a MultiPolygon chip in that grid cell,

the returned chip features only a single Polygon representing just one of the parts that should be returned.

The bug only appears using the JTS bindings. In ESRI mode, a MultiPolygon is returned as expected.

To Reproduce I've generated an example below, illustrating behaviour with two different test input geometries. I've output the chips into a geodataframe to illustrate with matplotlib and show the missing segments.

import mosaic as mos
from pyspark.sql import functions as F
import geopandas as gpd
from shapely import wkt
from matplotlib import pyplot as plt

# Use JTS bindings
spark.conf.set("spark.databricks.labs.mosaic.geometry.api", "JTS")
# Use BNG indexing
spark.conf.set('spark.databricks.labs.mosaic.index.system', 'BNG')
mos.enable_mosaic(spark, dbutils)

# Define some polygon geometries
# Complex: Features a segment along a grid cell boundary (vertices 2-3), expect multipolygon within ST3576
complex_wkt = "POLYGON ((335500 177000, 336000 176700, 336000 176500, 335500 175800, 334800 176500, 334800 175500, 336200 175500, 336200 177000, 335500 177000))"

# Same as above, but offset eastwards by 50 metres so that there's no alignment with grid cell boundary
complex_offset_wkt = "POLYGON ((335550 177000, 336050 176700, 336050 176500, 335550 175800, 334850 176500, 334850 175500, 336250 175500, 336250 177000, 335550 177000))"

for test_wkt in [complex_wkt, complex_offset_wkt, simple_wkt]:

  # Create a DataFrame
  df = (
    spark.createDataFrame([{'feature_wkt': test_wkt}])

    # Explode mosaic chips
    .withColumn(
      'mosaic_chip',
      mos.grid_tessellateexplode(F.col('feature_wkt'), F.lit('1km'), F.lit(True))
    )

    # Extract chip geometry to wkt
    .withColumn(
      'chip_geometry',
      mos.st_astext(mos.st_geomfromwkb('mosaic_chip.wkb'))
    )
  )

  # Create a pandas dataframe for the original and the chip geometries
  pdf_original = df.select('feature_wkt').dropDuplicates().toPandas()
  pdf_chips = df.select('chip_geometry', F.col('mosaic_chip')['index_id'].alias('bng')).toPandas()

  # Create shapely geometries from WKT
  pdf_original['geometry'] = pdf_original['feature_wkt'].apply(wkt.loads)
  pdf_chips['geometry'] = pdf_chips['chip_geometry'].apply(wkt.loads)

  # Create geodataframes
  gdf_original = gpd.GeoDataFrame(pdf_original, crs='epsg:27700')
  gdf_chips = gpd.GeoDataFrame(pdf_chips, crs='epsg:27700')

  # Create figure with side-by-side axes
  f, axs = plt.subplots(figsize=(7,4), ncols=2, nrows=1, sharex=True, sharey=True, gridspec_kw={'wspace':0, 'hspace':0})
  axs[0].set_title('Input geometry')
  axs[1].set_title('Chip geometries')

  # Add grid cell lines
  [ax.axhline(y=176000, linestyle=':', color='k') for ax in axs]
  [ax.axhline(y=177000, linestyle=':', color='k') for ax in axs]
  [ax.axvline(x=335000, linestyle=':', color='k') for ax in axs]
  [ax.axvline(x=336000, linestyle=':', color='k') for ax in axs]

  # Add original geometry to left axes
  gdf_original.plot(ax=axs[0], edgecolor='k', facecolor='b', alpha=0.6, zorder=1)

  # Add chip geometries to right axes, with original as outline
  gdf_chips.plot(ax=axs[1], edgecolor='k', facecolor='y', alpha=0.6, zorder=5)
  gdf_original.plot(ax=axs[1], edgecolor='b', facecolor='none', zorder=1)

The geometries get chipped as shown: Firstly, the geometry with a segment aligning to the grid cell boundary. Two triangle parts are missing from the chip geometries. image

Secondly, the same geometry but shifted so that the boundary does not align with the grid cell boundary. No parts are missing. image

Expected behavior The union of chip geometries should always have equal coverage to the input geometries, regardless of whether the input geometry has a segment aligning to a grid cell boundary.

Additional context I'm using mosaic 0.3.7 with Databricks Runtime 11.3 LTS ML, Apache Spark 3.3.0, Scala 2.12. I'm using BNG indexing.

milos-colic commented 1 year ago

@tompeterken-os Thank you for reporting this. We will prioritise the fix for the next release.

milos-colic commented 1 year ago

@tompeterken-os the problem was due to the fact intersection via geometry overlay in JTS returns GeometryCollection instead of MultiPolygon (expected) and we flatten this case to a single polygon, rather than to MultiPolygon. I will open a PR now that fixes this problem.

tompeterken-os commented 1 year ago

Just got round to checking this again. Fixed in version 0.3.10, thank you @milos-colic and team!