databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
278 stars 67 forks source link

`grid_tessalateexplode` produces geometries larger than the parent H3 cell near the antimeridian #371

Open willbowditch opened 1 year ago

willbowditch commented 1 year ago

Describe the bug

If a geometry crosses the antimeridian the geometries created by grid_tessalateexplode can be larger than the h3 cell which should contain them.

To Reproduce

import mosaic as M
import pyspark.sql.functions as F

M.enable_mosaic(spark, dbutils)

df = spark.createDataFrame(
    [{'geometry': "MULTIPOLYGON (((180 0, 179 0, 179 1, 180 1, 180 0)), ((-180 1, -179 1, -179 0, -180 0, -180 1)))"}]
)

df.select("*", M.st_area("geometry").alias("area")).createOrReplaceTempView("_example_input")

df = (
    df.select(M.grid_tessellateexplode("geometry", F.lit(6)))
    .select("index.*", M.st_area("wkb").alias("area"))
    .sort(F.col("area").desc())
)

df.limit(10).createOrReplaceTempView("_example_output")
df.show(25)

The area of non is_core is larger than is_core cells:

+-------+------------------+--------------------+--------------------+
|is_core|          index_id|                 wkb|                area|
+-------+------------------+--------------------+--------------------+
|  false|605711146439147519|[01 06 00 00 00 0...| 0.10342928257476292|
|  false|605711146976018431|[01 06 00 00 00 0...| 0.10338067061745346|
|  false|605711150868332543|[01 06 00 00 00 0...|  0.1033319174593485|
|  false|605711150465679359|[01 06 00 00 00 0...| 0.10328302356726646|
|  false|605711151002550271|[01 06 00 00 00 0...| 0.10323398940885471|
|  false|605711182812151807|[01 06 00 00 00 0...| 0.10318481545210978|
|  false|605711183349022719|[01 06 00 00 00 0...| 0.10313550216586133|
|  false|605711180798885887|[01 06 00 00 00 0...| 0.10308605001950576|
|  false|605711181335756799|[01 06 00 00 00 0...| 0.10303645948296837|
|  false|605711185228070911|[01 06 00 00 00 0...| 0.10298673102677067|
|  false|605711184825417727|[01 06 00 00 00 0...| 0.10293686512208705|
|  false|605711185362288639|[01 06 00 00 00 0...| 0.10288686224053886|
|  false|605711431920254975|[01 06 00 00 00 0...| 0.10283672285444809|
|  false|605711432457125887|[01 06 00 00 00 0...| 0.10278644743654258|
|  false|605711429906989055|[01 06 00 00 00 0...| 0.10273603646015625|
|  false|605711430443859967|[01 06 00 00 00 0...| 0.10268549039916726|
|  false|605711434336174079|[01 06 00 00 00 0...| 0.08283312925980345|
|  false|605711148989284351|[01 06 00 00 00 0...| 0.07513165549629827|
|  false|605711433933520895|[01 06 00 00 00 0...|0.050679314937415794|
|  false|605711434201956351|[01 06 00 00 00 0...|0.049993939478699206|
|  false|605711434738827263|[01 06 00 00 00 0...| 0.03515651011204795|
|  false|605711434470391807|[01 06 00 00 00 0...|0.012316107109272834|
|   true|605711247102443519|[01 03 00 00 00 0...| 0.00226359823109962|
|   true|605711247236661247|[01 03 00 00 00 0...|0.002261944180600103|
|   true|605711247639314431|[01 03 00 00 00 0...|0.002261936677298519|
+-------+------------------+--------------------+--------------------+
only showing top 25 rows

Expected behavior

The geometries should be contained within the H3 cell

Screenshots

The input geometry looks like this:

image

The output of the geometries which are larger than expected look like this:

image

Additional context

An example real-world polygon where this is a problem are geo-boundaries for Fiji, see:

https://media.githubusercontent.com/media/wmgeolab/geoBoundaries/a6ef6576d347e3885a3ec4891b95b26e76668cbe/releaseData/gbOpen/FJI/ADM0/geoBoundaries-FJI-ADM0-all.zip

thomas-maschler commented 1 year ago

The boundaries returned from H3 are not AntiMeridian safe and introduce the distortion we observe here.

df = spark.sql("SELECT st_astext(h3_boundaryaswkb(h3_longlatash3(-180,0,4))) as geometry")
df.display()

# POLYGON ((179.851079918 0.376375409, -179.955309354 0.347497759, -179.860401436 0.149076599, -179.959431042 -0.019774498, 179.847305581 0.009690979, 179.752724995 0.207421233, 179.851079918 0.376375409))

If you wrap the cell in carto.ST_AntimeridianSafeGeom it will return a Multipolygon split at the antimeridian

df = spark.sql("SELECT carto.st_astext(carto.ST_AntimeridianSafeGeom(carto.st_geomfromwkb(st_asbinary(h3_boundaryaswkb(h3_longlatash3(-180,0,4)))))) as geometry")
df.display()

# MULTIPOLYGON (((-180 -0.01359, -180 0.35417, -179.95531 0.3475, -179.8604 0.14908, -179.95943 -0.01977, -180 -0.01359)), ((180 0.35417, 180 -0.01359, 179.84731 0.00969, 179.75272 0.20742, 179.85108 0.37638, 180 0.35417)))

grid_tessellateexplode and grid_tessellate use the unsafe geometries when computing the intersection and hence generate corrupted output geometries.

@edurdevic, there is a straightforward implementation of the ST_antimeridianSafeGeom from Geomesa using JTS and spatial4j. How far do you require feature parity between JTS and ESRI geometry API? Would you be open to supporting this for JTS only? Happy to make a PR for this.

milos-colic commented 1 year ago

@thomas-maschler JTS only support for ST_antimeridianSafeGeom is perfectly fine, we are planning to standardise around a single engine going forward (JTS). I will be updating our external contribution process today, going forward it will be possible to accept external PRs. I will provide a link to the README here once it is live.

thomas-maschler commented 1 year ago

Excellent! I was wondering about the geometry API. Going with JTS should make things much easier and provides ways to integrate with other OS libraries. st_antimeridiansafe is not pole save and it would still require some custom logic, but it should simplify the PR by a lot.