databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
278 stars 66 forks source link

grid_tessellate fails to return results for specific h3 resolution #260

Closed evisser closed 1 year ago

evisser commented 1 year ago

Describe the bug When using the grid_tessellate function to find the h3 cells covering a polygon it sometimes happens that the resulting list is empty (while there are h3 cells overlapping the polygon).

To Reproduce Steps to reproduce the behavior:

%pip install databricks-mosaic=="0.3.4"

import pyspark.sql.functions as F
from mosaic import enable_mosaic
import mosaic as mos

enable_mosaic(spark, dbutils)

df = spark.createDataFrame(data=[{"geometry_wkt": "POLYGON ((5.26 52.72, 5.20 52.71, 5.21 52.75, 5.26 52.75, 5.26 52.72))"}])
df = df.withColumn("tessellate_results_2", mos.grid_tessellate("geometry_wkt", F.lit(2)))
df = df.withColumn("tessellate_results_3", mos.grid_tessellate("geometry_wkt", F.lit(3)))
df = df.withColumn("tessellate_results_4", mos.grid_tessellate("geometry_wkt", F.lit(4)))
df.drop("geometry_wkt").display()

Expected behavior The h3 cells covering the polygon are returned.

Screenshots image

In the screenshot it can be seen that resolutions 2 and 4 give results, but that for resolution 3 the list is empty. For this example the h3 cell with id 831968fffffffff (or 590418571381702655) should have been returned.

Additional context Databricks 10.4 LTS (Scala 2.12, Spark version 3.2.1) mosaic 0.3.4.

tdikland commented 1 year ago

This issues seems to only happen with the ESRI geometry API. Switching to JTS solved this particular issue for me. You can switch the geometry API by setting the spark config spark.databricks.labs.mosaic.geometry.api to JTS.

In your example this would lead to

import pyspark.sql.functions as F
from mosaic import enable_mosaic
import mosaic as mos

spark.conf.set("spark.databricks.labs.mosaic.geometry.api", "JTS")
enable_mosaic(spark, dbutils)

df = spark.createDataFrame(data=[{"geometry_wkt": "POLYGON ((5.26 52.72, 5.20 52.71, 5.21 52.75, 5.26 52.75, 5.26 52.72))"}])
df = df.withColumn("tessellate_results_2", mos.grid_tessellate("geometry_wkt", F.lit(2)))
df = df.withColumn("tessellate_results_3", mos.grid_tessellate("geometry_wkt", F.lit(3)))
df = df.withColumn("tessellate_results_4", mos.grid_tessellate("geometry_wkt", F.lit(4)))
df.drop("geometry_wkt").show(vertical=True)

I hope that this will unblock your use case for now. In the meantime I will investigate why this issue appears when using the ESRI geometry.

evisser commented 1 year ago

Thanks for the reply and the suggestion. While it does work in the example, it unfortunately doesn't work for my use case. I get errors like org.locationtech.jts.geom.TopologyException: found non-noded intersection between LINESTRING ( 4.090640568641794 51.96737983034389, 4.109481737746743 51.958602911756174 ) and LINESTRING ( 4.1094817377467425 51.958602911756174, 4.0906405686418 51.96737983034389 ) [ (4.107125568077056, 51.95970050337878, NaN) ] (the same code ran fine with the ESRI api).