databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
269 stars 66 forks source link

grid_longlatascellid/grid_pointascellid returns h3 hex id that don't intersect with input lon/lat #229

Open damianrakowski-tomtom opened 1 year ago

damianrakowski-tomtom commented 1 year ago

Describe the bug

grid_longlatascellid/grid_pointascellid returns h3 hex id that don't intersect with input point.

To Reproduce

import org.apache.spark.sql.functions.{col} import com.databricks.labs.mosaic.functions.MosaicContext import com.databricks.labs.mosaic.H3 import com.databricks.labs.mosaic.ESRI import org.apache.spark.sql.functions._ import org.apache.spark.sql.{DataFrame, Dataset, SparkSession, Column, Row}

val mosaicContext = MosaicContext.build(H3, ESRI) mosaicContext.register(spark) import mosaicContext.functions._

val data = Seq((-120.65246800000001, 40.420067)) val rdd = spark.sparkContext.parallelize(data) rdd.toDF("lon", "lat") .withColumn("point", st_point(col("lon"), col("lat"))) .withColumn("hexPointId", grid_pointascellid(col("point"), 8)) .withColumn("hexId", grid_longlatascellid(col("lon"), col("lat"), 8)) .withColumn("hexGeom", st_geomfromwkb(grid_boundaryaswkb(col("hexId")))) .withColumn("hexwkt", st_aswkt(col("hexGeom"))) .withColumn("intersects", st_intersects(col("point"), col("hexGeom"))) .withColumn("distance", st_distance(col("point"), st_geomfromwkb(col("hexGeom")))) .drop("lon", "lat", "hexGeom") .show(1, false)

Expected behavior Boundary of hex for a given index returned by grid_longlatascellid/grid_pointascellid should intersect with input coordinates.

Screenshots image

Additional context Databricks 10.4 LTS (Scala 2.12, Spark version 3.2.1) mosaic_0_3_0_jar_with_dependencies.jar

Point is near hex edge.

edurdevic commented 1 year ago

Thanks for reporting this. We are looking into it.

edurdevic commented 1 year ago

Thanks Damian for reporting this. It actually is an interesting issue. We identified the issue, and we also think we have a work around for it. We need some time to bake the workaround in Mosaic and make sure this does not cause any issue with the downstream analysis.

Let me try to explain what is happening here.

The H3 hexagons are built on top of the icosahedron faces, and then the hex vertices are projected on the Earth as a spherical icosahedron using a gnomonic projection. Please note that only the vertices are projected, while the lines between them are only an approximated representation. The gnomonic projection is not conformal, which means that the shape of objects is not perfectly preserved when projected from the icosahedron faces onto the sphere.

So it is possible (and easy) to accurately map a point from the spherical coordinates to the icosahedron and vice-versa. But when projecting a line, you would need to represent the line with an infinite number of points in order to map them precisely from one domain to the other, because the line in one domain is represented by a curve in the other.

When you use grid_latlonascellid, you are taking a point on the sphere, projecting it on the icosahedron, converting it to hex_id, and returning it. So if this point is very close to the hexagon edge, it can happen that the distortion of the hexagon at the sphere level makes it such that the point does not intersect the hexagon geometry on the sphere coordinates.

There is no way to get around the projection issues, but there is a relatively simple way to identify this situation so that this hexagon approximation does not impact any result of a grid_tessellate operation on border chip analysis. We will get back with a code example on this.

edurdevic commented 1 year ago

This problem occurs when using the following approach in point-in-polygon joins

  1. tessellating a polygon
  2. joining with a set of indexed points to the tessellated polygons by cell ID
  3. applying st_contains between point and the chip geometry

In a small percentage of cases, the step #3 can incorrectly classify the point as "outside" the chip, while it should be "inside". The work around that consists in doing a few extra checks for the points that match the index ID but do not intersect the chip. This is the condition to re-classify them as "internal"

When those two conditions are true, we can assume that the point was incorrectly classified.