databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

`grid_tessellate` SQL function returns empty chips #373

Open CICDamen opened 1 year ago

CICDamen commented 1 year ago

Describe the bug

When using the grid_tessellate function in SQL on valid geometry, it returns empty chips.

To Reproduce

Steps to reproduce the behavior:

  1. Register SQL functions from Mosaic using this manual
  2. Use the grid_tessellate function on valid geometry data to create a new column
  3. Check the chips list of this new column, which is empty

Expected behavior

I would expect the chips to always contain data and not be empty.

Screenshots

select
  st_setsrid(st_geomfromwkb(geometrie), 28992) as geom,
  st_isvalid(st_setsrid(st_geomfromwkb(geometrie), 28992)) as is_valid,
  grid_tessellate(st_setsrid(st_geomfromwkb(geometrie), 28992), 2) as idx
from table

image

Additional context

I've tried switching the following Spark config settings on/off (both in the notebook and in the cluster during startup):

spark.databricks.mosaic.geometry.api ESRI
spark.databricks.labs.mosaic.index.system CUSTOM(0, 310000, 280000, 640000, 2, 2000, 2000)

I can imagine that the custom index system is important, but I am unsure where and when to set it to make sure the correct grid is used.
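To make the grid setting above easier to read, here is a minimal sketch that parses the `CUSTOM(...)` string into named fields. It assumes the argument order documented for Mosaic's custom grid (xMin, xMax, yMin, yMax, splits, rootCellSizeX, rootCellSizeY) and assumes that each resolution step divides the root cell size by `splits` along each axis. `CustomGrid` and `parse_custom_grid` are hypothetical helper names, not part of Mosaic.

```python
import re
from typing import NamedTuple, Tuple


class CustomGrid(NamedTuple):
    """Named view of a CUSTOM(...) index system string (hypothetical helper)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    splits: int
    root_cell_size_x: float
    root_cell_size_y: float

    def extent(self) -> Tuple[float, float]:
        # Width and height of the area covered by the grid, in CRS units.
        return (self.x_max - self.x_min, self.y_max - self.y_min)

    def cell_size(self, resolution: int) -> Tuple[float, float]:
        # Assumption: every resolution step splits a cell `splits` times per axis.
        factor = self.splits ** resolution
        return (self.root_cell_size_x / factor, self.root_cell_size_y / factor)


def parse_custom_grid(spec: str) -> CustomGrid:
    """Parse 'CUSTOM(a, b, c, d, e, f, g)' into a CustomGrid."""
    match = re.fullmatch(r"\s*CUSTOM\((.*)\)\s*", spec)
    if match is None:
        raise ValueError(f"not a CUSTOM(...) spec: {spec!r}")
    parts = [p.strip() for p in match.group(1).split(",")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 arguments, got {len(parts)}")
    x_min, x_max, y_min, y_max = (float(p) for p in parts[:4])
    return CustomGrid(x_min, x_max, y_min, y_max,
                      int(parts[4]), float(parts[5]), float(parts[6]))


# The grid string quoted in this issue.
grid = parse_custom_grid("CUSTOM(0, 310000, 280000, 640000, 2, 2000, 2000)")
print(grid)
print("extent:", grid.extent())
print("cell size at resolution 2:", grid.cell_size(2))
```

Under those assumptions, the resolution `2` passed to grid_tessellate in the query above would correspond to cells of 500 by 500 CRS units, which is a very different scale from H3 resolution 2.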

Other SQL functions like st_centroid, st_setsrid, st_geomfromwkb, st_aswkb work as expected.

milos-colic commented 1 year ago

Hi @casperdamen123, thank you for reporting the issue. Could you please confirm which CRS your geometries are in, so that we can reproduce the problem in tests and produce a fix?

CICDamen commented 1 year ago

Hi @milos-colic, you're welcome, thanks for investigating!

The geometries are in https://epsg.io/28992 and we set this using st_setsrid(st_geomfromwkb(geometrie), 28992)

milos-colic commented 1 year ago

@casperdamen123 thanks for confirming this. H3 is only aware of geometries in 4326 and silently fails (returns an empty set) if geometries aren't in 4326. We plan to add support for handling this automatically in future releases (ETA one or two versions). However, at the moment you'd need to transform geometries to 4326 for tessellation. You can use https://databrickslabs.github.io/mosaic/api/spatial-functions.html#st-transform to do so. I will open a ticket on our internal dev JIRA for adding support for this.
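As a rough illustration of that workaround, the sketch below only builds the SQL text, so it can be inspected without a cluster. It assumes the Mosaic SQL functions are registered and that `st_transform(geom, srid)` accepts the output of `st_setsrid(...)` as used in the original query. The helper name `tessellate_query_4326` and the table name `my_table` are hypothetical.

```python
def tessellate_query_4326(table: str, wkb_column: str,
                          source_srid: int, resolution: int) -> str:
    """Build a query that reprojects to EPSG:4326 before tessellating.

    Hypothetical helper: it returns SQL text and does not execute anything.
    """
    # Tag the geometry with its source SRID, as in the original query.
    geom = f"st_setsrid(st_geomfromwkb({wkb_column}), {source_srid})"
    # Reproject to 4326 so the coordinates fall inside the H3 domain.
    geom_4326 = f"st_transform({geom}, 4326)"
    return (
        "select\n"
        f"  {geom_4326} as geom_4326,\n"
        f"  grid_tessellate({geom_4326}, {resolution}) as idx\n"
        f"from {table}"
    )


query = tessellate_query_4326("my_table", "geometrie", 28992, 2)
print(query)
```

On a cluster, the resulting string could be passed to `spark.sql(query)`. Note that a resolution value means something different on H3 than on a custom grid, so the `2` used here would likely need to change.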

CICDamen commented 1 year ago

@milos-colic, thanks for your response.

I'm not sure I fully understand. Does this fix then only relate to the SQL bindings?

When using the grid_tessellate function in PySpark, I do get valid chips returned, even when I'm using another CRS.

See below example:

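import mosaic as mos
from pyspark.sql import functions as F

# dataf is an existing DataFrame with a WKB column named "geometrie"
(dataf
    .withColumn("geom", mos.st_setsrid(mos.st_geomfromwkb(F.col("geometrie")), F.lit(28992)))
    .withColumn("idx", mos.grid_tessellate(F.col("geom"), F.lit(2)))
).display()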

image

Could this maybe be related to the order of registering the SQL functions as UDFs relative to the setting of the custom grid?

milos-colic commented 1 year ago

@casperdamen123 Apologies for the delay in coming back to you on this.

Could you confirm that the coordinates of the geometries that return non-empty chips are larger than 180/90? It may happen that near the origin the values are valid with respect to the H3 domain.

For other CRSs, please use Custom Grid instead of H3. H3 is only intended for 4326; in other CRSs it will only produce chips for the part that fits in the -180/180 and -90/90 bounding box.
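The bounding-box point can be checked with a few lines of plain Python. This is only a sketch of the domain test described above, not Mosaic's or H3's actual implementation. `in_h3_domain` is a hypothetical helper, and the sample coordinates are illustrative: the EPSG:28992 pair is the projection's false origin near Amersfoort, and the lon/lat pair is an approximate equivalent.

```python
def in_h3_domain(x: float, y: float) -> bool:
    """Return True if (x, y) could be read as a lon/lat pair in degrees."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# Illustrative points only, not taken from the reporter's data.
rd_new_point = (155000.0, 463000.0)   # EPSG:28992, metres
lon_lat_point = (5.387, 52.155)       # EPSG:4326, degrees (approximate)

for name, (x, y) in [("EPSG:28992", rd_new_point), ("EPSG:4326", lon_lat_point)]:
    print(name, (x, y), "inside H3 domain:", in_h3_domain(x, y))
```

Coordinates in metres for the Netherlands fall far outside this box, which is consistent with H3 returning no chips for them.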

Custom grid docs: https://databrickslabs.github.io/mosaic/api/spatial-indexing.html

CICDamen commented 1 year ago

@milos-colic Thanks for coming back on this issue!

Actually, when using the SQL bindings, all chips that are returned are empty.

As mentioned in one of my earlier comments, we are using the custom grid below:

spark.databricks.labs.mosaic.index.system CUSTOM(0, 310000, 280000, 640000, 2, 2000, 2000)

Could this maybe be related to the order of registering the SQL functions as UDFs relative to the setting of the custom grid?

To be clear, the same functionality does work with the above grid setting when using the Python bindings. It would be really nice if we could leverage the tessellation SQL bindings, so that we can also store the logic in views, for example.