apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.83k stars 654 forks source link

st_union - understanding difference with PostGIS #1514

Closed JimShady closed 2 weeks ago

JimShady commented 3 weeks ago

I am trying to move a process which I currently run in PostGIS, to within the cloud and use Apache Sedona instead. In my PostgreSQL + PostGIS database this runs:

WITH a_table AS (
SELECT st_boundary(geom) geom
FROM   emea_eu_seu.albania_prefectures_2021
UNION ALL
SELECT st_boundary(geom) geom
FROM   emea_eu_seu.albania_municipalities_2021
UNION ALL
SELECT st_boundary(geom) geom
FROM   emea_eu_seu.albania_2_digit_postcode_areas_2021
),
b_table AS (
SELECT   geom
FROM     a_table
GROUP BY geom
)
SELECT   ST_Union(geom) geom
FROM     b_table

The result is one row:

image

I have tried to run the same process using Sedona (code below) but get an error:

WITH a_table AS (
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_prefectures_2021
UNION ALL
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_municipalities_2021
UNION ALL
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_2_digit_postcode_areas_2021
),
b_table AS (
SELECT   geom
FROM     a_table
GROUP BY geom
)
SELECT   ST_Union(geom) geom
FROM     b_table
AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "st_union(geom)" due to data type mismatch: Parameter 1 requires the "ARRAY<BINARY>" type, however "geom" has the type "BINARY".; line 16 pos 9;
'WithCTE

I am using Sedona 1.6.0 by the way. I read "Variant 2 : As of version 1.6.0, this function accepts an array of Geometry objects and returns the geometric union of all geometries in the input array." so was expecting a similar result to that which PostGIS returns. Am I misunderstanding something?

Thanks!

furqaankhan commented 3 weeks ago

I think you would want to use ST_Union_Aggr. As the one parameter option of ST_Union expected an Array of geometries. It can be confusing as PostGIS has ST_Union and ST_Union_Aggr as one function.

JimShady commented 3 weeks ago

Isn't the idea that from 1.6.0 onwards, the behavior is the same?

image

furqaankhan commented 2 weeks ago

The variant 2 of ST_Union accepts an array of geometries ( [geom1, geom2] ) in each row. What you want to achieve from my understanding is to do ST_Union of all the rows in the specified column, for that we have ST_Union_Aggr. In PostGIS when you are invoking ST_Union after group by, you are using the fourth variant not the third.

JimShady commented 2 weeks ago

Ok, I'll give that a go and report back. :-)

JimShady commented 2 weeks ago

Any ideas on how to work around this? The same query finishes in 3 seconds on my fairly modest PostgreSQL server!

%sql

WITH a_table AS (
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_prefectures_2021
UNION ALL
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_municipalities_2021
UNION ALL
SELECT st_boundary(st_geomfromwkb(geom_wkb)) geom
FROM   hive_metastore.gfk_bronze.albania_2_digit_postcode_areas_2021
),
b_table AS (
SELECT   geom
FROM     a_table
GROUP BY geom
)
SELECT   ST_Union_Aggr(geom) geom
FROM     b_table
java.lang.Exception: Results too large
    at com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:458)
    at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:279)
JimShady commented 2 weeks ago

This was because the browser could not display the results. Writing to file it was fine.