apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.96k stars 695 forks source link

Spatial Predicates in Sedona Python #886

Closed robertnagy1 closed 11 months ago

robertnagy1 commented 1 year ago

Hi,

I see that in the PR [SEDONA-177] spatial predicates were implemented on the RDD level. I cannot somehow find this in the python libraries. Are they implemented for joins on SRDDs?

I am trying to find overlapping polygons within the same shapefile which has about 4 million features. What i would normally do is either:

First query runs for 15 minutes, second query doesn't run at all cause it is correlated query, and it is not allowed in Spark. So i wonder how much less time would it take to check the overlaps directly on SRDDs, rather than running SQL query on dataframes? I find that this take particularly long time given that i run it on 8 workers with 4 cores each.

Do saving the files in a different format like delta lake, or geoparquet speed anything up in particular?

jiayuasu commented 1 year ago
  1. Saving results to GeoParquet / DeltaLake won't speed up the join speed.
  2. RDD based spatial join is available in Sedona Python: https://sedona.apache.org/1.4.1/tutorial/rdd/#write-a-spatial-join-query
  3. I would still recommend SQL based join. But you might want to tune its performance. See below:

Sedona optimizes two types of spatial joins.

Regular inner join. It does not optimize regular left join, cross join, etc. Broadcast join. All types of broadcast joins: inner, left, cross, etc.

Sedona's inner join algorithm is designed to mitigate the impact of spatial data skewness. However, its performance is affected by the number of partitions and the dominant side of the join.

Improve Sedona's performance if it is slow:

(1) try to increase the number of partitions in your two input DataFrame. For example, df = df.repartition(1000) (2) Try to switch the sides of spatial joins, this might improve the join performance

Rule of thumb:

The spatial partitioning grids (which directly affects the load balance of the workloads) should be built on the larger dataset in a spatial join. We call this dataset the dominant dataset.

If you use ST_Intersects:

In Sedona 1.3.1-incubating and earlier versions:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_Intersects(df2.geom, df1.geom)

In Sedona 1.4.0 and later:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Intersects(df2.geom, df1.geom)

If you use ST_Contains:

In Sedona 1.3.1-incubating and earlier versions:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_CoveredBy(df2.geom, df1.geom)

In Sedona 1.4.0 and later:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)