Spatial Predicates in Sedona Python

Hi,

I see that in the PR [SEDONA-177] spatial predicates were implemented at the RDD level. However, I cannot find them in the Python libraries. Are they implemented for joins on SRDDs?

I am trying to find overlapping polygons within the same shapefile, which has about 4 million features. What I would normally do is one of the following:

(1) A self-join: SELECT t1.* FROM table AS t1 JOIN table AS t2 ON ST_Overlaps(t1.geometry, t2.geometry) WHERE t1.id <> t2.id, after which I would probably aggregate to find out how many geometries with different IDs overlap.
(2) A lateral join: SELECT t1.id, c.counter FROM table AS t1 LEFT JOIN LATERAL (SELECT count(*) AS counter FROM table AS t2 WHERE ST_Overlaps(t1.geometry, t2.geometry) AND t1.id <> t2.id) AS c

The first query runs for 15 minutes. The second query doesn't run at all because it is a correlated query, which is not allowed in Spark. So I wonder how much less time it would take to check the overlaps directly on SRDDs rather than running a SQL query on DataFrames. 15 minutes seems particularly long given that I run it on 8 workers with 4 cores each.

Does saving the files in a different format, such as Delta Lake or GeoParquet, speed anything up in particular?

Saving the results to GeoParquet or Delta Lake won't speed up the join.
An RDD-based spatial join is available in Sedona Python: https://sedona.apache.org/1.4.1/tutorial/rdd/#write-a-spatial-join-query
I would still recommend the SQL-based join, but you might want to tune its performance. See below:

Sedona optimizes two types of spatial joins.

(1) Regular inner join. It does not optimize regular left join, cross join, etc.

(2) Broadcast join. All types of broadcast joins: inner, left, cross, etc.

Sedona's inner join algorithm is designed to mitigate the impact of spatial data skewness. However, its performance is affected by the number of partitions and the dominant side of the join.

If the join is slow, try the following to improve Sedona's performance:

(1) Try to increase the number of partitions in your two input DataFrames. For example, df = df.repartition(1000)

(2) Try to switch the sides of the spatial join; this might improve the join performance.
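As a rough illustration of the repartitioning tip applied to the overlap question, here is a minimal sketch. It assumes `spark` is a session where Sedona's SQL functions are already registered and that the DataFrame has `id` and `geometry` columns; the view name `polygons` and both helper names are made up for this example. It replaces the lateral count with an inner self-join plus GROUP BY, so polygons that overlap nothing do not appear in the output instead of getting a count of 0.

```python
def overlap_count_sql(view="polygons", id_col="id", geom_col="geometry"):
    """Build a self-join that counts, per id, how many other geometries overlap it.

    The view and column names are assumptions for illustration only.
    """
    return (
        f"SELECT t1.{id_col} AS id, COUNT(*) AS counter "
        f"FROM {view} AS t1 JOIN {view} AS t2 "
        f"ON ST_Overlaps(t1.{geom_col}, t2.{geom_col}) "
        f"WHERE t1.{id_col} <> t2.{id_col} "
        f"GROUP BY t1.{id_col}"
    )


def count_overlaps(spark, df, num_partitions=1000, view="polygons"):
    """Repartition the input, register it as a temp view, then run the self-join.

    `spark` is assumed to be a Sedona-enabled SparkSession and `df` a DataFrame
    holding the polygons; 1000 partitions is only a starting point to tune.
    """
    df.repartition(num_partitions).createOrReplaceTempView(view)
    return spark.sql(overlap_count_sql(view))
```

Calling `count_overlaps(spark, df)` returns a DataFrame with one row per id that has at least one overlapping neighbour. The partition count is worth experimenting with for your cluster size.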

Rule of thumb:

The spatial partitioning grids (which directly affect the load balance of the workloads) should be built on the larger dataset in a spatial join. We call this dataset the dominant dataset.

If you use ST_Intersects:

In Sedona 1.3.1-incubating and earlier versions:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_Intersects(df2.geom, df1.geom)

In Sedona 1.4.0 and later:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Intersects(df2.geom, df1.geom)

If you use ST_Contains:

In Sedona 1.3.1-incubating and earlier versions:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_CoveredBy(df2.geom, df1.geom)

In Sedona 1.4.0 and later:

dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)
