robertnagy1 closed this issue 11 months ago
Regular join: only the inner join is optimized; regular left join, cross join, etc. are not. Broadcast join: all types of broadcast joins are optimized: inner, left, cross, etc.
Sedona's inner join algorithm is designed to mitigate the impact of spatial data skewness. However, its performance is affected by the number of partitions and the dominant side of the join.
(1) Try to increase the number of partitions of your two input DataFrames, for example df = df.repartition(1000). (2) Try to switch the sides of the spatial join; this might improve the join performance.
The spatial partitioning grids (which directly affect the load balance of the workload) should be built on the larger dataset in a spatial join. We call this dataset the dominant dataset.
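As a rough, self-contained analogy (plain Python in one dimension, not Sedona's actual implementation), the sketch below shows why building partition boundaries from the larger, skewed dataset balances the load much better than boundaries derived from the other, smaller side:

```python
# Toy 1-D analogy of spatial partitioning (NOT Sedona's code):
# partition boundaries taken from quantiles of the dominant (large, skewed)
# dataset spread the work evenly; boundaries from the small dataset do not.
import random

random.seed(42)

big = sorted(random.random() ** 3 for _ in range(100_000))  # large, skewed
small = sorted(random.random() for _ in range(100))         # small, uniform

def boundaries(sample, n_parts):
    """Partition boundaries placed at the sample's quantiles."""
    step = len(sample) // n_parts
    return [sample[i * step] for i in range(1, n_parts)]

def max_load(data, bounds):
    """Size of the most loaded partition when `data` is split at `bounds`."""
    counts = [0] * (len(bounds) + 1)
    for v in data:
        i = sum(b <= v for b in bounds)  # index of v's partition
        counts[i] += 1
    return max(counts)

# Boundaries from the dominant dataset: near-even ~10,000-record partitions.
# Boundaries from the small dataset: one partition absorbs most of the skew.
print(max_load(big, boundaries(big, 10)))
print(max_load(big, boundaries(small, 10)))
```

The same intuition carries over to two dimensions: sampling the dominant dataset when building the grid keeps any single partition from receiving a disproportionate share of the geometries.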
In Sedona 1.3.1-incubating and earlier versions:
dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)
dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_Intersects(df2.geom, df1.geom)
In Sedona 1.4.0 and later:
dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Intersects(df1.geom, df2.geom)
dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Intersects(df2.geom, df1.geom)
In Sedona 1.3.1-incubating and earlier versions:
dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)
dominant dataset is df2: SELECT * FROM df1, df2 WHERE ST_CoveredBy(df2.geom, df1.geom)
In Sedona 1.4.0 and later:
dominant dataset is df1: SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)
dominant dataset is df2: SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)
Hi,
I see that in PR [SEDONA-177], spatial predicates were implemented at the RDD level. Somehow I cannot find this in the Python libraries. Are they implemented for joins on SRDDs?
I am trying to find overlapping polygons within the same shapefile, which has about 4 million features. What I would normally do is either:
The first query runs for 15 minutes; the second query doesn't run at all because it is a correlated query, which is not allowed in Spark. So I wonder how much less time it would take to check the overlaps directly on SRDDs rather than running a SQL query on DataFrames. I find that this takes a particularly long time given that I run it on 8 workers with 4 cores each.
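For the overlap check itself, one pattern that stays within the inner join Sedona optimizes (a sketch; the view name `polygons` and the columns `id` and `geom` are placeholders for your data) is a spatial self-join that deduplicates pairs:

```sql
-- Sketch: overlapping polygon pairs within one dataset.
-- `polygons(id, geom)` is a placeholder temp view; adjust names as needed.
SELECT a.id, b.id
FROM polygons a, polygons b
WHERE ST_Intersects(a.geom, b.geom)
  AND a.id < b.id   -- drops self-matches and duplicate (b, a) pairs
```

Because this is a plain inner equi-style spatial join (no subquery), Sedona's optimized spatial join should apply, unlike a correlated subquery.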
Would saving the files in a different format, such as Delta Lake or GeoParquet, speed anything up in particular?