Open rvo1994 opened 1 year ago
@rvo1994 For a point in poly join like this one, you might want to try using a grid index to speed the process up.
Take a look at the quickstart notebook to see how you can index your points and polygons using H3 then perform the join directly on the index for points within the so called 'core' cell ids (those wholly contained within the polygons) or with st_contains for cell ids that sit on the boundary of the polys.
I am trying to inner join delta table g (contains ~9M records but only 1 file of 80MB as it only consists of 2 columns: lat and lon) with delta table v (contains 40 complex polygons - biggest polygons has 2800 boundaries) on a st_contains condition (if polygon from table v contains point from table g):
I cancelled the job after 15 hours as it has only read ~1.5M records from table g (as observed from the SQL / Dataframe tab in the Spark UI) as it was still busy with "Filtering files for query". There is also no parallelization happening as there was only 1 task running. FYI, I am using DBR 11.3 LTS and Standard_DS3_v2 driver/worker with 4 nodes per executor.
Any suggestions on how to optimize this query? Thanks!