apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.96k stars 695 forks source link

Preserve Spatial Partitioning From RDD to Dataframe #1268

Open jwass opened 8 months ago

jwass commented 8 months ago

Is there a way to spatially partition a dataframe and write it out using that partitioning scheme (presumably by converting to/from a spatial rdd)? This is my guess as to how to accomplish this but I'm not sure if I'm misunderstanding things... I'm also relatively new to working with Spark and Sedona.

Expected behavior

Loading a dataframe, converting to rdd, spatially partition it, convert back to dataframe, and save the result - I'd expect the final dataframe partitioning to be preserved from the rdd.

Actual behavior

Adapter.toDf() does not preserve partitioning - or I'm doing something else wrong.

Steps to reproduce the problem

df =  sedona.read.format("geoparquet").load(path)
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)

df2 = Adapter.toDf(rdd, spark)
df2.write.format("geoparquet").save(output_path)

But it looked like that doesn't work - number of partitions written in df2 was far greater than 6.

Settings

Sedona version = 1.5.1

Apache Spark version = ?

API type = Python

Python version = ?

Environment = Databricks

jiayuasu commented 8 months ago

@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.

Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.

jwass commented 8 months ago

@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.

Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.

@jiayuasu What I really want to do is write out a large geoparquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve performance of remote spatial queries by bounding box. We have some solutions now to split by geohash/quadkey, but a partitioning scheme backed by a kdb-tree / r-tree / etc would be better. The fact that polygons' extents will cause overlaps of the spatial partitions is fine but we do need to assign each row to only one partition. I was hoping there was a way to use df.repartition with the spatial rdd's partitioner to make it all work. But let me know if this is not the right use for this.