Open jwass opened 8 months ago
@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.
Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.
@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.
Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.
@jiayuasu What I really want to do is write out a large geoparquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve performance of remote spatial queries by bounding box. We have some solutions now to split by geohash/quadkey, but a partitioning scheme backed by a kdb-tree / r-tree / etc would be better. The fact that polygons' extents will cause overlaps of the spatial partitions is fine but we do need to assign each row to only one partition. I was hoping there was a way to use df.repartition
with the spatial rdd's partitioner to make it all work. But let me know if this is not the right use for this.
Is there a way to spatially partition a dataframe and write it out using that partitioning scheme (presumably by converting to/from a spatial rdd)? This is my guess as to how to accomplish this but I'm not sure if I'm misunderstanding things... I'm also relatively new to working with Spark and Sedona.
Expected behavior
Loading a dataframe, converting to rdd, spatially partition it, convert back to dataframe, and save the result - I'd expect the final dataframe partitioning to be preserved from the rdd.
Actual behavior
Adapter.toDf() does not preserve partitioning - or I'm doing something else wrong.
Steps to reproduce the problem
But it looked like that doesn't work - number of partitions written in df2 was far greater than 6.
Settings
Sedona version = 1.5.1
Apache Spark version = ?
API type = Python
Python version = ?
Environment = Databricks