apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Save / Load indexed spatial & partitioned Rdd #1213

Open vbmacher opened 7 months ago

vbmacher commented 7 months ago

Expected behavior

Maybe this is already possible, but I haven't found it anywhere. I'm relatively new to Sedona and geo-processing. I'd like to be able to save, and later load, a spatial RDD that has already been analyzed and partitioned, and possibly indexed as well. In our use case, many jobs use the same spatial dataset, and it's time-consuming to redo the partitioning and rebuild the index every time. I'm not sure whether this is possible, though.

For example:

// save once:
val spatialRdd = Adapter.toSpatialRdd(df, ...)
spatialRdd.analyze()
spatialRdd.spatialPartitioning(GridType.KDBTREE, math.min(Integer.MAX_VALUE, df.count() / 2).toInt) // IllegalArgumentException: [Sedona] Number of partitions cannot be larger than half of total records num 
spatialRdd.buildIndex(IndexType.RTREE, true)
SomeSedonaUtility.saveSpatialRdd(spatialRdd, path) // <-- save with index and partitioned

// load & use multiple times:
val rdd = SomeSedonaUtility.loadSpatialRdd(path)

// and usage:
val otherRdd = Adapter.toSpatialRdd(otherDs, ...)
otherRdd.spatialPartitioning(rdd.getPartitioner)

val useIndex = true
val considerBoundaryIntersection = SpatialPredicate.COVERS
val params = new JoinQuery.JoinParams(useIndex, considerBoundaryIntersection, IndexType.RTREE, JoinBuildSide.LEFT)

val joined = JoinQuery.spatialJoin(rdd, otherRdd, params)

Actual behavior

Index & partitioning must be set at runtime (to my knowledge).

Steps to reproduce the problem

The feature is missing, so it's not possible to reproduce it.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = EMR

jiayuasu commented 7 months ago

@vbmacher Unfortunately, a spatial partitioned RDD cannot be saved and loaded back because it will lead to wrong results. See the explanation here: https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-spatialpartitioned-wo-indexed

vbmacher commented 7 months ago

Thanks @jiayuasu. I also read there that it is possible to save an indexed RDD (https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-indexed), but to my knowledge building an index requires spatial partitioning. So if I save the indexed RDD and then load it back, the partitioning won't be set up, but will the index still work?
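
For reference, the linked tutorial section saves the index built on the raw (non-partitioned) RDD as a distributed object file. Below is a minimal sketch of that round trip, assuming the Sedona 1.5.x core RDD API. The object `IndexedRddIo` and its two helpers are hypothetical names used for illustration and are not part of Sedona. Passing `false` as the second argument to `buildIndex` builds the index on the raw RDD, so no spatial partitioning is involved.

```scala
import org.apache.sedona.core.enums.IndexType
import org.apache.sedona.core.spatialRDD.SpatialRDD
import org.apache.spark.SparkContext
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.index.SpatialIndex

// Hypothetical helper object (not a Sedona API).
object IndexedRddIo {

  // Build an R-tree on the raw RDD (second argument `false` means the index
  // is NOT built on the spatially partitioned RDD), then persist the
  // per-partition indexes as a distributed object file.
  def saveIndexed(rdd: SpatialRDD[Geometry], path: String): Unit = {
    rdd.buildIndex(IndexType.RTREE, false)
    rdd.indexedRawRDD.saveAsObjectFile(path)
  }

  // Reload the per-partition indexes into a fresh generic SpatialRDD.
  // Only `indexedRawRDD` is populated; no partitioner is restored.
  def loadIndexed(sc: SparkContext, path: String): SpatialRDD[Geometry] = {
    val restored = new SpatialRDD[Geometry]
    restored.indexedRawRDD = sc.objectFile[SpatialIndex](path).toJavaRDD()
    restored
  }
}
```

An RDD reloaded this way has only `indexedRawRDD` set and carries no spatial partitioning, so the sketch does not cover the partitioned join case described in the original request.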

Also, I'd like to know more details about this statement, if possible:

"We are working on some solutions. Stay tuned!"

Is this something we can expect in the next release, perhaps? Thanks!