apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Complex polygon shapefile #414

Closed pedromorfeu closed 1 year ago

pedromorfeu commented 4 years ago

We are performing a join query with a shapefile, but it's taking too long. It's funny because the shapefile doesn't have many polygons; however, the polygons are complex.

Characteristics of the shapefile:

Sample polygons: [image]

Execution characteristics:

Questions:

GeoSpark version = 1.2.0
Apache Spark version = 2.3.0
JRE version = 1.8
API type = Scala

geoHeil commented 4 years ago

can you broadcast the shapefile? If it is small compared to the total size of the data, it might well be worth it.

pedromorfeu commented 4 years ago

@geoHeil the shapefile is too big (2.8GB) for broadcast.

geoHeil commented 4 years ago

@pedromorfeu well ... this depends on the size of the joined data. If the other side is 1 TB and you have 3 GB of shapes, it is fairly sensible to have executors with, say, 20 GB RAM each and 5 threads (cores), and then perform a map-side join like https://github.com/complexity-science-hub/distributed-POI-enrichment, in particular https://github.com/complexity-science-hub/distributed-POI-enrichment/blob/02d1bb2d375b20a360c7dd30b444a3acebe8f563/benchmark-spark/src/main/scala/at/csh/geoheil/poi/PoiEnrichmentBenchmarkService.scala#L257
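For reference, a minimal sketch of such a map-side join, assuming the (possibly simplified) polygons fit in executor memory as WKT strings and that GeoSpark 1.2's org.locationtech JTS is on the classpath; the `MapSideJoinSketch` wrapper, the parquet path, and the `Pt` schema are hypothetical, and it probes a plain JTS STRtree instead of going through GeoSpark's join machinery:

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree
import org.locationtech.jts.io.WKTReader

// Hypothetical input record: a point with an id.
case class Pt(id: Long, lon: Double, lat: Double)

object MapSideJoinSketch {
  def run(spark: SparkSession, polygonWkts: Seq[String]): Unit = {
    import spark.implicits._

    // Ship the (possibly simplified) polygons to every executor once.
    val bcWkts = spark.sparkContext.broadcast(polygonWkts)

    // Placeholder point dataset; replace with your real source.
    val points = spark.read.parquet("/path/to/points").as[Pt]

    val enriched = points.mapPartitions { iter =>
      // Build an in-memory STRtree once per partition, not once per row.
      val reader = new WKTReader()
      val gf = new GeometryFactory()
      val index = new STRtree()
      bcWkts.value.zipWithIndex.foreach { case (wkt, i) =>
        val geom = reader.read(wkt)
        index.insert(geom.getEnvelopeInternal, (i, geom))
      }
      iter.map { p =>
        val point = gf.createPoint(new Coordinate(p.lon, p.lat))
        // Envelope candidates still need an exact containment test.
        val matched = index.query(point.getEnvelopeInternal).toArray.collectFirst {
          case (i: Int, g: Geometry) if g.contains(point) => i
        }
        (p.id, matched)
      }
    }
    enriched.show(5, truncate = false)
  }
}
```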

Besides that: you could simplify the geometries, or potentially use a discrete global grid like H3 and then perform an equijoin after filling all the polygons with H3 hexagons of a certain size.
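A rough sketch of that H3 idea, assuming the pre-4.0 h3-java API (`H3Core`, `polyfill`, `geoToH3`) and simple single-ring polygons without holes; the `H3EquiJoinSketch` wrapper and the RDD shapes are illustrative only:

```scala
import java.util.Collections

import scala.collection.JavaConverters._

import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
import org.apache.spark.rdd.RDD

object H3EquiJoinSketch {
  // polygons: (polygonId, outer ring as (lat, lng) pairs); points: (pointId, lat, lng)
  def candidatePairs(polygons: RDD[(Long, Seq[(Double, Double)])],
                     points: RDD[(Long, Double, Double)],
                     res: Int): RDD[(Long, Long)] = {

    // Cover every polygon with hexagon cells at the chosen resolution.
    val polygonCells: RDD[(Long, Long)] = polygons.mapPartitions { iter =>
      val h3 = H3Core.newInstance() // one instance per partition; it is not serializable
      iter.flatMap { case (polyId, ring) =>
        val boundary = ring.map { case (lat, lng) => new GeoCoord(lat, lng) }.asJava
        h3.polyfill(boundary, Collections.emptyList[java.util.List[GeoCoord]](), res)
          .asScala.map(cell => (cell.longValue(), polyId))
      }
    }

    // Index each point to the single cell that contains it.
    val pointCells: RDD[(Long, Long)] = points.mapPartitions { iter =>
      val h3 = H3Core.newInstance()
      iter.map { case (pointId, lat, lng) => (h3.geoToH3(lat, lng, res), pointId) }
    }

    // Plain Spark equijoin on the cell id; no spatial partitioning involved.
    pointCells.join(polygonCells).values // (pointId, polygonId) candidate pairs
  }
}
```

The resolution is the precision/volume trade-off: finer cells approximate each polygon better but produce many more rows to join, and candidates near polygon borders still need an exact geometric check afterwards.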

pedromorfeu commented 4 years ago

As I'm experiencing data skew (one executor ends up with all the work by the end of the processing), is it possible to improve the partitioning somehow? I'm already passing a larger number of partitions (say, 1000) to spatialPartitioning:

```scala
tileSpatialRDD.spatialPartitioning(GridType.KDBTREE, partitions)
```

On the other hand, is it true that spatial operations over complex polygons are more CPU-intensive?

geoHeil commented 4 years ago

Yes, this might be true - depending on what you want to compute. You could consider:

The last one is the quickest to try. But if these complex polygons overlap a lot, you might only achieve what you want, with good performance, via one of the other methodologies. Also, do not forget to experiment with the various partitioning and indexing possibilities; a KDB-tree for the partitioning seems to be quite good in general.
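As a concrete starting point for those experiments, a sketch of the usual partition-then-index pattern, with hypothetical `PointRDD`/`PolygonRDD` inputs (swap in whichever geometry types you actually join, and try indexing the other side as well):

```scala
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.{PointRDD, PolygonRDD}

object PartitionAndIndexSketch {
  def join(pointRDD: PointRDD, polygonRDD: PolygonRDD, numPartitions: Int) = {
    pointRDD.analyze()
    polygonRDD.analyze()

    // Partition the larger (point) side spatially, then reuse its partitioner
    // on the polygon side so matching partitions end up co-located.
    pointRDD.spatialPartitioning(GridType.KDBTREE, numPartitions)
    polygonRDD.spatialPartitioning(pointRDD.getPartitioner)

    // Build a local index on the spatially partitioned point RDD.
    pointRDD.buildIndex(IndexType.QUADTREE, true)

    // useIndex = true, considerBoundaryIntersection = true
    JoinQuery.SpatialJoinQueryFlat(pointRDD, polygonRDD, true, true)
  }
}
```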

pedromorfeu commented 4 years ago

> 10x to 100x increase parallelism (repartition) and have geospark try to handle it

So, you recommend repartitioning prior to spatialPartitioning. Is that correct?

Which side of the join should be repartitioned: left, right, both, or the shapefile itself?

geoHeil commented 4 years ago

This depends. You want far more partitions than you have now. Usually, it makes sense to partition the larger side of the join for increased parallelism.
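A sketch of what that could look like, assuming `tiles` is the much larger side, `shapes` holds the shapefile, and GeoSpark 1.2's org.locationtech JTS packages; the partition count is only illustrative, and whether the raw repartition itself helps depends on how the RDD was loaded:

```scala
import org.datasyslab.geospark.enums.GridType
import org.datasyslab.geospark.spatialRDD.SpatialRDD
import org.locationtech.jts.geom.Geometry

object RepartitionSketch {
  // tiles is assumed to be the (much) larger side of the join.
  def partitionBoth(tiles: SpatialRDD[Geometry], shapes: SpatialRDD[Geometry], parts: Int): Unit = {
    // Raise parallelism on the raw RDD first, then partition spatially with a
    // matching count and reuse the same partitioner on the shapefile side.
    tiles.setRawSpatialRDD(tiles.getRawSpatialRDD.repartition(parts))
    tiles.analyze()
    tiles.spatialPartitioning(GridType.KDBTREE, parts)
    shapes.spatialPartitioning(tiles.getPartitioner)
  }
}
```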

pedromorfeu commented 4 years ago

Does the fact that the shapefile is of type multipolygon affect performance? Might it be improved by flattening the shapefile into its individual polygons?

pedromorfeu commented 4 years ago

@jiayuasu, can you please have your say on this one? Thanks.

jiayuasu commented 4 years ago

@pedromorfeu

Do your polygons overlap with each other a lot? Does the other dataset have lots of overlaps?

The key idea of spatial data partitioning is to partition spatial data by proximity, which speeds up query processing. If the polygons have a lot of overlaps, the performance of the spatial join will surely be slow.

How can we solve this? Unfortunately, no good solution right now.

But a very naive solution could be: use Spark to randomly cut your polygon dataset into X SpatialRDDs, join each of them with the left dataset (X joins in total), and merge the results together. X can be > 3.
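One possible reading of that suggestion in GeoSpark terms, assuming a point-versus-polygon join and that a `PolygonRDD` can be constructed directly from a `JavaRDD` of polygons; the `SplitJoinSketch` wrapper and parameter values are illustrative:

```scala
import org.apache.spark.api.java.JavaRDD
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.{PointRDD, PolygonRDD}

object SplitJoinSketch {
  def run(points: PointRDD, polygons: PolygonRDD, x: Int, partitions: Int) = {
    // Partition and index the left (point) side once.
    points.analyze()
    points.spatialPartitioning(GridType.KDBTREE, partitions)
    points.buildIndex(IndexType.QUADTREE, true)

    // Randomly cut the raw polygon RDD into X roughly equal slices.
    val slices = polygons.rawSpatialRDD.rdd.randomSplit(Array.fill(x)(1.0))

    val partialResults = slices.map { slice =>
      val piece = new PolygonRDD(JavaRDD.fromRDD(slice))
      piece.analyze()
      piece.spatialPartitioning(points.getPartitioner)
      // Join this slice against the full point RDD.
      JoinQuery.SpatialJoinQueryFlat(points, piece, true, true).rdd
    }

    // Merge the X partial results back together.
    partialResults.reduce(_ union _)
  }
}
```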

pedromorfeu commented 4 years ago

The polygons don't overlap, they just happen to be complex.

Here's an example: [image]

I understand the spatial partitioning benefits. But in this scenario, it's causing skewness.

Thanks for the naive solution, I'll give it a try.

pedromorfeu commented 4 years ago

@jiayuasu, I tried the naive way. I have the sense that the application runs more smoothly now. However, the performance improvement isn't much.

I just noticed yesterday that there's a polygon with 1 million coordinates. Also, in a local profiling session, I see a lot of time spent in MonotoneChain#computeIntersections and SimpleMCSweepLineIntersector#computeIntersections. Does that give a hint about what's happening?

I'm starting to think about alternatives. One is rasterizing the shapefile to a grid of 20m x 20m, which matches ours. Does GeoSpark provide utilities for doing this?

pedromorfeu commented 4 years ago

@jiayuasu, please comment.

jiayuasu commented 4 years ago

@pedromorfeu Currently, GeoSpark does not provide a function to rasterize the shapefile to grids. "a polygon of 1 million coordinates" is really an issue that cannot be solved directly by GeoSpark. It just takes time.

netanel246 commented 4 years ago

@pedromorfeu, maybe you can use GeoSpark's Simplify Preserve Topology function to reduce the number of vertices.

Good explanation: ST_SimplifyPreserveTopology
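For completeness, a minimal sketch of how that suggestion could be applied through GeoSpark SQL; the `polygons` view, the column names, and the tolerance of 10.0 (in the units of the data's CRS) are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

val spark = SparkSession.builder().appName("simplify-polygons").getOrCreate()
GeoSparkSQLRegistrator.registerAll(spark)

// Assumes a registered view "polygons" with an id and a geometry column.
spark.sql("""
  SELECT id, ST_SimplifyPreserveTopology(geometry, 10.0) AS geometry
  FROM polygons
""").createOrReplaceTempView("simplified_polygons")
```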

pedromorfeu commented 4 years ago

@netanel246, thanks for your suggestion. It's a very useful feature that I didn't know was implemented in GeoSpark.

However, we cannot afford to lose precision in our polygons.