Closed: dars1608 closed this issue 4 years ago
@dars1608 Hello, most likely, your cluster has some issues. GeoSpark does not change the scheduling module in Spark.
@jiayuasu Isn't the job scheduling dependent on the partitioning of the data?
@dars1608 Nope, they have no direct connection. I would suggest that you open the "stderr" on your history server to check what the actual error is. There might be some other errors that lead to the "dead" workers.
You were right, the problem (partially) occurred due to some of the spark-submit configurations. There may also be some bad values in the YARN configuration.
Thank you!
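For context, how many executors a YARN application actually receives is controlled by the resource settings passed at submission time, e.g. --num-executors, --executor-cores and --executor-memory on spark-submit, or the equivalent spark.executor.* properties set in code. The Scala sketch below only illustrates those properties; the values are placeholders, not the configuration used in this issue.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource settings only; the actual values used in this issue are not known.
val spark = SparkSession.builder()
  .appName("geospatial-index-distributed")
  .master("yarn")
  .config("spark.executor.instances", "8")            // how many executors YARN should allocate
  .config("spark.executor.cores", "4")                // cores per executor
  .config("spark.executor.memory", "4g")              // heap per executor
  .config("spark.dynamicAllocation.enabled", "false") // when enabled, idle executors are released
  .getOrCreate()
```

With dynamic allocation enabled instead, spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors bound how far YARN can scale the executor count.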
Expected behavior
Spark should use all of the resources given to the job.
Actual behavior
The job only runs on 2 workers. Screenshot from the History Server Web UI: http://deviantpics.com/image/problem.08E
Steps to reproduce the problem
Code snippet:
subscriptionsRDD represents a dataset consisting of 1000 polygons. inputRDD is a batch of point data, typically around 35000 points per batch. I tried KDBTree, QuadTree, RTree, and Voronoi diagram partitioning (this particular case was using KDBTree partitioning with a QuadTree index). I tried to manually set the number of partitions, but it didn't work. I also tried to change the dominant partitioning side for JoinQuery.SpatialJoinQuery. Full code is available here: https://gitlab.com/dars1608/geospatial-index-distributed
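The code snippet itself is not reproduced above, so the following is only a minimal sketch of the partition-then-join flow described here, assuming subscriptionsRDD is a GeoSpark PolygonRDD and inputRDD a PointRDD; the actual implementation is in the linked repository.

```scala
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.{PointRDD, PolygonRDD}

// Assumed inputs (built elsewhere, e.g. per streaming batch):
//   subscriptionsRDD: PolygonRDD  -- ~1000 polygons
//   inputRDD: PointRDD            -- ~35000 points per batch
def joinBatch(subscriptionsRDD: PolygonRDD, inputRDD: PointRDD) = {
  // analyze() computes the boundary envelope the partitioner needs
  inputRDD.analyze()
  subscriptionsRDD.analyze()

  // KDB-tree partitioning with the point side as the dominant side;
  // swapping which RDD calls spatialPartitioning(GridType...) and which
  // reuses getPartitioner changes the dominant partitioning side.
  inputRDD.spatialPartitioning(GridType.KDBTREE)
  subscriptionsRDD.spatialPartitioning(inputRDD.getPartitioner)

  // Quad-tree index on the spatially partitioned point RDD
  inputRDD.buildIndex(IndexType.QUADTREE, true)

  // useIndex = true, considerBoundaryIntersection = true;
  // returns pairs of (polygon, set of points matching it)
  JoinQuery.SpatialJoinQuery(inputRDD, subscriptionsRDD, true, true)
}
```

In this sketch, which RDD is partitioned first and which reuses the other's partitioner determines the dominant side mentioned above; whether the scheduler spreads the resulting partitions across all workers is a separate question governed by the cluster's resource configuration.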
Settings
GeoSpark version = 1.2.0
Apache Spark version = 2.4.0
JRE version = 1.8
API type = Scala