astrolabsoftware / spark3D

Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
https://astrolabsoftware.github.io/spark3D/
Apache License 2.0

Integer overflows for kNN search? #101

Open JulienPeloton opened 5 years ago

JulienPeloton commented 5 years ago

pyspark3d issue.

kNN search for data set sizes > 2G elements (i.e. above Int.MaxValue) seems to go crazy :D I was running kNN with k=1000 on a data set of 5,000,000,000 elements.

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.astrolabsoftware.spark3d.spatialOperator.SpatialQuery.KNN.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 556 in stage 0.0 failed 4
times, most recent failure: Lost task 556.3 in stage 0.0 (TID 967, 134.158.75.162, executor 3): 
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at com.google.common.collect.Ordering.leastOf(Ordering.java:708)
    at com.astrolabsoftware.spark3d.utils.Utils$.com$astrolabsoftware$spark3d$utils$Utils$$takeOrdered(Utils.scala:174)
    at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:154)
    at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:152)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
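For context, "Comparison method violates its general contract!" is TimSort detecting an ordering that is not transitive or not symmetric, and a classic way to get there is an integer overflow inside a comparator. A minimal sketch of the failure mode (not the actual spark3d code, just an illustration):

```scala
import java.util.Comparator

// Subtraction-based comparison: `a - b` wraps around once the operands are
// more than Int.MaxValue apart, so the sign of the result is wrong and the
// ordering stops being consistent. TimSort detects this and throws.
val broken: Comparator[Int] = new Comparator[Int] {
  override def compare(a: Int, b: Int): Int = a - b
}

// Integer.compare never overflows, so it always honors the contract.
val safe: Comparator[Int] = new Comparator[Int] {
  override def compare(a: Int, b: Int): Int = Integer.compare(a, b)
}

// broken.compare(Int.MinValue, 1) returns Int.MaxValue (positive), i.e. it
// claims Int.MinValue > 1; safe.compare(Int.MinValue, 1) returns -1.
```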

Note, the same problem appears regardless of whether we ask for distinct objects or not. My best guess is that we would need to trade Int for Long.
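If that guess is right, the fix is mechanical but touches every place where a size, offset, or index can exceed 2^31 - 1. A hedged sketch of the direction (the names are illustrative, not the actual Utils.scala code):

```scala
import java.util.Comparator

// Widen anything derived from the data set size to Long, and compare via
// java.lang.Long.compare / Double.compare instead of arithmetic tricks.
case class Neighbor(id: Long, distance: Double)  // id as Long, not Int

val byDistance: Comparator[Neighbor] = new Comparator[Neighbor] {
  // Double.compare defines a total order (NaN handled consistently),
  // which is exactly what TimSort's contract requires.
  override def compare(a: Neighbor, b: Neighbor): Int =
    java.lang.Double.compare(a.distance, b.distance)
}
```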

ADDED: Interestingly though, this does not happen in the pure Scala version. The difference that comes to mind is the default storage level for the RDD (None in Scala, MEMORY_ONLY in Python).
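One way to test that hypothesis from the Scala side would be to persist the RDD with the same storage level pyspark3d uses before running the query; if the Scala job then fails identically, caching is the trigger. A sketch (the helper name is made up):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Mimic the pyspark3d default (MEMORY_ONLY) on the Scala side, then re-run
// the same kNN query on the returned RDD.
def withPysparkStorageLevel[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_ONLY)
```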