astrolabsoftware / spark3D

Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
https://astrolabsoftware.github.io/spark3D/
Apache License 2.0

On the partitioning: Octree partitioning does not keep all elements #75

Closed JulienPeloton closed 6 years ago

JulienPeloton commented 6 years ago

OS: CentOS Linux release 7.4.1708 (Core)
spark3D: 0.1.4
spark-fits: 0.6.0

#72 adds a script to benchmark the partitioning. The idea is the following:

1) Load the data using spark-fits (10 million points)
2) Apply a partitioning to the RDD, or leave it unpartitioned
3) Trigger an action and repeat it several times (the data is cached on the first pass)

Just printing the number of elements of the repartitioned RDD:

    // Load the data
    val options = Map("hdu" -> hdu)
    val pRDD = new Point3DRDD(spark, fn_fits, columns, isSpherical, "fits", options)

    // Partition it
    val rdd = mode match {
        case "nopart" => pRDD.rawRDD.cache()
        case "octree" => pRDD.spatialPartitioning(GridType.OCTREE).cache()
        case "onion" => pRDD.spatialPartitioning(GridType.LINEARONIONGRID).cache()
        case _ => throw new AssertionError("Choose between nopart, onion, or octree for the partitioning.")
    }

    // Repeat the action a few times to minimize flukes
    for (i <- 0 to 2) {
      val number = rdd.count()
      println(s"Number of points ($mode) : $number")
    }

I obtain:

    Number of points (nopart) : 10000000
    Number of points (octree) : 9999995
    Number of points (onion) : 10000000

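Weird? The octree partitioning loses 5 of the 10,000,000 points, while the onion partitioning keeps them all.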
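To go from a count mismatch to the actual missing elements, one option is to list the points that no grid cell claims. The sketch below is in plain Scala and does not use spark3D. `Point`, `Box`, `octants` and `unassigned` are hypothetical names made up for this illustration. The half-open containment test is an assumption chosen to show one generic way a grid can lose points (those lying exactly on the upper boundary); nothing in this thread says that this is what happened here.

```scala
// Hypothetical stand-ins for illustration only; these are NOT spark3D types.
final case class Point(x: Double, y: Double, z: Double)

final case class Box(min: Point, max: Point) {
  // Assumed half-open test: a point exactly on an upper face is not contained.
  def containsHalfOpen(p: Point): Boolean =
    p.x >= min.x && p.x < max.x &&
    p.y >= min.y && p.y < max.y &&
    p.z >= min.z && p.z < max.z
}

object PartitionCheck {
  // Split a box into its 8 octants around the box center.
  def octants(b: Box): Seq[Box] = {
    val cx = (b.min.x + b.max.x) / 2
    val cy = (b.min.y + b.max.y) / 2
    val cz = (b.min.z + b.max.z) / 2
    for {
      (x0, x1) <- Seq((b.min.x, cx), (cx, b.max.x))
      (y0, y1) <- Seq((b.min.y, cy), (cy, b.max.y))
      (z0, z1) <- Seq((b.min.z, cz), (cz, b.max.z))
    } yield Box(Point(x0, y0, z0), Point(x1, y1, z1))
  }

  // Points that no cell claims: these would vanish after partitioning.
  def unassigned(points: Seq[Point], cells: Seq[Box]): Seq[Point] =
    points.filterNot(p => cells.exists(_.containsHalfOpen(p)))
}
```

With a unit cube as the root box, a point such as `Point(1.0, 0.4, 0.4)` sits on the upper x face and is returned by `unassigned`, while interior points are not. Running the same kind of check on the real data would show whether the 5 missing points share a common property.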

mayurdb commented 6 years ago

Created #76 to resolve this issue.

JulienPeloton commented 6 years ago

Checked!

    Number of points (octree) : 10000000