Closed mayurdb closed 6 years ago
This change ensures that the number of partitions created is less than or equal to the number specified by the user. Currently, the code has:
// taking 20% of the data as a sample
val sampleSize = (rawRDD.count * 0.2).asInstanceOf[Int]
val samples = rawRDD.takeSample(false, sampleSize, 12).toList.map(x => x.getEnvelope)
val maxLevels = floor(log(numPartitions)/log(8))
val maxItemsPerBox = ceil(sampleSize/pow(8, maxLevels))
val octree = new Octree(getDataEnvelope, 0, maxItemsPerBox, maxLevels)
cc @JulienPeloton
OK sounds good -- I will have a closer look tomorrow.
A few comments on this just to understand.
val samples = rawRDD.takeSample(false, sampleSize, 12).toList.map(x => x.getEnvelope)
The number 12 is the seed. I understand its value doesn't matter much, but why should it be hardcoded? Maybe it would be more reasonable to make it an optional argument with a default value? Or let the user set it afterwards?
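To make the suggestion concrete, here is a minimal sketch of a seed exposed as an optional argument. It is Spark-free so it can run on its own: `SamplingSketch`, `sample`, `fraction` and the use of `scala.util.Random` are hypothetical stand-ins, not the project's code. In the real method the `seed` parameter would simply be forwarded as the third argument of `rawRDD.takeSample`.

```scala
import scala.util.Random

// Hypothetical stand-in for the sampling step; names are illustrative only.
object SamplingSketch {
  /** Draws a fraction of `data` without replacement.
    * The seed defaults to 12 (the value currently hardcoded) but can be overridden.
    */
  def sample[T](data: Seq[T], fraction: Double = 0.2, seed: Long = 12L): Vector[T] = {
    val sampleSize = (data.size * fraction).toInt
    // A fixed seed makes the shuffle, and therefore the sample, reproducible.
    new Random(seed).shuffle(data.toVector).take(sampleSize)
  }
}
```

Callers who do not care keep the current behaviour with `SamplingSketch.sample(data)`, while a test can pin or vary the seed with `SamplingSketch.sample(data, seed = 42L)`.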
val maxLevels = floor(log(numPartitions)/log(8))
According to your handwritten note, maxLevels should be less than or equal to log(numPartitions)/log(8). Here you set it equal to that bound. Could this be slightly dangerous if the data is very skewed (you would create a great many partitions for a handful of points isolated from the others)? I do not know exactly how to deal with this extreme case, but we should keep it in mind for the future, and maybe run a test on simulated data to measure the loss.
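For reference, the bound works because a full octree of depth L has 8^L leaves, so choosing the largest L with 8^L <= numPartitions keeps the leaf count within the requested number. The sketch below computes that L with integer arithmetic only, which sidesteps any rounding in the log ratio. `OctreeDepth`, `maxLevels` and `maxLeaves` are hypothetical names used for illustration, not part of the PR.

```scala
// Hypothetical helper; illustrates the depth bound discussed above.
object OctreeDepth {
  /** Largest L such that 8^L <= numPartitions. */
  def maxLevels(numPartitions: Int): Int = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    var levels = 0
    var nextLeaves = 8L // leaf count of a full octree one level deeper
    while (nextLeaves <= numPartitions) {
      levels += 1
      nextLeaves *= 8 // stays far below Long.MaxValue for any Int input
    }
    levels
  }

  /** Leaf count of a full octree at that depth: 8^L = 2^(3L). */
  def maxLeaves(numPartitions: Int): Long = 1L << (3 * maxLevels(numPartitions))
}
```

For example, a request for 100 partitions gives a depth of 2 and at most 64 leaves, so a sizeable part of the requested parallelism can go unused when numPartitions sits just below a power of 8.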
val maxItemsPerBox = ceil(sampleSize/pow(8, maxLevels))
Also, it would be good to add a condition so that maxItemsPerBox never exceeds Int.MaxValue. This is highly unlikely, but we never know what users will use as data ;-)
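One possible shape for that guard, as a sketch only: compute the capacity as a Double, then clamp it before converting to Int. `BoxCapacity` and the `Long` type chosen for `sampleSize` are assumptions made for illustration; the actual fix may look different.

```scala
// Hypothetical helper; shows an explicit clamp to Int.MaxValue.
object BoxCapacity {
  /** ceil(sampleSize / 8^maxLevels), capped at Int.MaxValue. */
  def maxItemsPerBox(sampleSize: Long, maxLevels: Int): Int = {
    require(sampleSize >= 0, "sampleSize must be non-negative")
    require(maxLevels >= 0, "maxLevels must be non-negative")
    val leaves = math.pow(8, maxLevels)        // number of leaf boxes, as a Double
    val perBox = math.ceil(sampleSize / leaves)
    // Clamp explicitly so the cap is visible in the code rather than implicit in the conversion.
    math.min(perBox, Int.MaxValue.toDouble).toInt
  }
}
```

With a sample of 1000 items and two levels, this gives 16 items per box (1000 / 64 rounded up).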
Addressed in #56