Closed tomerk closed 8 years ago
This might be a good test case for OptimizableNode
Can we come up with a rule for when it's better to do this vs. the old way?
Isn't the difference just that we do a `treeReduce` instead of a `reduce`? If so, it should be possible to just configure `treeReduce` to behave like `reduce` when the data is small. Pasting the `top` function implementation below.
```scala
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
```
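As a plain-Scala illustration of why the per-partition queues compose in that reduce step (this is a sketch, not the Spark code: `Seq`s stand in for partitions, and `sorted.takeRight` stands in for the bounded queue):

```scala
// Keeping the num largest elements within each partition, then the num
// largest of the merged candidates, yields the same result as sorting
// everything at once -- which is why the per-partition queues can be
// merged pairwise in any order by reduce (or treeReduce).
val num = 3
val partitions = Seq(Seq(4, 18, 2, 9), Seq(11, 7, 25), Seq(1, 30, 6, 3))

// Per-partition "bounded queue": keep only the num largest elements.
val perPartition = partitions.map(_.sorted.takeRight(num))

// Reduce step: merge the candidate sets and keep the num largest again.
val topK = perPartition.flatten.sorted.takeRight(num)
// topK == Seq(18, 25, 30), same as partitions.flatten.sorted.takeRight(num)
```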
Sure, I'll use that instead of the glom and sorted code.
Yeah, and if you use the bounded queue we shouldn't see any perf degradation in terms of compute.
Ah, Spark's bounded priority queue is marked private. Should I copy the code or do something else?
According to http://stackoverflow.com/questions/7878026/is-there-a-priorityqueue-implementation-with-fixed-capacity-and-custom-comparato there is one in Guava. Not sure what our take on using Guava is. (@etrain ?) Otherwise, the approach suggested in that answer, or inserting into a sorted set, is not bad.
I like the `BoundedPriorityQueue` approach. While adding a dependency on Guava for this is kind of a bummer, I'd rather do that than reimplement it ourselves.
@etrain `BoundedPriorityQueue` is a Spark-private util, not a Guava class. As for the Guava one mentioned in the Stack Overflow post: it's not a full bounded priority queue implementation, and the comments on that answer warn that it shouldn't be used for bounded priority queues:

"We are giving some consideration to deprecating MinMaxPriorityQueue, just because it tends to get misused in this way."
Alright - what do we need? A thing that takes in an iterator over the partition and returns the top k elements by inserting/removing elements from a binary heap or something? I guess this is probably simple enough to implement ourselves, but I don't want to go too deep down a rabbit hole here. Maybe there's another implementation somewhere we should be using?
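For what it's worth, a self-contained version of such a structure is fairly small. This is a hypothetical sketch (class name and API are made up, not the Spark-private `BoundedPriorityQueue`), built on `scala.collection.mutable.PriorityQueue`:

```scala
import scala.collection.mutable

// Hypothetical minimal bounded priority queue: keeps the maxSize largest
// elements seen so far. Internally a min-heap (reversed ordering), so the
// smallest retained element sits at the head and is the cheapest to evict.
class SimpleBoundedPriorityQueue[T](maxSize: Int)(implicit ord: Ordering[T]) {
  // mutable.PriorityQueue dequeues the *largest* element per its ordering,
  // so reverse ord to make the smallest retained element the head.
  private val underlying = mutable.PriorityQueue.empty[T](ord.reverse)

  def +=(elem: T): this.type = {
    if (underlying.size < maxSize) {
      underlying.enqueue(elem)
    } else if (ord.gt(elem, underlying.head)) {
      // elem beats the current minimum: evict the minimum, keep elem.
      underlying.dequeue()
      underlying.enqueue(elem)
    }
    this
  }

  // Merge another queue into this one (the reduce-side operation).
  def ++=(other: SimpleBoundedPriorityQueue[T]): this.type = {
    other.underlying.foreach(this += _)
    this
  }

  def toSeq: Seq[T] = underlying.toSeq
}
```

Each insert is O(log maxSize), and the merge in `++=` preserves the keep-the-largest-k invariant regardless of insertion order, which is what makes the pairwise reduce correct.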
Bounded priority stuff pushed.
Hmm - do we not have unit tests for `CommonSparseFeatures`? Seems like this would be a good time to put them in, to make sure that the priority queue stuff isn't breaking some edge cases. Otherwise this LGTM.
We already do, as part of the `SparseFeaturesSuite`.
Okay, I added an explicit guava dependency. We may have to keep an eye on this as we update Spark versions.
How did you pick the version number?
It was what all the existing transitive dependencies on Guava had resolved to.
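For reference, pinning an explicit Guava dependency in sbt looks roughly like this (a sketch; the version string is a placeholder for whatever the existing transitive dependencies resolve to, which is not stated in this thread):

```scala
// Hypothetical sbt fragment; "<resolved-version>" is a placeholder.
libraryDependencies += "com.google.guava" % "guava" % "<resolved-version>"
```

Pinning it explicitly just makes the currently-implicit resolution visible, which is why it needs watching across Spark upgrades.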
Sounds good. LGTM
at large numbers of machines and high feature counts.
See issue #183