made commonsparsefeatures more scalable...

amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.

http://keystone-ml.org/

Apache License 2.0

470 stars 117 forks source link

made commonsparsefeatures more scalable... #225

Closed tomerk closed 8 years ago

tomerk commented 8 years ago

at large numbers of machines and high feature counts.

See issue #183

etrain commented 8 years ago

This might be a good test case for OptimizableNode can we come up with a rule for when its better to do this vs. the old way?

shivaram commented 8 years ago

Isn't the difference just that we do a treeReduce instead of a reduce ? If so it should be possible to just configure treeReduce to behave like reduce when the data is small. Pasting the top function implementation below.

      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
      }.toArray.sorted(ord)

tomerk commented 8 years ago

Sure, I'll use that instead of the glom and sorted code.

shivaram commented 8 years ago

Yeah and if you use the bounded queue we shouldn't see any perf degradation in terms of compute

tomerk commented 8 years ago

Ah, spark's bounded priority queue is marked private. Should I copy the code or do something else?

shivaram commented 8 years ago

According to http://stackoverflow.com/questions/7878026/is-there-a-priorityqueue-implementation-with-fixed-capacity-and-custom-comparato there is one in Guava. Not sure what our take on using guava is. (@etrain ?) Otherwise the answer suggested there or inserted into a sorted set is not bad

etrain commented 8 years ago

I like the BoundedPriorityQueue approach. While adding a dependency on Guava for this is kind of a bummer, I'd rather that than reimplementing ourselves.

tomerk commented 8 years ago

@etrain BoundedPriorityQueue is a Spark private util, not a guava class. As for the guava one mentioned in the stack overflow post: It's not a full bounded priority queue implementation, and the comments on that answer warn that it shouldn't be used for bounded priority queues: "We are giving some consideration to deprecating MinMaxPriorityQueue, just because it tends to get misused in this way."

etrain commented 8 years ago

Alright - what do we need? A thing that takes in an iterator over the partition and returns the top k elements by inserting/removing elements from a binary heap or something? I guess this is probably simple enough to implement ourselves but i don't want to go too deep down a rabbit hole here. Maybe there's another implementation somewhere we should be using?

tomerk commented 8 years ago

bounded priority stuff pushed.

etrain commented 8 years ago

Hmm - do we not have unit tests for CommonSparseFeatures - seems like this would be a good time to put them in to make sure that the priority queue stuff isn't breaking some edge cases. Otherwise this LGTM.

tomerk commented 8 years ago

We already do, as part of the SparseFeaturesSuite

tomerk commented 8 years ago

Okay, I added an explicit guava dependency. We may have to keep an eye on this as we update Spark versions.

shivaram commented 8 years ago

How did you pick the version number ?

tomerk commented 8 years ago

It was what all the existing transitive dependencies on Guava had resolved to.

shivaram commented 8 years ago

Sounds good. LGTM