amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Optimization should happen automatically #170

Closed (etrain closed 9 years ago)

etrain commented 9 years ago

Users shouldn't have to worry about optimizing their pipelines; the system should handle this automatically. It would be good to have Optimizer.execute(pipeline) run automatically as the first step of pipeline.apply(inputRdd), or something along those lines.

Any major reasons why this can't work?

tomerk commented 9 years ago

We'll need to prevent an infinite recursion (but that shouldn't be too hard).

A bigger concern is that we want optimization to happen only once, since it may take a bit of time. But our pipelines are currently immutable, so there is nowhere to keep the optimized result, and optimization would run again every time you apply the pipeline to any data.

etrain commented 9 years ago

I'm quoting you on the first one when you become a rich and famous computer scientist. ;)

On the second part: I'm not terribly worried about this. We can build a catalog of pipelines and store them (and their optimized counterparts) somewhere, or we can add the optimized pipeline as internal state (an unassigned var in the constructor). These are important details, but ones we can figure out. For now, optimization is a short procedure, and one that we can afford to "waste" time doing multiple times.

The important part here is we want a clean separation of user-level concerns and system-level concerns.
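
As a rough illustration of the "catalog of pipelines" idea, here is a minimal sketch. It assumes a stand-in `Pipe` class and a placeholder optimizer; `Pipe`, `PipelineCatalog`, and the `optimizations` counter are hypothetical names for this example and not part of KeystoneML.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a pipeline; NOT the KeystoneML class.
final class Pipe[A, B](val f: A => B)

// Hypothetical catalog that remembers the optimized counterpart of each pipeline.
object PipelineCatalog {
  // Keys compare by object identity, because Pipe does not override equals/hashCode.
  private val cache = mutable.Map.empty[Pipe[_, _], Pipe[_, _]]

  // Counts optimizer runs, for illustration only.
  var optimizations = 0

  // Placeholder optimizer: a real one would rewrite the pipeline's structure.
  private def optimize[A, B](p: Pipe[A, B]): Pipe[A, B] = {
    optimizations += 1
    new Pipe[A, B](p.f)
  }

  // Returns the cached optimized pipeline, optimizing only on the first request.
  def optimized[A, B](p: Pipe[A, B]): Pipe[A, B] =
    cache.getOrElseUpdate(p, optimize(p)).asInstanceOf[Pipe[A, B]]
}
```

This keeps the pipeline itself immutable, at the cost of a global map that is not thread-safe as written and holds strong references to every pipeline it has seen.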

tomerk commented 9 years ago

Do you want this to happen for both the single-item "apply" and the bulk-RDD "apply"?

etrain commented 9 years ago

Probably both. What if the implementation is just this?

lazy val fittedPipe = Optimizer.execute(pipe)

def apply(stuff: RDD[X]) = fittedPipe.apply(stuff)
def apply(stuff: X) = fittedPipe.apply(stuff)

Or does it have to be more complicated than that?

tomerk commented 9 years ago

I'm pretty sure you'll still end up with the infinite recursion if you do that. What I'm going to do is just call the internal execution functions of the optimized pipeline directly, as opposed to its apply method.
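
To make the recursion concern concrete, here is a minimal sketch of the approach described in this comment. It assumes simplified stand-ins: `Pipeline`, `Optimizer`, `executeOne`, `executeAll`, and the `runs` counter are hypothetical, and `Seq` replaces `RDD` so the example runs without Spark. If `apply` delegated to `optimized.apply`, the optimized pipeline would optimize itself in turn and never terminate; calling its internal execution methods avoids that.

```scala
// Hypothetical placeholder optimizer; NOT the KeystoneML Optimizer.
object Optimizer {
  // Counts optimizer runs, for illustration only.
  var runs = 0

  def execute[A, B](p: Pipeline[A, B]): Pipeline[A, B] = {
    runs += 1
    new Pipeline[A, B](p.f) // a real optimizer would rewrite the pipeline here
  }
}

// Hypothetical stand-in for a pipeline, wrapping a plain function.
class Pipeline[A, B](val f: A => B) {
  // Optimized once, on first use, then cached by the lazy val.
  private lazy val optimized: Pipeline[A, B] = Optimizer.execute(this)

  // Internal execution paths: these never trigger optimization.
  private def executeOne(x: A): B = f(x)
  private def executeAll(xs: Seq[A]): Seq[B] = xs.map(f)

  // Public entry points call the optimized pipeline's internals directly,
  // not its apply method, so the optimized copy is never optimized again.
  def apply(x: A): B = optimized.executeOne(x)
  def apply(xs: Seq[A]): Seq[B] = optimized.executeAll(xs)
}
```

With this shape, `new Pipeline[Int, Int](_ + 1)` can be applied to a single item or to a `Seq`, and the placeholder optimizer runs once per pipeline instance.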