etrain closed this issue 9 years ago
We'll need to prevent an infinite recursion (but that shouldn't be too hard).
The bigger concern is that we want optimization to happen only once, since it may take a bit of time. But our pipelines are currently immutable, so it would happen every time you apply one to any data.
I'm quoting you on the first one when you become a rich and famous computer scientist. ;)
On the second part, I'm not terribly worried about this. We can build a catalog of pipelines and store them (and their optimized counterparts) somewhere, or we can add the optimized pipeline as internal state (an unassigned var in the constructor). These are important details, but ones we can figure out. For now, optimization is a short procedure, and one that we can afford to "waste" time doing multiple times.
The important part here is we want a clean separation of user-level concerns and system-level concerns.
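To make the catalog option concrete, here is a rough sketch of what it could look like. Everything in it is hypothetical: `SimplePipeline`, `PipelineCatalog`, `optimizedFor`, and the placeholder `optimize` step are stand-ins invented for illustration, not the project's actual classes or optimizer.

```scala
import scala.collection.mutable

// Hypothetical minimal pipeline type, for illustration only.
final class SimplePipeline[A, B](val run: A => B)

// Hypothetical catalog that remembers the optimized counterpart of each pipeline.
object PipelineCatalog {
  // Counts how often the (placeholder) optimizer actually runs.
  var optimizations = 0

  // Keys are compared by reference, because SimplePipeline does not override equals.
  // Note: this map is not thread-safe and keeps every pipeline alive.
  private val cache = mutable.Map.empty[AnyRef, AnyRef]

  // Placeholder for the real optimizer: it only copies the pipeline.
  private def optimize[A, B](p: SimplePipeline[A, B]): SimplePipeline[A, B] = {
    optimizations += 1
    new SimplePipeline[A, B](p.run)
  }

  // Returns the stored optimized pipeline, optimizing only on the first request.
  def optimizedFor[A, B](p: SimplePipeline[A, B]): SimplePipeline[A, B] =
    cache.getOrElseUpdate(p, optimize(p)).asInstanceOf[SimplePipeline[A, B]]
}
```

With this shape the pipelines themselves stay immutable, and the user never touches the catalog directly. The trade-off is global mutable state that would need synchronization and some eviction policy in a real system.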
Do you want this to happen for both the single-item "apply" and the bulk-RDD "apply"?
Probably both. What if the implementation is just this?
lazy val fittedPipe = Optimizer.execute(pipe)
def apply(stuff: RDD[X]) = fittedPipe.apply(stuff)
def apply(stuff: X) = fittedPipe.apply(stuff)
Or does it have to be more complicated than that?
I'm pretty sure you'll still end up with the infinite recursion if you do that, because the optimized pipeline's own apply would trigger optimization again. What I'm going to do is just call the internal execution functions of the optimized pipeline directly, as opposed to its apply method.
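A minimal sketch of that approach, under some loud assumptions: `Pipeline`, `Optimizer`, `executeSingle`, and `executeBulk` here are simplified stand-ins invented for this example, `Seq` stands in for `RDD` so the snippet runs without Spark, and the "optimizer" only copies the pipeline.

```scala
// Hypothetical stand-in for the real optimizer; it only copies the pipeline.
object Optimizer {
  // Counts optimizer invocations, for illustration only.
  var runs = 0

  def execute[A, B](pipe: Pipeline[A, B]): Pipeline[A, B] = {
    runs += 1
    new Pipeline[A, B](pipe.rawFn)
  }
}

// Hypothetical simplified pipeline wrapping a single function.
class Pipeline[A, B](val rawFn: A => B) {
  // Optimized at most once per instance, on first use.
  private lazy val optimized: Pipeline[A, B] = Optimizer.execute(this)

  // Internal execution paths: they run the pipeline without touching `optimized`.
  private def executeSingle(in: A): B = rawFn(in)
  private def executeBulk(in: Seq[A]): Seq[B] = in.map(rawFn)

  // Public entry points call the optimized pipeline's internal functions.
  // Calling optimized.apply(in) instead would force optimized.optimized,
  // then optimized.optimized.optimized, and so on without end.
  def apply(in: A): B = optimized.executeSingle(in)
  def apply(in: Seq[A]): Seq[B] = optimized.executeBulk(in)
}
```

For example, `new Pipeline[Int, Int](_ + 1)` can be applied to a single `Int` and to a `Seq[Int]`, and the optimizer runs once for that instance across both calls. Scala's class-level `private` allows one instance to call another instance's private methods, which is what makes the internal call possible here.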
Users shouldn't have to worry about optimizing their pipelines. The system should handle this automatically. It would be good to have Optimizer.execute(pipeline) happen automatically as a first step of pipeline.apply(inputRdd), or something. Any major reasons why this can't work?