amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Review use of FunctionNode #121

Open etrain opened 9 years ago

etrain commented 9 years ago

In several pipelines, we use FunctionNode to handle cases where, for example, an Estimator[A,B] doesn't return a Transformer[A,B], but instead returns a Transformer[C,D], or where there is no good meaning for a single-item transformation.

Currently, FunctionNode feels like a "catch-all" because the Transformer/Estimator APIs don't sufficiently cover some of the data transformation operations we need to support.
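To make the mismatch concrete, here is a rough sketch of the shapes involved (simplified stand-ins with assumed signatures, not the actual KeystoneML definitions):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Simplified stand-ins for the abstractions discussed here (assumed shapes, not the real API).
abstract class Transformer[A, B: ClassTag] extends Serializable {
  def apply(in: A): B                               // single-item semantics are required
  def apply(in: RDD[A]): RDD[B] = in.map(apply(_))  // bulk application derived from the per-item case
}

abstract class Estimator[A, B: ClassTag] extends Serializable {
  def fit(data: RDD[A]): Transformer[A, B]          // must return a Transformer over the same A and B
}

// The escape hatch: an arbitrary dataset-to-dataset function
// with no single-item contract to satisfy.
abstract class FunctionNode[In, Out] extends (In => Out) with Serializable
```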

One example of this is NGramsCounts, which takes a Seq[Seq[T]] and returns a model of type NGrams[T] => Int.
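Roughly, the shape is something like the following (types simplified and hypothetical, not the actual NGramsCounts code): the output is a lookup over n-grams, not a per-item Transformer.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical simplified n-gram key; the real NGrams[T] type may differ.
case class NGram[T](tokens: Seq[T])

// Something shaped like NGramsCounts: it consumes a whole corpus of token
// sequences and yields per-n-gram counts, i.e. effectively a lookup function
// NGram[T] => Int rather than a Transformer over single items.
def countNGrams[T](docs: RDD[Seq[T]], n: Int): Map[NGram[T], Int] =
  docs
    .flatMap(_.sliding(n).filter(_.length == n).map(w => NGram(w.toList)))
    .map(g => (g, 1))
    .reduceByKey(_ + _)
    .collect()
    .toMap
```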

Other examples include Windower and Sampler, which are used in the RandomPatchCifar pipeline. These nodes are different in that they do not operate on single items and are thus not transformers; to draw a database analogy, they act more like an Aggregator.
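For illustration, aggregate-style nodes look something like this (hypothetical simplified signatures; the real Windower and Sampler differ in their exact types):

```scala
import org.apache.spark.rdd.RDD

// Sampling only makes sense over a whole dataset, so there is no per-item apply to expose.
class Sampler[T](fraction: Double, seed: Long = 42L)
    extends (RDD[T] => RDD[T]) with Serializable {
  def apply(in: RDD[T]): RDD[T] =
    in.sample(withReplacement = false, fraction, seed)
}

// Windowing maps each input to many outputs, so it is not a one-to-one Transformer either.
class Windower[T](windowSize: Int, stride: Int)
    extends (RDD[Seq[T]] => RDD[Seq[T]]) with Serializable {
  def apply(in: RDD[Seq[T]]): RDD[Seq[T]] =
    in.flatMap(_.sliding(windowSize, stride).filter(_.length == windowSize))
}
```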

tomerk commented 9 years ago

What percent of these are only being used in the "fitting" part of a pipeline and not the "prediction" part?

tomerk commented 9 years ago

And are these all RDD to RDD?

concretevitamin commented 9 years ago

Just randomly chiming in: the aggregation pattern shows up in every query processing engine (and you're totally right, it's also in decades-old databases!), so I guess there's a reason for it.

tomerk commented 9 years ago

So after taking a closer look, it seems to me that the cases where we're using FunctionNode right now fall into one of the following:

Some questions I have about the Aggregators are: