amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Review use of FunctionNode #121

Open etrain opened 9 years ago

etrain commented 9 years ago

In several pipelines, we use FunctionNode to handle cases where, for example, an Estimator[A,B] doesn't return a Transformer[A,B], but instead returns a Transformer[C,D], or where there is no good meaning for a single-item transformation.

Currently, FunctionNode feels like a "catch-all" because the Transformer/Estimator APIs don't sufficiently cover some of the data transformation operations we need to support.
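To make the mismatch concrete, here is a rough sketch of the shapes involved (simplified stand-ins with assumed signatures, not the actual KeystoneML definitions):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Simplified stand-ins for the abstractions discussed here (assumed shapes, not the real API).
abstract class Transformer[A, B: ClassTag] extends Serializable {
  def apply(in: A): B                               // single-item semantics are required
  def apply(in: RDD[A]): RDD[B] = in.map(apply(_))  // bulk application derived from the per-item case
}

abstract class Estimator[A, B: ClassTag] extends Serializable {
  def fit(data: RDD[A]): Transformer[A, B]          // must return a Transformer over the same A and B
}

// The escape hatch: an arbitrary dataset-to-dataset function
// with no single-item contract to satisfy.
abstract class FunctionNode[In, Out] extends (In => Out) with Serializable
```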

One example of this is NGramsCounts, which takes a Seq[Seq[T]] and returns a model of type NGrams[T] => Int.
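Roughly, the shape is something like the following (types simplified and hypothetical, not the actual NGramsCounts code): the output is a lookup over n-grams, not a per-item Transformer.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical simplified n-gram key; the real NGrams[T] type may differ.
case class NGram[T](tokens: Seq[T])

// Something shaped like NGramsCounts: it consumes a whole corpus of token
// sequences and yields per-n-gram counts, i.e. effectively a lookup function
// NGram[T] => Int rather than a Transformer over single items.
def countNGrams[T](docs: RDD[Seq[T]], n: Int): Map[NGram[T], Int] =
  docs
    .flatMap(_.sliding(n).filter(_.length == n).map(w => NGram(w.toList)))
    .map(g => (g, 1))
    .reduceByKey(_ + _)
    .collect()
    .toMap
```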

Other examples include Windower and Sampler, which are used in the RandomPatchCifar pipeline. These nodes are different in that they do not operate on single items and are thus not transformers; to draw a database analogy, they act more like an Aggregator.
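For illustration, aggregate-style nodes look something like this (hypothetical simplified signatures; the real Windower and Sampler differ in their exact types):

```scala
import org.apache.spark.rdd.RDD

// Sampling only makes sense over a whole dataset, so there is no per-item apply to expose.
class Sampler[T](fraction: Double, seed: Long = 42L)
    extends (RDD[T] => RDD[T]) with Serializable {
  def apply(in: RDD[T]): RDD[T] =
    in.sample(withReplacement = false, fraction, seed)
}

// Windowing maps each input to many outputs, so it is not a one-to-one Transformer either.
class Windower[T](windowSize: Int, stride: Int)
    extends (RDD[Seq[T]] => RDD[Seq[T]]) with Serializable {
  def apply(in: RDD[Seq[T]]): RDD[Seq[T]] =
    in.flatMap(_.sliding(windowSize, stride).filter(_.length == windowSize))
}
```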

tomerk commented 9 years ago

What percent of these are only being used in the "fitting" part of a pipeline and not the "prediction" part?

tomerk commented 9 years ago

And are these all RDD to RDD?

concretevitamin commented 9 years ago

Just randomly chiming in: the aggregation pattern shows up in every query processing engine (and you're totally right, it's also in decades-old databases!), so I guess there's a reason for it.

tomerk commented 9 years ago

So after taking a closer look, it seems to me that the cases where we're using FunctionNode right now fall into one of the following:

Some questions I have about the Aggregators are: