amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Integrate Block Operators more neatly with the DAG #214

Open tomerk opened 8 years ago

tomerk commented 8 years ago

Currently, so that they chain easily with the rest of the code, block operators (such as block solves and block transformers) take a single complete RDD and manually split it into multiple blocks in a way that is hidden from the DAG.
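To make the pattern concrete, here is a rough sketch of the kind of internal split I mean. It uses plain Scala collections as a stand-in for RDDs, and `HiddenSplitSketch`, `Row`, and `splitIntoBlocks` are made-up names for illustration, not the actual KeystoneML code.

```scala
object HiddenSplitSketch {
  // Stand-in for one feature vector; a real operator would work on an RDD of vectors.
  type Row = Vector[Double]

  // Takes the complete dataset and returns one dataset per column block.
  // The split happens inside the operator, so an outer DAG only ever sees `data`.
  def splitIntoBlocks(data: Seq[Row], blockSize: Int): Seq[Seq[Row]] = {
    require(blockSize > 0, "blockSize must be positive")
    val numFeatures = data.headOption.map(_.length).getOrElse(0)
    (0 until numFeatures by blockSize).map { start =>
      // slice clamps at the end of the row, so the last block may be narrower
      data.map(_.slice(start, start + blockSize))
    }
  }
}
```

Because the per-block datasets are created inside the operator, nothing upstream can decide to cache or reuse an individual block.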

If we add some DAG rewriting rules to detect this and integrate block operators better with the DAG, we should be able to take advantage of optimizations like auto-caching more effectively, and we can allow the block operators to operate on blocks lazily.

etrain commented 8 years ago

One thing that makes the block solves tricky is that the blocks are not independent. That is, we pass a Seq[RDD[T]] because the solution to the second block depends on the solution to the first block. It is not clear to me how to capture this in the DAG.
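A toy illustration of that dependency, assuming a single Gauss-Seidel style pass where each block is one feature column and is fit against the residual left by the earlier blocks. This is a simplification with plain Scala vectors in place of RDDs, and `SequentialBlockSolve` is a hypothetical name, not the real solver.

```scala
object SequentialBlockSolve {
  // Stand-in for an RDD holding one feature block (a single column in this sketch).
  type Block = Vector[Double]

  private def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Fits one weight per block, in order. Each step reads the residual produced by
  // the previous step, so block k cannot be solved before blocks 0..k-1.
  // Assumes every block is non-zero and has the same length as `labels`.
  def solve(blocks: Seq[Block], labels: Vector[Double]): (Seq[Double], Vector[Double]) = {
    val init = (Vector.empty[Double], labels)
    blocks.foldLeft(init) { case ((weights, residual), block) =>
      val w = dot(block, residual) / dot(block, block)
      val newResidual = residual.zip(block).map { case (r, x) => r - w * x }
      (weights :+ w, newResidual)
    }
  }
}
```

The fold makes the chain explicit: the accumulator carried from one block to the next is exactly the state a DAG would have to represent as an edge between block nodes.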

tomerk commented 8 years ago

I think it should be able to work the same way the GatherTransformer works: a TransformerNode that takes multiple RDDs together as input.
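Roughly what I have in mind, as a minimal sketch rather than the real workflow classes. `Node`, `Source`, and `MultiInput` are hypothetical names invented for this example, and plain vectors stand in for RDDs.

```scala
object MultiInputNodeSketch {
  // Hypothetical, simplified DAG node; not KeystoneML's actual node types.
  sealed trait Node[T] { def eval(): T }

  // A leaf node that already holds its data (think: one block).
  final case class Source[T](data: T) extends Node[T] {
    def eval(): T = data
  }

  // A node with several upstream dependencies. The inputs are listed explicitly,
  // so a graph optimizer could inspect them, and they are only evaluated
  // when eval() is called.
  final case class MultiInput[A, B](inputs: Seq[Node[A]], f: Seq[A] => B) extends Node[B] {
    def eval(): B = f(inputs.map(_.eval()))
  }
}
```

Usage, with each block as its own node feeding a single multi-input node:

```scala
import MultiInputNodeSketch._

val blockNodes: Seq[Node[Vector[Double]]] = Seq(Source(Vector(1.0, 2.0)), Source(Vector(3.0)))
val total = MultiInput[Vector[Double], Double](blockNodes, bs => bs.map(_.sum).sum)
total.eval() // sums every element across both blocks
```

Since `f` receives the blocks as an ordered Seq, a block solve that needs the result of earlier blocks could still process them in sequence inside the node.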