amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
470 stars 117 forks source link

Internal dag update #270

Closed tomerk closed 8 years ago

tomerk commented 8 years ago

@etrain @shivaram See the google doc I linked in the keystone-dev slack channel for extra context.

Over 63% of this PR is unit tests, and documentation is much of the remainder.

Some feedback about naming I'd like is: What's a better package name than internal, or is that a good enough package name? Also, what's a better alternative name for Output? It's a little weird to me to be passing around Seq[Output] as the inputs to Operators and their methods.

etrain commented 8 years ago

Will you make the google doc public and post a link here! Thanks!

tomerk commented 8 years ago

https://docs.google.com/document/d/1Px7bwqxS6OA_vaB4yjzmhrxXaLErxa5qMmD80CwEXLg/edit?usp=sharing

etrain commented 8 years ago

All told this is a pretty great first cut at rewriting the internals and I think should set a nice foundation for optimization, etc.

Perhaps @shivaram can take a brief look at this with a copy of the doc above open and see if he agrees with the direction.

etrain commented 8 years ago

Re: naming - is Expression reasonable? Also, internal is no good, maybe workflow.graph for now?

etrain commented 8 years ago

And we should probably mark the whole thing as package protected or something :)

tomerk commented 8 years ago

Okay, I named it workflow.graph for now. And sure, I could try renaming Output to Expression.

shivaram commented 8 years ago

@tomerk really nice work with the design doc. I just left a few comments inline for things that could use some clarification. I haven't taken a detailed look at the logic of each of the methods used, but I like the API design and if we have test cases that go through all the code paths, I'd be fine with merging this.

etrain commented 8 years ago

I ran scoverage (https://github.com/scoverage/sbt-scoverage) against this PR and it looks like the unit tests basically caught everything but

  1. The Operator.label method isn't really called anywhere.
  2. Graph.nextNodeIds isn't ever called on an empty graph - is this a case we would run into?
  3. There are a bunch of checks in connectGraph and replaceNodes that we don’t ever feed in invalid input.

I don't particularly care about 1, but we should probably add some checks for 2 (if necessary) and 3.

Otherwise this is awesome!

tomerk commented 8 years ago

Okay, just added tests for cases 2 and 3!