Internal dag update - Githubissues

amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.

http://keystone-ml.org/

Apache License 2.0

470 stars 117 forks source link

Internal dag update #270

Closed tomerk closed 8 years ago

tomerk commented 8 years ago

@etrain @shivaram See the google doc I linked in the keystone-dev slack channel for extra context.

Over 63% of this PR is unit tests, and documentation is much of the remainder.

Some feedback about naming I'd like is: What's a better package name than internal, or is that a good enough package name? Also, what's a better alternative name for Output? It's a little weird to me to be passing around Seq[Output] as the inputs to Operators and their methods.

etrain commented 8 years ago

Will you make the google doc public and post a link here! Thanks!

tomerk commented 8 years ago

https://docs.google.com/document/d/1Px7bwqxS6OA_vaB4yjzmhrxXaLErxa5qMmD80CwEXLg/edit?usp=sharing

etrain commented 8 years ago

All told this is a pretty great first cut at rewriting the internals and I think should set a nice foundation for optimization, etc.

Perhaps @shivaram can take a brief look at this with a copy of the doc above open and see if he agrees with the direction.

etrain commented 8 years ago

Re: naming - is Expression reasonable? Also, internal is no good, maybe workflow.graph for now?

etrain commented 8 years ago

And we should probably mark the whole thing as package protected or something :)

tomerk commented 8 years ago

Okay, I named it workflow.graph for now. And sure, I could try renaming Output to Expression.

shivaram commented 8 years ago

@tomerk really nice work with the design doc. I just left a few comments inline for things that could use some clarification. I haven't taken a detailed look at the logic of each of the methods used, but I like the API design and if we have test cases that go through all the code paths, I'd be fine with merging this.

etrain commented 8 years ago

I ran scoverage (https://github.com/scoverage/sbt-scoverage) against this PR and it looks like the unit tests basically caught everything but

The Operator.label method isn't really called anywhere.
Graph.nextNodeIds isn't ever called on an empty graph - is this a case we would run into?
There are a bunch of checks in connectGraph and replaceNodes that we don’t ever feed in invalid input.

I don't particularly care about 1, but we should probably add some checks for 2 (if necessary) and 3.

Otherwise this is awesome!

tomerk commented 8 years ago

Okay, just added tests for cases 2 and 3!