amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
470 stars 117 forks source link

Internal dag execution #276

Closed tomerk closed 8 years ago

tomerk commented 8 years ago

Introduces user APIs that build on top of the previously introduced Graph.

etrain commented 8 years ago

I am still working my way through this PR and won't look at your responses tonight - so don't feel like you need to post them real-time.

My meta-takeaway from this so far is that

  1. There are a bunch of new classes being introduced, and I don't have a clear sense of the execution flow through these new classes - I don't know if it's a written description or a box and pointers diagram or what - but as it stands now this code feels like something that fits in your head but isn't self-explanatory. It doesn't have to be self-explanatory, of course, but it should be something you can explain in a small document or ideally on a napkin :)
  2. Let's be careful about changing our user-facing APIs too much. They're simple and that's a good thing. Changing method names that we expose to users when simply creating a new method name for the internal behavior we want to support is probably a better approach.
  3. I think there's a decent amount of confusion/complexity coming in because we say that pipelines take RDDs and emit PipelineDatasetOut. Instead if we're going to introduce yet another wrapper for a data set, we should at least have the API consistently work with these things both as input and output. I'm still not convinced we need it, but it will at least make the API more consistent and easier to understand and may solve a bunch of the naming/override problems we're seeing.