amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Should I switch from spark.ml.pipeline to keystone for a computer vision task? #151

Closed (jrabary closed 9 years ago)

jrabary commented 9 years ago

Dear all,

As someone working in large-scale computer vision, I find KeystoneML very interesting. I started implementing a computer vision pipeline myself, based on the AMP Camp 5 pipeline demo but using the spark.ml API. I now have some computer vision feature extractors that are subclasses of spark.ml.Transformer, and I use spark.ml.Pipeline to solve my problems. However, Keystone has many more interesting features than I have, and I'm thinking of switching to Keystone so that I don't reinvent the wheel. So my questions are: will these two projects be merged one day, and what are the main differences between Keystone and spark.ml? With spark.ml.Pipeline, one works directly with DataFrames. With Keystone, the pipeline transforms RDDs. Is that more efficient?

Jao

etrain commented 9 years ago

Thanks for stopping by @jrabary. This is probably a discussion better had on our users mailing list (keystoneml-users@groups.google.com).

In fact, a question like this one already came up and was addressed there: https://groups.google.com/forum/#!topic/keystoneml-users/HHaYgkJlDSM

There are no concrete plans to merge the two projects right now - KeystoneML is a research project, but we're very interested in making sure we can support common workloads. Ideas and/or code from KeystoneML may make their way into spark.ml over time.

In terms of efficiency - it is tough to say whether DataFrame is more or less efficient than native RDD operations. This will depend on a number of factors, including the native data representation under the covers and whether a pipeline can be optimized by the SparkSQL optimizer. We feel that the RDD-oriented transformations provide the ability to make bulk operations more efficient, and the type-safety provides compile-time assurances about whether a pipeline will actually run or not. That said - this is pretty hand-wavy and it would make more sense to talk about a concrete example if you have one.
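
To make the type-safety point a bit more concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the KeystoneML API: `Stage`, `Image`, `normalize`, `flatten`, and `l2Norm` are all hypothetical names invented for illustration, and `applyAll` works on a local `Seq` where a real system would take an RDD. The idea it shows is that when each stage carries its input and output types, chaining mismatched stages is rejected by the compiler, whereas with untyped columns a mismatch would typically only surface when the schema is checked at runtime.

```scala
// Hypothetical, simplified stand-in for a typed pipeline stage.
// NOT the KeystoneML API; all names here are illustrative assumptions.
trait Stage[A, B] extends Serializable { self =>
  // Transform a single item.
  def apply(in: A): B

  // Bulk version over a local collection; a real system would take an RDD[A].
  def applyAll(in: Seq[A]): Seq[B] = in.map(apply)

  // Chaining only compiles when this stage's output type B
  // matches the next stage's input type.
  def andThen[C](next: Stage[B, C]): Stage[A, C] = new Stage[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

object Stage {
  // Convenience constructor that wraps a plain function as a stage.
  def apply[A, B](f: A => B): Stage[A, B] = new Stage[A, B] {
    def apply(in: A): B = f(in)
  }
}

// Toy image type, assumed to hold pixel intensities in the range 0 to 255.
final case class Image(pixels: Array[Double], width: Int, height: Int)

object TypedPipelineSketch {
  // Scale pixel intensities from [0, 255] to [0, 1].
  val normalize: Stage[Image, Image] =
    Stage(img => img.copy(pixels = img.pixels.map(_ / 255.0)))

  // Flatten an image into a feature vector.
  val flatten: Stage[Image, Vector[Double]] =
    Stage(img => img.pixels.toVector)

  // Reduce a feature vector to its Euclidean (L2) norm.
  val l2Norm: Stage[Vector[Double], Double] =
    Stage(v => math.sqrt(v.map(x => x * x).sum))

  // Types line up: Image => Image => Vector[Double] => Double.
  val pipeline: Stage[Image, Double] = normalize andThen flatten andThen l2Norm

  // The following would be rejected at compile time, because
  // l2Norm produces a Double and normalize expects an Image:
  // val broken = l2Norm andThen normalize

  def main(args: Array[String]): Unit = {
    val img = Image(Array(255.0, 0.0, 0.0, 255.0), width = 2, height = 2)
    println(pipeline(img))
    println(pipeline.applyAll(Seq(img, img)))
  }
}
```

This sketch says nothing about performance. It only illustrates the compile-time check, and the efficiency question still depends on the concrete workload.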