amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
470 stars 117 forks source link

Kernel solver #234

Closed Vaishaal closed 8 years ago

Vaishaal commented 8 years ago

@shivaram let me know if this looks correct. I'm storing the full model in memory and broadcasting, do you know of a more effective way to do this.

shivaram commented 8 years ago

cc @stephentu @rolloff

etrain commented 8 years ago

One thing i would like to try and get right here is interfaces. E.g. The sparseLinearKernel is of type (Array[Long], Array[Double]) when probably it should be of type SparseVector? I am not sure we are (ever) going to be running 2e9 by 2e9 kernel solves, so maybe this is OK? Also, is the right return value for, e.g. generateKernelTestBlock actually RDD[DenseVector[Double] or is it a RowPartitionedMatrix? In the only use of that function in this pull request, the first thing we do is convert each partition to a DenseMatrix, which could probably be handled locally more efficiently.

@shivaram brings up a point about whether the KernelGenerator should be a Transformer. It would be good to figure out how to fit this into the Estimator/Transformer framework. The fact that there's a separate method for Train/Test indicates to me that maybe we should have one method that takes Train and is an Estimator that when fit returns a Transformer that can be applied to test. I haven't thought through the workflow of these things as closely as you guys, but hopefully that provides some guidance. Taking a pass over the rest of the code now.

etrain commented 8 years ago

Alright - I took a pass with a bunch of style/structure comments. Before you do a ton of coding it might make sense to just sketch out the interface changes you're thinking about and we can talk about them here?

Also, from a maintainability perspective, it would be really good to get a reference to a book chapter or paper that describes exactly the problem you're solving in the scaladoc and then be consistent with those variable names, etc. For example, I looked at 7.4 of Murphy's book to page this into my brain this AM. For the main algorithm - I guess the BCD kernel paper is on arXiv now? http://arxiv.org/pdf/1602.05310v1.pdf

etrain commented 8 years ago

Ok - for right now I'm just focused on the KernelMatrix API/implementation because I think the other changes we talked about yesterday are still to come. Modulo my comment about zipWithIndex and some style nits I am basically cool with this.

Vaishaal commented 8 years ago

If this looks good I'll start plumbing the linear system stuff and then we should be done?

Vaishaal commented 8 years ago

ping

shivaram commented 8 years ago

Sorry for the delay - I'll get to this after May 20

Vaishaal commented 8 years ago

sounds good!

On Mon, May 16, 2016 at 12:02 PM, Shivaram Venkataraman < notifications@github.com> wrote:

Sorry for the delay - I'll get to this after May 20

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/amplab/keystone/pull/234#issuecomment-219515123

etrain commented 8 years ago

Alright - closing this since we got this functionality in #284.