amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Kernel Generators and Solver #284

Closed: shivaram closed this 8 years ago

shivaram commented 8 years ago

This PR adds kernel generators, a ridge regression solver and a kernel block model to Keystone. It also includes a CIFAR pipeline that shows how it can be used.

At a high level the user-facing design is as follows:

Internally this works as follows:

This was originally developed with the help of @stephentu and @rolloff, and most of the code was ported from #234, written by @Vaishaal.
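
An in-memory sketch of the pieces described above may help: a kernel generator produces pairwise similarities, and the ridge regression solver works in the dual by solving (K + lambda * I) alpha = y. The class names and exact API below are illustrative assumptions, not the classes introduced in this PR, which performs the same computation block-wise over RDDs.

```scala
import breeze.linalg._

// Kernel generators produce pairwise similarities between feature vectors.
trait KernelGenerator {
  def apply(x: DenseVector[Double], y: DenseVector[Double]): Double
}

// Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
class GaussianKernel(gamma: Double) extends KernelGenerator {
  def apply(x: DenseVector[Double], y: DenseVector[Double]): Double = {
    val d = x - y
    math.exp(-gamma * (d dot d))
  }
}

// Kernel ridge regression solved in the dual: (K + lambda * I) alpha = y.
// Prediction on a new point is a weighted sum of kernel evaluations
// against the training points.
class KernelRidgeRegression(kernel: KernelGenerator, lambda: Double) {
  def fit(xs: Array[DenseVector[Double]], y: DenseVector[Double]): DenseVector[Double] => Double = {
    val n = xs.length
    val k = DenseMatrix.tabulate(n, n) { (i, j) => kernel(xs(i), xs(j)) }
    val alpha = (k + DenseMatrix.eye[Double](n) * lambda) \ y
    (x: DenseVector[Double]) => (0 until n).map(i => alpha(i) * kernel(xs(i), x)).sum
  }
}
```

For a dataset like CIFAR the full kernel matrix is far too large to materialize at once, which is why the PR works block by block over the kernel matrix instead.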

shivaram commented 8 years ago

Note that while the code compiles and the unit test passes, I still haven't run the CIFAR pipeline on a cluster. I'll do that later today or tomorrow, but I wanted to get the design feedback going before that. I also plan to add an augmented CIFAR kernel pipeline, so we can match the numbers in http://arxiv.org/abs/1602.05310.

Vaishaal commented 8 years ago

Are you going to add the unit tests from the previous PR?

shivaram commented 8 years ago

Ah, sorry, I forgot to commit it -- I just copied the unit test you wrote. It passes locally.

shivaram commented 8 years ago

Also, one more note: the kernel classes are templatized so we can handle other input types. For example, the Yelp workload used a SparseVector with a linear kernel, and we should be able to handle that directly now.
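
For illustration only (the trait and class names here are assumptions, not the PR's actual signatures), the kind of thing being described is a kernel that is generic in its input type, so a linear kernel over sparse vectors drops in without converting to dense:

```scala
import breeze.linalg.SparseVector

// A kernel that is generic in its input type T.
trait Kernel[T] {
  def apply(x: T, y: T): Double
}

// Linear kernel over sparse inputs (e.g. bag-of-words features),
// computed directly on the sparse representation.
class LinearKernel extends Kernel[SparseVector[Double]] {
  def apply(x: SparseVector[Double], y: SparseVector[Double]): Double = x dot y
}
```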

Vaishaal commented 8 years ago

A couple of high-level things:

Would it be easy to fit Nystroem into this framework?

shivaram commented 8 years ago

@Vaishaal

  1. Good point. I added a comment in the class documentation for KRR that we solve this in the dual.
  2. The RBF definition we are using seems pretty standard from what I see elsewhere (http://scikit-learn.org/stable/modules/metrics.html#rbf-kernel). Is Figure 1 from http://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf the best definition of the mapping? We can make a note of this somewhere in the docs.
  3. I think the Nystrom solver should fit pretty well in this framework. It can still take a KernelMatrix as the input and then just sample the necessary column blocks (see the sketch after this list). The main problem is that we might need to write another model class that does the same sampling on the test blocks as well. I'll try to see if we can make this more general, though.
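
A rough sketch of point 3, using the standard subset-of-regressors formulation of Nystrom kernel ridge regression (the formulation and the function name are assumptions for illustration, not code from this PR): only the kernel columns for m sampled landmark points are ever computed, and the model keeps the landmarks plus their coefficients.

```scala
import breeze.linalg._
import scala.util.Random

// Nystrom-style KRR: sample m landmark points, build only the n x m and
// m x m kernel blocks, and solve (K_mn K_nm + lambda * K_mm) beta = K_mn y.
def nystromKRR(kernel: (DenseVector[Double], DenseVector[Double]) => Double,
               xs: Array[DenseVector[Double]], y: DenseVector[Double],
               m: Int, lambda: Double, seed: Int = 0): DenseVector[Double] => Double = {
  val landmarks = new Random(seed).shuffle(xs.toList).take(m).toArray
  val knm = DenseMatrix.tabulate(xs.length, m) { (i, j) => kernel(xs(i), landmarks(j)) }
  val kmm = DenseMatrix.tabulate(m, m) { (i, j) => kernel(landmarks(i), landmarks(j)) }
  val beta = (knm.t * knm + kmm * lambda) \ (knm.t * y)
  // Prediction only needs kernel evaluations against the landmarks, so the
  // same sampling has to happen on the test blocks too (the "other model
  // class" mentioned above).
  (x: DenseVector[Double]) => (0 until m).map(j => beta(j) * kernel(landmarks(j), x)).sum
}
```
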
Vaishaal commented 8 years ago

@stephentu had some comments about this. I think it's weird because Nystroem is solved in the primal, IIRC? I might be wrong...

And yes that figure is the best mapping.

shivaram commented 8 years ago

Yeah, Nystrom is solved in the primal. But the KernelMatrix, which is this lazy intermediate data structure, should remain the same across primal and dual solvers.

Also, ideally we should be able to use the same block linear mapper that we use for RandomCosines, but since the kernel transformer is not a Keystone transformer we need this other class -- this is something we can try to generalize better.
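
A rough sketch of the lazy structure being described (the names are illustrative, not the PR's exact interface): solvers request column blocks of the kernel matrix, blocks are computed and cached only on demand, and the same abstraction can back either a dual solver that touches every block or a Nystrom-style solver that samples a few.

```scala
import breeze.linalg.DenseMatrix

// Lazy kernel matrix: nothing is computed until a solver asks for a block.
trait KernelMatrix {
  /** Kernel evaluations of all training points against the given columns. */
  def apply(colIdxs: Seq[Int]): DenseMatrix[Double]
  /** Release any cached state for these columns once the solver is done. */
  def unpersist(colIdxs: Seq[Int]): Unit
}
```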

shivaram commented 8 years ago

@etrain This is worth another look when you get a chance. From testing on a single machine, I get the same test error (around 20%) as the older code for unaugmented CIFAR. On the performance side, I've made a bunch of changes that bring it pretty close to what we had before.

However, the kernel matrix API is now a bit trickier to use (callers need to call unpersist after they are done with a block). I think this is a reasonable trade-off given that these classes are internal to Keystone and the user-facing API is still straightforward to use, but any ideas to improve the API are welcome.
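
To make the trade-off concrete, the block-at-a-time pattern being described looks roughly like the following (a sketch against the KernelMatrix trait sketched earlier in this thread, not the PR's exact loop):

```scala
// Walk over column blocks of the kernel matrix, using each block to update
// the model and explicitly releasing it before moving on to the next one.
def solveBlockwise(train: KernelMatrix, numExamples: Int, blockSize: Int): Unit = {
  for (block <- (0 until numExamples).grouped(blockSize)) {
    val kBlock = train(block)   // materialize just this block of K
    // ... update the model using kBlock ...
    train.unpersist(block)      // callers must clean up the block themselves
  }
}
```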

etrain commented 8 years ago

This looks pretty good to me. I don't see any major changes I'd make. The state management stuff is annoying but manageable. We could think of a standard interface that lets things clean up after themselves. This KernelGenerator stuff is a weird corner case, though, and I don't know if this is the right place to start with it.

etrain commented 8 years ago

Alright - finally merging this! Thanks @shivaram @Vaishaal @stephentu @rolloff - this is a great new feature.