Note that while the code compiles and the unit test passes, I still haven't run the CIFAR pipeline on a cluster. I'll do that later today / tomorrow, but I wanted to get the design feedback going before that. I also plan to add a CIFAR augmented kernel pipeline so we can match the numbers in http://arxiv.org/abs/1602.05310
Are you going to add the unit tests from the previous PR?
Ah, sorry, I forgot to commit it - I just copied the unit test you wrote. It passes locally.
One more note: the kernel classes are templatized so we can handle other input types. For example, the Yelp workload used a SparseVector and a linear kernel, and we should be able to handle that directly now.
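To make the templating concrete, here is a minimal sketch of what type-parameterized kernels could look like; the trait and class names are illustrative, not the exact classes in this PR:

```scala
import breeze.linalg.{DenseVector, SparseVector}

// Hypothetical names: a kernel function parameterized on the input type T.
trait KernelFunction[T] extends Serializable {
  def apply(x: T, y: T): Double
}

// Gaussian (RBF) kernel on dense vectors: exp(-gamma * ||x - y||^2).
class GaussianKernel(gamma: Double) extends KernelFunction[DenseVector[Double]] {
  def apply(x: DenseVector[Double], y: DenseVector[Double]): Double = {
    val diff = x - y
    math.exp(-gamma * (diff dot diff))
  }
}

// Linear kernel on sparse vectors, as in the Yelp workload.
class LinearKernel extends KernelFunction[SparseVector[Double]] {
  def apply(x: SparseVector[Double], y: SparseVector[Double]): Double = x dot y
}
```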
A couple of high-level things:
Would it be easy to fit Nystroem into this framework?
@Vaishaal Nystroem should be able to take the `KernelMatrix` as the input and then just sample the necessary column blocks. The main problem is that we might need to write another model class that does the same sampling on the test blocks as well. I'll try to see if we can make this more general though.

@stephentu had some comments about this. I think it's weird because Nystroem is solved in the primal, IIRC? I might be wrong.
And yes that figure is the best mapping.
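For reference, the column sampling being discussed is the standard Nystroem approximation. A minimal local sketch on a materialized kernel matrix (ignoring the distributed block structure, with `pinv` standing in for a proper regularized solve; this is not the PR's code) might look like:

```scala
import breeze.linalg.{DenseMatrix, pinv}

// Nystroem: sample m columns S of the n x n kernel matrix K and
// approximate K ~= C * pinv(W) * C.t, where C = K(:, S) and W = K(S, S).
def nystroemApprox(K: DenseMatrix[Double], sampled: Seq[Int]): DenseMatrix[Double] = {
  val C = K(::, sampled).toDenseMatrix  // n x m column block
  val W = C(sampled, ::).toDenseMatrix  // m x m block: rows S of C give K(S, S)
  C * pinv(W) * C.t
}
```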
Yeah, Nystrom is solved in the primal. But the `KernelMatrix`, which is this lazy intermediate data structure, should remain the same across primal or dual solvers.
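To make the shared piece concrete, here is a sketch of what such a lazy block-column interface might look like; the trait and method names are guesses for illustration, not the PR's actual signatures:

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Hypothetical interface: the same lazy KernelMatrix could back
// both the dual kernel ridge solver and a primal Nystroem solver.
trait KernelMatrixLike {
  // Compute (and cache) the given block of columns of K,
  // row-partitioned as an RDD.
  def blockColumn(blockId: Int): RDD[DenseMatrix[Double]]

  // Release any cached state for a block once a solver is done with it.
  def unpersist(blockId: Int): Unit
}
```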
Also, ideally we should be able to use the same block linear mapper that we use for RandomCosines, but since the kernel transformer is not a Keystone transformer we need this other class -- this is something we can try to generalize better.
@etrain This is worth another look when you get a chance. From testing on a single machine, I get the same test error (around 20%) as the older code for unaugmented CIFAR. In terms of performance, I've made a bunch of changes that bring it pretty close to what we had before.
However, the kernel matrix API is now a bit trickier to use (users need to call `unpersist` after they are done with a block). I think this is a reasonable trade-off given that these classes are internal to Keystone and the user-facing API remains straightforward. But any ideas to improve the API are welcome.
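Concretely, the usage pattern described above looks something like the loop below, written against the hypothetical `KernelMatrixLike` trait sketched earlier; the point is that the caller, not the matrix, is responsible for releasing each block:

```scala
// Sketch of a solver sweep over column blocks. Each block is pulled
// lazily, used, and then explicitly released by the caller.
def sweep(kernelMat: KernelMatrixLike, numBlocks: Int): Unit = {
  for (b <- 0 until numBlocks) {
    val block = kernelMat.blockColumn(b)
    // ... update the model using this column block ...
    kernelMat.unpersist(b)  // the caller must release the block when done
  }
}
```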
This looks pretty good to me. I don't see any major changes I'd make. The state management stuff is annoying but manageable. We could think of a standard interface that lets things clean up after themselves. This KernelGenerator stuff is a weird corner case, though, and I don't know if this is the right place to start with it.
Alright - finally merging this! Thanks @shivaram @Vaishaal @stephentu @rolloff - this is a great new feature.
This PR adds kernel generators, a ridge regression solver and a kernel block model to Keystone. It also includes a CIFAR pipeline that shows how it can be used.
At a high level, the user-facing design is as follows:
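(A figure illustrated this in the original description.) As a rough stand-in, here is a hedged sketch of the intended flow; the constructor parameters and the exact class names, e.g. `GaussianKernelGenerator`, are assumptions rather than the merged API:

```scala
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Hypothetical user-facing flow: fit a kernel ridge model and apply it
// like any other Keystone transformer. All names below are assumptions.
def trainAndPredict(
    trainFeatures: RDD[DenseVector[Double]],
    trainLabels: RDD[DenseVector[Double]],
    testFeatures: RDD[DenseVector[Double]]): RDD[DenseVector[Double]] = {
  val krr = new KernelRidgeRegression(
    new GaussianKernelGenerator(5e-4),  // kernel generator; gamma chosen arbitrarily
    1e-5,                               // ridge regularization lambda
    4096                                // column block size for the solver
  )
  val model = krr.fit(trainFeatures, trainLabels)
  model(testFeatures)
}
```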
Internally this works as follows:
- `KernelGenerators` are fit during training to bind the trainData as one of the arguments to `K(x, y)`. This step produces a `KernelTransformer`. We need this because the transformer usually has only one argument, and it makes the train vs. test distinction clear.
- `KernelTransformers` can be applied to an RDD to generate a `KernelMatrix`. This is a wrapper class that lazily populates the kernel matrix and has a block-column API.
- `KernelMatrix` is used by a linear system solver that just solves `Kx = Y`. Right now this is part of the `KernelRidgeRegression` class, but it can be pulled out (a local sketch of this solve follows below).

This was originally developed with the help of @stephentu and @rolloff, and most of the code was ported from #234, written by @Vaishaal.
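For completeness, here is a minimal single-machine sketch of the solve at the heart of the solver, with the ridge term written explicitly; whether the regularizer is scaled by n or folded into `K` in the actual code is an assumption here:

```scala
import breeze.linalg.DenseMatrix

// Local (non-distributed) kernel ridge regression solve:
// find x such that (K + lambda * I) x = Y.
def solveLocal(K: DenseMatrix[Double],
               Y: DenseMatrix[Double],
               lambda: Double): DenseMatrix[Double] = {
  val reg = DenseMatrix.eye[Double](K.rows) * lambda
  (K + reg) \ Y  // Breeze's \ solves the linear system
}
```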