JuliaStats / SVM.jl

SVMs for Julia

Use a more general framework for primal SVM #8

Open BigCrunsh opened 10 years ago

BigCrunsh commented 10 years ago

I only had a quick look at your code. It looks like you are implementing a specific algorithm for the primal solution of the SVM. In general, the primal SVM is an instance of what I would call "regularized empirical risk minimization": depending on the loss function and the regularizer, the same formulation also captures logistic regression and linear regression. There is also a generalization for the dual solution. Does it make sense to build a solver for this general type and then inherit from this structure?
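To make the idea concrete, here is a minimal sketch (hypothetical names, not code from SVM.jl) of one regularized-ERM objective covering several models by swapping the loss:

```julia
using LinearAlgebra

# Hypothetical sketch -- one objective, several models, depending on the loss.
hinge(z, y)    = max(0.0, 1.0 - y * z)   # SVM
logistic(z, y) = log1p(exp(-y * z))      # logistic regression
squared(z, y)  = 0.5 * (z - y)^2         # linear (ridge) regression

# Regularized empirical risk for a linear predictor z = w'x, with an l2 term.
function erm_objective(loss, w, X, y, λ)
    n = size(X, 1)
    risk = sum(loss(dot(X[i, :], w), y[i]) for i in 1:n) / n
    return risk + 0.5 * λ * sum(abs2, w)
end
```

Under these assumptions, `erm_objective(hinge, w, X, y, λ)` is the primal SVM objective, and passing `logistic` instead gives regularized logistic regression.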

johnmyleswhite commented 10 years ago

Regularized empirical risk minimization describes a lot of supervised learning, including many models outside of the linear predictor-based models you mentioned. I'm a little unsure what the abstract solver would do. What did you have in mind?

BigCrunsh commented 10 years ago

Well, I mean Pegasos is only an instance of a stochastic gradient descent method (aka the Robbins-Monro algorithm). That is applicable to any objective that can be seen as an (empirical) expectation, meaning a sum over loss functions such as those mentioned above. So, I would expect an implementation of the Robbins-Monro algorithm (in JuliaOpt?) that optimises the objective using sampled partial gradients. Based on that, a regularized empirical risk minimisation (ERM) wrapper could be initialised with the data, the loss function, the regulariser (and their gradients), and a solver, which could be, e.g., stochastic gradient descent, gradient descent with line search, or another solver (perhaps restricted to convex and differentiable functions). Then, an SVM simply instantiates ERM with l2 regularisation and the hinge loss. To leverage the specific form further, one could also think of a general dual formulation. What do you think? I assume such a framework would make it easier to implement further supervised learning algorithms in a consistent way (LogReg, Lasso, linear regression, ... for two or multiple classes).
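A rough sketch of that layering (hypothetical names; not an actual JuliaOpt or RegERMs.jl API): a generic stochastic (sub)gradient solver that only ever sees sampled gradients, with the SVM arising from one particular loss/regularizer pair:

```julia
using LinearAlgebra, Random

# Generic Robbins-Monro / SGD loop: `subgrad(w, i)` returns a sampled
# (sub)gradient of the i-th summand. The solver knows nothing about SVMs.
function sgd(subgrad, w0, n; epochs = 10, η0 = 0.1)
    w, t = copy(w0), 0
    for _ in 1:epochs, i in shuffle(1:n)
        t += 1
        w .-= (η0 / sqrt(t)) .* subgrad(w, i)
    end
    return w
end

# The SVM is then just ERM with hinge loss + l2 regularization:
# a Pegasos-style subgradient of the i-th term.
function svm_subgrad(w, X, y, λ, i)
    xi, yi = X[i, :], y[i]
    g = λ .* w                                # gradient of the regularizer
    yi * dot(w, xi) < 1 && (g .-= yi .* xi)   # hinge part if the margin is violated
    return g
end

# w = sgd((w, i) -> svm_subgrad(w, X, y, λ, i), zeros(size(X, 2)), size(X, 1))
```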

johnmyleswhite commented 9 years ago

If you're still interested in a general SGD tool, I think it could be quite useful. I've done a little work on this in the past, but it's very difficult to get right in general: you need a lot of indirection to handle all possible cases. At some point, I'll put my most recent attempt online.

BigCrunsh commented 9 years ago

Yes. But I have to think about it a bit more carefully - hopefully next weekend.

BigCrunsh commented 9 years ago

Hi, I put my attempt online. You can find it here: https://github.com/BigCrunsh/RegERMs.jl.

jumutc commented 9 years ago

I would suggest using some other techniques for learning primal/dual SVMs as well, like the modified Frank-Wolfe algorithm, etc. In our lab we are working on this as well, so it would be interesting to contribute!

johnmyleswhite commented 9 years ago

Having other solvers would be great. Please do submit PRs.

BigCrunsh commented 9 years ago

@jumutc: Does it make sense to consider this in an even more general framework, too? The work on the regularized empirical risk framework (https://github.com/JuliaStats/RegERMs.jl) has been dormant for some time. I am wondering if we should recommence the discussion and use the SVM as a kind of kick-off. @johnmyleswhite: What do you think? Too early? Not worth it?

jumutc commented 9 years ago

Dear @BigCrunsh and @johnmyleswhite,

I have noticed several very important issues with the RegERMs.jl and SVM.jl packages:

1) In regularized ERM learning there are other algorithms, like Regularized Dual Averaging (RDA), which would be difficult to fit in because their optimization techniques differ greatly from SGD or L-BFGS and they come with an additional pool of hyperparameters. For instance, the optimize method is tailored towards only one hyperparameter, λ.

2) I do not see any support for cross-validation or hyperparameter tuning in RegERMs.jl and SVM.jl. These points are crucial for acceptance among practitioners and researchers from fields other than machine learning and applied mathematics.

3) When we work in the primal, our w::Vector represents only the input space. For many ML problems it is often necessary to go into a kernel-induced feature space where we can find a better solution. It would be nice to use the Nyström method to approximate such a feature space with kernels while still applying regularized ERM techniques (see the sketch below).

In our lab we have implemented most of the above-mentioned "features" in Julia, and if you are interested we can try to merge our efforts!
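A minimal sketch of the Nyström idea from point (3), under stated assumptions (hypothetical names; not code from either package): approximate the kernel feature space with m landmark points, then run any primal ERM solver on the mapped features:

```julia
using LinearAlgebra

# Hypothetical RBF kernel; γ is a made-up bandwidth parameter.
rbf(x, z; γ = 1.0) = exp(-γ * sum(abs2, x - z))

# X is n×d; the landmarks L are m×d (e.g. a random subsample of X).
# Returns an n×m feature matrix Φ with Φ * Φ' ≈ the full kernel matrix,
# so a primal regularized-ERM solver can run on Φ directly.
function nystrom(X, L; γ = 1.0)
    n, m = size(X, 1), size(L, 1)
    Kmm = [rbf(L[i, :], L[j, :]; γ = γ) for i in 1:m, j in 1:m]
    Knm = [rbf(X[i, :], L[j, :]; γ = γ) for i in 1:n, j in 1:m]
    F = sqrt(Symmetric(Kmm + 1e-10I))   # K_mm^{1/2}, jittered for stability
    return Knm / F                      # Φ = K_nm * K_mm^{-1/2}
end
```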

BigCrunsh commented 9 years ago

@jumutc: yes, definitely. Some comments:

1) Yes, the optimize method is currently specific to SGD and L-BFGS. However, I think it makes sense to generalize or extend it to dual or hybrid methods.

2) SVM.jl and RegERMs.jl are the learning algorithms; RegERMs.jl is designed to be a generalized formulation of a class of optimization problems. General machine learning functionality like CV might rather be captured by MLBase.jl; we are working on unifying the interfaces.

3) I started with a Mercer map for that issue too. I removed it for now for clarity, but again: I would be happy to add it back!

@jumutc: Does that sound reasonable to you?

jumutc commented 9 years ago

@BigCrunsh: yes, for sure

I was mainly working on different SGD- and RDA-based approaches, and I encountered several issues while implementing them in Julia. I would like to suggest the following:

1) Having SVM.jl and RegERMs.jl apart from each other doesn't make much sense; we should think of integrating them into one package.

2) I can contribute l1-RDA and even l0-inducing norms to RegERMs.jl; the code is maintained in our private repo, but it would be nice to contribute it!

3) We need to revise the API in order to make ERM-based learning open to other approaches and optimization techniques.

4) We need to include proper support for sparse matrices in the SGD- and RDA-based solvers (see the sketch below) to avoid the computational burden in tight loops that I encountered in https://github.com/JuliaLang/julia/issues/7157 ... some problems may have >1M dimensions, but many of them are very sparse in nature.

5) I think Mercer- and Nyström-based approximations are worth a separate package.

6) I think it is difficult for a non-ML practitioner or researcher to grasp how to use all this variety of packages in JuliaStats, so we might need something like a general-purpose "façade" wrapper package to make life easier!

@BigCrunsh and @johnmyleswhite: what do you think?
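To illustrate point (4), a hypothetical sketch of the kind of sparsity-aware inner loop meant here (made-up function name, not an API from either package): the update touches only the stored nonzeros of one column of a SparseMatrixCSC instead of all >1M dimensions:

```julia
using SparseArrays

# One hinge-loss subgradient step for example j, where X is d×n sparse
# with examples stored as columns. Only the nonzeros of x_j are touched.
function sparse_sgd_step!(w, X::SparseMatrixCSC, y, j, λ, η)
    rows, vals = rowvals(X), nonzeros(X)
    s = 0.0
    for k in nzrange(X, j)        # margin w'x_j over nonzeros only
        s += w[rows[k]] * vals[k]
    end
    w .*= 1 - η * λ               # l2 shrinkage (dense here; can be made lazy)
    if y[j] * s < 1               # hinge subgradient if the margin is violated
        for k in nzrange(X, j)
            w[rows[k]] += η * y[j] * vals[k]
        end
    end
    return w
end
```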

johnmyleswhite commented 9 years ago

I'm still trying to take a typing break, so this won't be as thorough as I'd like.

(1) I think all of this material needs to be unified over time. My preference has been to pick small problems and get them right, because it's really hard to get the big problems right from the start.

(2) I think the main challenge is to design interfaces that apply very generically and use multiple dispatch for specialization. For instance, you want something like svm(X, y) that gets specialized for every variant you care about: X dense or sparse, y dense or sparse, X with null values, ....
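For example (hypothetical signatures, not the current SVM.jl API), a single svm(X, y) entry point could route to specialized code paths via dispatch:

```julia
using SparseArrays

# One generic entry point; Julia's multiple dispatch picks the most
# specific method for the input types.
svm(X::AbstractMatrix, y::AbstractVector; λ = 0.1) = _svm_generic(X, y, λ)
svm(X::SparseMatrixCSC, y::AbstractVector; λ = 0.1) = _svm_sparse(X, y, λ)

# Placeholder bodies; real solvers would go here.
_svm_generic(X, y, λ) = error("generic dense solver goes here")
_svm_sparse(X, y, λ)  = error("sparsity-aware solver goes here")
```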

(3) There's potentially a high cost to abstraction if inlining fails, so jumping to abstractions at the start is very hard to get right while keeping performance high.

FWIW, I think the Distributions package is our best current example of how to do things: it does everything in terms of a standardized interface plus fallbacks that handle problems when there's no custom solution. I tend to think of generic SGD on the optimization problem as this fallback in machine learning: in most problems, you have additional structure you can exploit that a generic interface can't know about.
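A minimal sketch of that "interface + fallbacks" pattern, assuming made-up model types (none of these names come from Distributions.jl or SVM.jl):

```julia
using LinearAlgebra

abstract type Model end
struct LinearSVM       <: Model; λ::Float64; end
struct RidgeRegression <: Model; λ::Float64; end

# Generic fallback: any model without extra exploitable structure
# goes through a generic SGD solver.
fit(m::Model, X, y) = fit_by_sgd(m, X, y)

# Ridge regression has a closed-form solution, so it overrides the fallback.
fit(m::RidgeRegression, X, y) = (X'X + m.λ * I) \ (X'y)

fit_by_sgd(m, X, y) = error("generic SGD solver goes here")
```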

Sorry for not having anything deeper to say.

BigCrunsh commented 9 years ago

I agree with @johnmyleswhite; unification is an ongoing and expensive process. However, that shouldn't stop you / us from tackling the points you (@jumutc) mentioned. Specifically:

1) Yes, long-term this makes total sense, but it is probably not the first point to start with. RegERMs.jl is (for now) a place for finding more general interfaces and building the connections to the optimization side. Once that is sufficiently done, SVM.jl is certainly "just" a special instance and we can think about the integration.

2) Awesome! Let's move the discussion about RDA to RegERMs.jl; I am sure we will find a good place for that ;)

3) Sure, whatever is necessary and justified.

4) I agree with @johnmyleswhite; the interfaces should be generic and applicable to both. I am not sure if this package is the right abstraction layer for this purpose (see also https://github.com/JuliaStats/Roadmap.jl/issues/14).

5) Yes, I would be happy to reactivate my Mercer-map code.

6) See 4).

Looking forward to your contributions!

jumutc commented 9 years ago

Hey guys, I have created a couple of issues... please consider: https://github.com/JuliaStats/RegERMs.jl/issues/10 and https://github.com/JuliaStats/RegERMs.jl/issues/11