Closed: ahwillia closed this issue 8 years ago.
Prox would be very useful indeed! The penalties here were initially ported from OnlineStats, but adapted a bit. If I recall correctly, the approach taken there for prox did not completely align with the design decisions here, so I removed them.
Concerning the place to put optimization routines: I vote we keep those separate from LearnBase. @tbreloff, what is your take on this?
Concerning your corollaries: we settled on a simplified naming convention (since every subfield names and denotes everything completely differently).
`ModelLoss` covers things like a `HingeLoss` or an `L2Loss`, for example. These are just functions of a prediction `output` and a true label `target`. The losses themselves are only defined for real numbers, which means they just define a `deriv` function. Vector input has a general `grad` implementation here that delegates to each element. There is no risk model in sight, so a loss knows nothing about a prediction function and thus nothing about any derivative of a prediction function.

`ParameterLoss` would be our name for a penalty, which is just a function of the coefficient vector. The `grad` is done here as well; again, no risk model needed.

So I am not sure what you mean by "tied to risk models". The risk stuff is completely separate and is mainly there to tie the model loss and the prediction function together efficiently. The risk stuff uses the losses, but not the other way around. It was very important to me that the losses can be used on their own.
To corollary 2: I am not sure, but I think having a distinct penalty type for the combination of two penalties is a decent solution (like https://github.com/Evizero/LearnBase.jl/blob/master/src/loss/params.jl#L76).
I think prox methods definitely belong here alongside the penalties, but an ADMM implementation belongs somewhere else (at least for now). It's much easier to add things than remove them.
While not true in general, the prox operators I use can be performed element-wise (some algorithms need this).
https://github.com/joshday/SparseRegression.jl/blob/master/src/penalty.jl
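To illustrate the element-wise point (a Python sketch with hypothetical names; not the SparseRegression code): for a separable penalty such as L1, the vector prox decomposes into independent scalar proxes, which for L1 is the soft-thresholding operator.

```python
# Illustrative sketch: the prox of a separable penalty decomposes
# element-wise. Shown for the L1 penalty, whose scalar prox is
# soft-thresholding. (Hypothetical names, not SparseRegression's API.)

def soft_threshold(x, t):
    # Scalar prox of t * |x|: shrink x toward zero by t.
    m = abs(x) - t
    return 0.0 if m <= 0 else m * (1.0 if x > 0 else -1.0)

def prox_l1(x0, rho):
    # Vector prox of rho * ||x||_1, computed one element at a time.
    return [soft_threshold(xi, rho) for xi in x0]

print(prox_l1([3.0, -0.5, 1.0], 1.0))  # [2.0, 0.0, 0.0]
```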
Alright, following up from https://github.com/JuliaML/Roadmap.jl/issues/8: this package now focuses on loss functions. I am not very educated on prox, but as far as I know they are usually part of the penalties, not the losses. So I guess this part is also outsourced to wherever the penalties end up.
They can be both. For example, a prox is useful if the loss function is not smooth (e.g. minimizing `abs(y - yhat)` as a form of robust regression). This can stay closed though. I'll open pull requests where I see fit.
I see. Sure, please do.
I like the idea of having a standard bank of loss functions and parameter penalties. One thing that would be very useful to compute for all instances of this would be proximal mappings. These form the basis for a large class of optimization algorithms for non-smooth functions (authoritative review here: http://stanford.edu/~boyd/papers/prox_algs.html).
The definition of the proximal operator/mapping is `prox(f, x0, rho) = argmin_x [ f(x) + (1 / (2 * rho)) * norm(x - x0)^2 ]`, where `f` is the function (typically a penalty or loss), `x0` is the current parameter guess, and `rho` tunes the step size. If `rho` is small (near zero), then the parameter update (`x - x0`) is in the direction of the negative gradient -- i.e. the proximal mapping performs gradient descent (albeit with small step sizes). See the review above for more intuition.

To start, I would propose two new functions `prox` and `prox!` for loss and penalty functions. For the L1 penalty, I have some rough code here, which I am happy to port over: https://github.com/ahwillia/ProxAlgs.jl
I also have some optimization routines implemented in that package, like ADMM. I could port those over as well, but as I brought up in https://github.com/Evizero/LearnBase.jl/issues/22 -- I'm still a bit conflicted over whether we should be fleshing out full optimization routines in this package.
Corollary 1: The `grad` and `grad!` functions are a bit tough for me to parse by just perusing the source code. Is there a reason for these to be so closely tied to risk models? I think it makes sense for `prox` and `prox!` to mirror how we calculate gradients.

Corollary 2: Have we thought about how to represent objective functions with multiple penalties? For example, I have to implement different prox operators for `L1`, `L2`, and `ElasticNet`, even though the elastic net is just a linear combination of the `L1` and `L2` penalties. In other words, `prox(f + g, x0, r) != prox(f, x0, r) + prox(g, x0, r)`.