Closed: ahwillia closed this issue 8 years ago.
Prox would be very useful indeed! The penalties here were initially ported from OnlineStats, but adapted a bit. If I recall correctly, the approach taken there for prox did not completely align with the design decisions here, so I removed them.
Concerning the place to put optimization routines: I vote we keep those separate from LearnBase. @tbreloff, what is your take on this?
Concerning your corollaries: we settled on a simplified naming convention (since every subfield names and denotes everything completely differently).
`ModelLoss` covers things like a `HingeLoss` or an `L2Loss`, for example. These are just functions of a prediction `output` and a true label `target`. The losses themselves are only defined for real numbers, which means they just define a `deriv` function. Vector input has a general `grad` implementation here that delegates to each element. There is no risk model in sight, so a loss knows nothing about a prediction function and thus nothing about any derivative of a prediction function.

`ParameterLoss` would be our name for a penalty, which is just a function of the coefficient vector. The `grad` is done here as well; again, no risk model needed.

So I am not sure what you mean by "tied to risk models". The risk stuff is completely separate and is mainly there to tie the model loss and the prediction function together efficiently. The risk stuff uses the losses, but not the other way around. It was very important to me that the losses can be used on their own.
To corollary 2: I am not sure, but I think having a distinct penalty type for the combination of two penalties is a decent solution (like https://github.com/Evizero/LearnBase.jl/blob/master/src/loss/params.jl#L76).
I think prox methods definitely belong here alongside the penalties, but an ADMM implementation belongs somewhere else (at least for now). It's much easier to add things than remove them.
While not true in general, the prox operators I use can be performed element-wise (some algorithms need this).
https://github.com/joshday/SparseRegression.jl/blob/master/src/penalty.jl
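To illustrate the element-wise point (a Python sketch with hypothetical names; not the SparseRegression code): for a separable penalty such as L1, the vector prox decomposes into independent scalar proxes, which for L1 is the soft-thresholding operator.

```python
# Illustrative sketch: the prox of a separable penalty decomposes
# element-wise. Shown for the L1 penalty, whose scalar prox is
# soft-thresholding. (Hypothetical names, not SparseRegression's API.)

def soft_threshold(x, t):
    # Scalar prox of t * |x|: shrink x toward zero by t.
    m = abs(x) - t
    return 0.0 if m <= 0 else m * (1.0 if x > 0 else -1.0)

def prox_l1(x0, rho):
    # Vector prox of rho * ||x||_1, computed one element at a time.
    return [soft_threshold(xi, rho) for xi in x0]

print(prox_l1([3.0, -0.5, 1.0], 1.0))  # [2.0, 0.0, 0.0]
```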
Alright, following up from https://github.com/JuliaML/Roadmap.jl/issues/8: this package now focuses on loss functions. I am not very educated on prox, but as far as I know they are usually part of the penalties, not the losses. So I guess this part is also outsourced to wherever the penalties end up.
They can be both. For example, a prox is useful if the loss function is not smooth (e.g. minimizing `abs(y - yhat)` as a form of robust regression). This can stay closed though. I'll open pull requests where I see fit.
I see. Sure, please do.
I like the idea of having a standard bank of loss functions and parameter penalties. One thing that would be very useful to compute for all instances of this would be proximal mappings. These form the basis for a large class of optimization algorithms for non-smooth functions (authoritative review here: http://stanford.edu/~boyd/papers/prox_algs.html).
The definition of the proximal operator/mapping is `prox(f, x0, rho) = argmin_x [ f(x) + (1 / (2 * rho)) * norm(x - x0)^2 ]`, where `f` is the function (typically a penalty or loss), `x0` is the current parameter guess, and `rho` tunes the step size. If `rho` is small (near zero), then the parameter update (`x - x0`) is in the direction of the negative gradient -- i.e. the proximal mapping performs gradient descent (albeit with small step sizes). See the review above for more intuition.

To start, I would propose two new functions `prox` and `prox!` for loss and penalty functions. For the L1 penalty, I have some rough code here, which I am happy to port over: https://github.com/ahwillia/ProxAlgs.jl
I also have some optimization routines implemented in that package, like ADMM. I could port those over as well, but as I brought up in https://github.com/Evizero/LearnBase.jl/issues/22 -- I'm still a bit conflicted over whether we should be fleshing out full optimization routines in this package.
Corollary 1: The `grad` and `grad!` functions are a bit tough for me to parse by just perusing the source code. Is there a reason for these to be so closely tied to risk models? I think it makes sense for `prox` and `prox!` to mirror how we calculate gradients.

Corollary 2: Have we thought about how to represent objective functions with multiple penalties? For example, I have to implement different prox operators for `L1`, `L2`, and `ElasticNet`, even though the elastic net is just a linear combination of the `L1` and `L2` penalties. In other words, `prox(f + g, x0, r) != prox(f, x0, r) + prox(g, x0, r)`.