Open BigCrunsh opened 10 years ago
My suggestions: we should have three abstract types (probably more, but these are basics): Loss
, Regularizer
, and RegressionProblem
:
An instance of Loss
should be just a functor, for example:
abstract Loss
type SquaredLoss <: Loss end
value(f::SquaredLoss, x::Real) = 0.5 * (x * x)
deriv(f::SquaredLoss, x::Real) = float64(x)
# sometimes, value and deriv can share part of computation
value_and_deriv(f::SquaredLoss, x::Real) = (z = float64(x); (z, 0.5 * (z * z))
The incorporation of input data and loss function should happen during regression. The loss function doesn't have to keep the data and response as fields.
I will talk about Regularizer
and RegressionProblem
later.
The computation of total loss and the gradient w.r.t. the linear coefficients can be readily implemented based on the value
and deriv
functions above, as
n = size(X, 2)
u = X'theta
s = 0.0
d = Array(n)
for i = 1:n
v, d[i] = value_and_gradient(u[i])
s += v
end
grad = X * d
Obviously, one can make this into a function, or adapt it to cases with an intercept.
Similarly, I think w
shouldn't be a field of a regularizer, which instead should be supplied as an argument to the function that evaluates the value and/or gradient.
I agree in principal with these abstract types. For the loss, however, I suggest to have a function that takes a decision value and a label, since there are differences between classification and "regression", e.g.,
value(l::SquaredLoss, f::Real, y::Real) = 0.5 * (f-y) * (f-y)
deriv(l::SquaredLoss, f::Real, y::Real) = float64(f-y)
value(l::LogisticLoss, f::Real, y::Real) = if f>34 -y*f else log(1+exp(-y*f)) end
deriv(l::LogisticLoss, f::Real, y::Real) = -y / (1 + exp(y*f))
Furthermore, I think we need a version that takes a vector of function values, since the loss, e.g., for multi-class logistic regression depends on all decision values.
Sure. The loss should take a predictor value and response value. That was my oversight.
For multi-class logistic, we may have another loss abstract type, maybe called MLoss
or MultivalueLoss
?
Yes, such loss functors should take vector inputs.
or just a different signature value(l::LogisticLoss, f::Real, y::Real)
and value(l::LogisticLoss, f::Vector, y::Real)
?
and perhaps value(l::LogisticLoss, f::Vector)
, which returns the losses for all possible classes.
I think we should have different kinds of loss functions depending on the forms of the input and output.
The most common type is those that takes a scalar predictor and compares it with a scalar response. However, as @BigCrunsh suggested, multi-class logistic regression works a little bit different.
The actually estimation algorithm would be different for these two types of losses. For example, the solution for the former should be a vector of length d
(or d + 1
if bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m)
or (d+1, m)
.
Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.
To be more specific, I think the API can be summarized as follows (taking into account @BigCrunsh's comments above):
abstract AbstractLoss
# for scalar -- scalar
abstract Loss <: AbstractLoss
# for multi-class cases
abstract MulticlassLoss <: AbstractLoss
# Specific examples of Loss (e.g. logistic)
type LogisticLoss <: Loss end
# evaluate the value of loss
value(f::LogisticLoss, x::Real, y::Real) = -log1p(exp(-y * x))
# evaluate derivative w.r.t. to x
deriv(f::SquaredLoss, x::Real, y::Real) = (e = exp(-y * x); e / (1 + e))
# fuse the computation of value and derivative
# this example shows that fused computation is more efficient
# when we want both the value and derivative
function value_and_deriv(f::SquaredLoss, x::Real, y::Real)
e = exp(- y * x)
return (-log1p(e), e / (1 + e))
end
# Example of MulticlassLoss
type MulticlassLogisticLoss <: MulticlassLoss end
function value(f::MulticlassLogisticLoss, x::AbstractVector{Real}, y::Integer)
es = exp(x)
log(es[y]) / sum(es)
end
# one can do deriv(f, x, 1:m) to obtain values for all classes jointly
function value(f::MulticlassLogisticLoss, x::AbstractVector{Real}, y::UnitRange)
es = exp(x)
log(es[y]) ./ sum(es)
end
# likewise, we may also define ``deriv`` and ``value_and_deriv``
Various generic functions can be written based on the loss functors:
# total loss and its gradient w.r.t. theta
function tloss_and_gradient(f::Loss,
X::AbstractMatrix,
y::AbstractVector,
theta::AbstractVector)
n = size(X, 2) # n is the number of samples in X
u = X'theta
s = 0.0
dv = zeros(d)
for i = 1:n
v, dv[i] = value_and_deriv(f, u, y[i])
s += v
end
return (s, X * dv) # X * dv is gradient w.r.t. theta
end
function tloss_and_gradient(f::MulticlassLoss,
X::AbstractMatrix,
y::AbstractVector, # a vector of integer labels
theta::AbstractMatrix) # note: theta should be a matrix, size (d, m)
n = size(X, 2)
U = theta'X # size (m, n), where m is the number of samples
D = zeros(m, n)
s = 0.0
for i = 1:n
v, dv = value_and_deriv(f, U[:,i], y[i])
s += v
D[:,i] = dv
end
return (s, A_mul_Bt(X, D))
end
This also demonstrates why we have to introduce Loss
and MulticlassLoss
.
The tloss_and_gradient
is not a great name, and I just use it for example.
Some modification to the details that may make the implementation more efficient.
First, we probably want a value_and_deriv!
for MulticlassLoss
to directly write the derives to a pre-allocated array:
function value_and_deriv!(dv::AbstractVector, f::MulticlassLoss, x::AbstractVector, y::Integer)
# ......
end
# generic for all subtypes of MulticlassLoss
value_and_deriv(f::MulticlassLoss, x::AbstractVector, y::Integer) =
value_and_deriv!(Array(Float64, length(x)), f, x, y)
With this, part of the tloss_and_gradient
can be made more efficient:
dv = zeros(m)
for i = 1:n
v = value_and_deriv!(dv, f, view(U,:,i), y[i])
s += v
D[:,i] = dv
end
This avoids repetitively allocating new vectors within the inner loop.
We can then introduce an abstract type RegressionSolver
to solve a regression analysis problem:
abstract RegressionSolver
type MySolver <: RegressionSolver
....
end
# any solver type should implement a `solve` method as below
function solve(solver::MySolver, f::Loss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
....
end
# a solver may (optionally) support other types of losses, for example:
function solve(solver::MySolver, f::MulticlassLoss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
....
end
I hope the several posts above clarify my thoughts about the API design.
Wouldn't the gradient require knowledge of some link function (maybe determined by RegressionProblem
) as well? In tloss_and_gradient
you're calculating the gradient as X * dv
, but in GLMs for example, isn't that only the case if you're using the canonical link?
In this level, we are now only considering loss function and regularization function.
The GLM related stuff like link function and response distribution is at a higher level.
For GLM, we have the loss function is - log p(y|x)
, which is not the link function itself.
After reading again, it looks like the argument f
in value(::Loss, f, y)
is expected to be linear in the coefficients. Is that correct? In that case, the gradient calculation makes sense.
If so, does this mean that a nonlinear regression and multilayer neural nets should not fit in this framework?
The current design is for such an optimization problem:
\sum_i f(\theta^T x; y) + r(\theta)
Here, the first argument to f
is a dot product between the parameter theta
and x
.
It is, however, possible to extend this framework to handle the following generalized cases
\sum_i f( \eta(\theta)^T x; y) + r(\theta)
by introducing an additional nonlinear transform from \theta
to \eta
. @lendle: I guess this is what you mean by nonlinear regression?
@lindahua By nonlinear regression I meant solving the optimization problem
\sum_i f(h(\theta, x); y) + r(\theta)
for some known, nonlinear h
as in nonlinear least squares.
I see. It is not difficult to generalize to such cases. I will follow up later.
I suppose the easiest way would be to have something like
abstract AbstractLoss
abstract Loss <: AbstractLoss
abstract LinLoss <: Loss #probably need a better name
type LogisticLoss <: LinLoss end
type SquaredLoss <: LinLoss end
and define a default tloss_and_gradient!
for LinLoss
based on value_and_deriv
defined for subtypes of LinLoss
. Other subtypes of Loss
that are not subtypes of LinLoss
will have to define their own tloss_and_gradient!
method in addition to value_and_deriv
.
Another option for names, since it's probably most common for the h
function above to be h(theta, x) = theta'x
:
abstract AbstractLoss
abstract ScalarLoss <: AbstractLoss
abstract Loss <: ScalarLoss
abstract NonLinLoss <: ScalarLoss
type LogisticLoss <: Loss end
type SquaredLoss <: Loss end
...
Then loss functions where h
is nonlinear can extend NonLinLoss
.
It might be better for tloss_and_gradient
to calculate the mean loss instead of sum, so that the loss is O(1) instead of O(n), and the amount of regularization does not depend on sample size.
Generally, the relative amount of regularization does depend on the sample size.
This can be easily seen from a probabilistic perspective. Consider an MAP problem, as
p(\theta) \prod_i p(y_i|x_i; \theta)
This can be turned into a sum:
\log p(\theta) + \sum_i log p(y_i|x_i; \theta)
I don't think it is correct to take the mean in the second term.
This makes a lot of sense. When you have only a small set of observations, you rely more on the prior, however, as more samples are available, the prior would play a relatively smaller roles.
That's true that you'd want the relative amount of regularization to decrease as your total sample size increases, and typically, you'd probably want to calculate the loss/gradient on the full data set.
In my mind, I guess I think of the optimization problem as the mean loss + regularization, and I'd think of decreasing the relative regularization by decreasing the penalty parameter as sample size increases. I'm probably in the minority there, and it doesn't really matter anyway most of the time.
A use case I was thinking of where it does matter is SGD/mini-batch GD, where you'd calculate the loss/gradient on a small fraction of the full data set, and you could think of that as an estimate of the loss/gradient calculated on the full data set. It's a lot simpler to think about loss + regularization as essentially the same whether you're calculating the loss on 1 observation or n. Hopefully I've articulated that clearly, let me know if not.
@lindahua What are your thoughts on that?
Edit: Maybe I'm not as much in the minority as I previously thought. R's glmnet, scikit-learn's ElasticNet and StatsModels' OLS.fit_regularized all minimize the mean loss + regularization.
The behavior as to whether to use the mean or sum can be specified by a keyword argument.
Oh! You guys are very quick. I am trying to order my thoughts:
Loss types
The actually estimation algorithm would be different for these two types of losses. For example, the solution for the former should be a vector of length d (or d + 1 if bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m) or (d+1, m).
Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.
The former is just a special case of the latter for m=1
, isn't it? I see no difference between the estimation procedures. I would like to keep it simple as possible. Could we start with just one common abstract loss type and, if it becomes necessary, make this distinction?
Total loss / Non-linear models / Link functions
As @lendle mentioned, in @lindahua's version tloss_and_gradient
is computing the loss over all examples for a linear model. I see no need to restrict ourselves to linear models and I would like to see that in a bit more general way. The general objective, which we aim to minimize, is
L[\theta] = \sum_i \ell(f_\theta(x_i), y_i) + \Omega[f_\theta],
for some model parameter vector \theta
, loss function \ell
and regularizer \Omega
. This formulation allows us also to incorporate linear models f_\theta(x)=f_w(x)=w*x
. Although non-linear models can be traced back to linear models via the mercer_map.jl
(as in the current implementation), the formulation above includes also the native dual model f_\theta(x)=f_\alpha(x)=\sum_i \alpha_i *k(x_i,x)
.
The general gradient is then
L'[\theta] = \sum_i f'_\theta(x_i)\ell'(f_\theta(x_i), y_i) + f'_\theta\Omega'[f_\theta],
where the gradients f', \ell',...
are taken w.r.t. to \theta
.
So, instead of having a vector theta, I suggest to have a type model here, which has to provide the functionality of a value and a gradient w.r.t. the model parameters. Then, the loss itself is independent of how the decision values are computed. This was the intend in the current implementation for model
.
I am not absolutely sure, but isn't a link function just reflected by a specific type of a loss function? If so, it is already captured, right @lendle.
SGD/mini-batch
I agree with @lindahua that we should keep the analogy of RegERMs to MAP estimation. And yes, @lendle, we also need the possibility to calculate stochastic gradients of the object for SGD/mini batches. But this isn't a problem (see e.g., RegERMs.jl (line 23)). We should have also a function like stochastic_loss_and_gradient
that computes a part of the gradient as:
\sum_{i \in S} \ell(f, x_i, y_i) + \frac{1}{|S|}\Omega[f],
where S
is the subset of all examples.
Log loss
Bit unrelated, but for later numerical stability, we should "approximate" the log loss log(f,x,y)
by -y*f
for f>34
and for the multi-class version (soft-max) subtract the maximum over all decision values before calculating the exp
(es = exp(x-max)
; this is equivalent but stable).
What do you guys think of a bottom-up approach. Where we just start to try this on an example in a WIP branch and look how it feels? I added a first attempt here https://github.com/JuliaStats/RegERMs.jl/pull/4 and tried some concepts. Feel free to comment (it is neither complete nor efficient yet).
@BigCrunsh Thanks for starting working on it. Would you please also try implement the multi-class logistic regression? I think that would show how the optimization procedure for multi-class logistic regression and standard logistic regression are quite different from each other.
See my comments. There are a few efficiency suggestions. But the main issue that I see is that you are currently using symbols to indicate different solvers, that would make the framework difficult to extend. It should be better to introduce RegressionSolver
and a set of subtypes to indicate different solvers
abstract RegressionSolver
type SGDSolver <: RegressionSolver
# some fields to specify how the learning rates are defined
end
function optimize(solver::SGDSolver, X::AbstractMatrix, y::AbstractVector)
....
end
type BFGSSolver <: RegressionSolver end
function optimize(solver::BFGSSolver, X::AbstractMatrix, y::AbstractVector)
....
end
Any one can implement his own solver without hacking around the if-else
statements.
In terms of generalizing to nonlinear regression, I think @lendle's formulation is to use the following as the risk term.
loss(h(x, \theta), y)
I agree with that we should support this.
Generally, the risk computation consists of two steps:
h(x, \theta)
. In linear cases, h(x, \theta) = \theta'x
. loss
function.These two steps are kind of orthogonal to each other. Using GLM's terminology, h
may be called a Predictor
.
It is not difficult to generalize the framework by incorporating a predictor in the optimize
function. A predictor
should be able to produce a prediction value and a derivative w.r.t. \theta
. When the predictor is omitted, we fallback to the standard linear cases.
function optimize(solver, predictor, X, y)
function optimize(solver, X, y) # falls back to linear cases
@lindahua: yes, I agree. A soon as I have time, I will start with the multi-class logreg. Currently, I just want to see how conceptual things are working. Feel free to join (especially feel free fixing the efficiency issues).
Regarding the RegressionSolver
in principle, I also agree. However, I would like to build up on Optim.jl
whenever it is possible in order to avoid double work.
I agree with first focus on getting things work. I can always help to optimize the performance once your PR is merged.
We can have a BFGSSolver
that simply delegates to the BFGS algorithm in Optims, and we can have a SGDSolver
also.
In this way, we don't repeat the work already done in Optims, we just use them.
The purpose of different solver subtypes is primarily for the purpose of dispatch. The internal implementation can be delegated to other's functions or implemented by ourselves.
Sorry for belaboring the sum/mean loss issue, but I have a couple more arguments in favor of using the mean. I'll (probably) shut up after this!
Risk interpretation
Risk is defined as the mean loss w.r.t. some distribution and for the empirical risk that distribution is the empirical distribution, so by definition ERM is minimizing a mean loss. We could then use risk_and_gradient
instead of tloss_and_gradient
.
Statistically, our true goal is typically to minimize some true risk on new data
R_0(f_theta) = E_0 \ell(f_\theta, X, Y)
which we attempt to do by minimizing the empirical risk + regularizer
R_n(f_\theta) = E_n \ell(f_\theta, X,Y) + \Omega[f_\theta].
It's appealing to me that the objective we're optimizing numerically, R_n
, can be thought of as an estimate of our ideal objective R_0
, and that as n
increases the expectation of R_n
at a given f_\theta
is constant. In the sum loss case, the regularization term gets dominated by the sum loss, so the object function is always changing with n
.
As I mentioned before, the risk on a mini-batch
R_{S_n}(f_\theta) = E_{S_n} \ell(f_\theta, X,Y) + \Omega[f_\theta]
can also then be thought of as an estimate of R_n
where E_{S_n}
is the expectation w.r.t. the empirical distribution of the mini-batch S
. This thinking also extends to (cross-)validation, where the risk on an independent validation set V_n
,
R_{V_n}(f_\theta) = E_{V_n} \ell(f_\theta, X, Y)
is an unbiased estimate of R_0
.
SGD/mini-batch
As @BigCrunsh points out, the discrepancy in sum loss in mini-batch vs. batch estimation can be overcome by scaling appropriately, but this requires knowing total sample size n
. To me, though, the mean loss is simpler because we don't need to worry about when to scale and when not. The risk+regularizer on any subset is always an approximation of the risk+regularizer on the full data set.
Also, I'm not sure if online optimization is within the scope of this package, but in an online setting, we often don't know n
ahead of time, so scaling would be impossible.
Relative strength of regularization/MAP interpretation
One advantage @lindahua points out is that the sum loss formulation has over mean loss is that the role of the optimizer is reduced automatically as sample size increases. In practice I don't think this matters too much though, because we should really select regularization parameters with something like cross validation where training set sizes are order n
.
I'm not much of a Bayesian so I don't usually think in terms of MAP estimation, but I do like the interpretation and I think it's useful. If we did go for the mean loss formula, it would not be easy to allow for a MAP interpretation by just scaling the regularizer by 1/n
.
Summary:
1) It's easier to be internally consistent with mean loss. Whenever calculating a risk on some collection of m
observations, always take the mean then add the regularizer. Don't need to worry about scaling anything after that if m < n
.
2) A risk on some set of observations (subset of all, training set, validation set, whatever) can be though of as an estimate of true risk R_0
.
3) It's easy to formulate the problem as a MAP problem by scaling the regularizer by 1/n
. Could be provided as a kwarg to the Regularizer
or RegressionProblem
object. This would even be fine with me if it were the default, I just strongly prefer that we use the mean loss internally.
I am not absolutely sure, but isn't a link function just reflected by a specific type of a loss function? If so, it is already captured, right @lendle.
Yes, at least in a link function in the GLM sense.
@lendle I do see that the mean of loss is useful in many of the cases that you mention, however, sum of losses are also useful, particular in MAP estimation (probably because my background is closely related to Bayesian statistics, I saw a lot of such cases).
My suggestion is to keep both options (preferably via a keyword argument), so that people from different domains all find it useful.
@lindahua That sounds good to me. I forgot to mention that I am also in favor of your suggestion for a kw arg!
There are even more equivalent formulations of the optimisation problem, which we might want to support. E.g., the optimisation problem
\sum_i \ell(f, x_i, y_i) + \rho_1\Omega[f]
is equivalent to the constraint version
\sum_i \ell(f, x_i, y_i), s.t., \Omega[f]<\rho_2
which is optimized via SGD (pegasos) in https://github.com/JuliaStats/SVM.jl.
Hmm, it seems the interface you are designing assumes that all solvers should be able to solve for all types of loss functions. Isn't this a limitation in case some solvers only work for some subset of loss functions? They might have benefits for some specific loss function for example.
Sorry if I'm coming into the discussion too late or misunderstood anything.
I don't think that we require each solver to be able to solve everything.
Using multiple dispatch, a solver can implement for a certain family of problems.
This issue carries on the discussion at https://github.com/JuliaStats/Roadmap.jl/issues/14 with the intend to make things more concrete. To start with this, I see a list of apparent questions and TODOs.
As always your suggestions and opinions are really appreciated:
Interfaces Basic functionality of a regression model should be 1) optimise the model parameters using a set of instances and labels and 2) make a prediction for a set of instances. Perhaps, we should do some renaming, e.g.:
See also https://github.com/JuliaStats/Roadmap.jl/wiki/Proposal:-interface-for-JuliaStats-packages-(Statistics-and-Machine-Learning).
Instances of RegERMs What should happen with certain instances like SVM, LogReg, RidgeRegression, Lasso,... ?
loss.jl / regularizer.jl The aim of this package is to combine these concepts into the regularised empirical risk minimisation framework. For now, it makes sense to have them in this package and conflate them with existing approaches like https://github.com/johnmyleswhite/Loss.jl/.
sgd.jl Stochastic gradient descent methods are typical solvers for RegERMs. I am not sure where it fits best and who else uses it. Should it be moved to (or just replaced by) https://github.com/johnmyleswhite/SGD.jl/?
Kernels
Cross-validation Cross-validation is important for estimating the performance of a given model and for hyper-parameter tuning. The regularisation parameter is actually just another model parameter, which has to be optimised. Thus, procedures like marginal likelihood tuning and cross-validation should be located at the solver level. See, e.g., https://github.com/JuliaStats/MLBase.jl.
I would like to make the whole discussion a bit more specific and apply / adapt these concepts asap to an example like logistic regression (multi-class, kernelized) to see limits and problems.