JuliaStats / RegERMs.jl

DEPRECATED: Regularised Empirical Risk Minimisation Framework (SVMs, LogReg, Linear Regression) in Julia

Design of a general API for Regression problems #3

Open BigCrunsh opened 10 years ago

BigCrunsh commented 10 years ago

This issue carries on the discussion at https://github.com/JuliaStats/Roadmap.jl/issues/14 with the intent to make things more concrete. To start, I see a list of open questions and TODOs.

As always your suggestions and opinions are really appreciated:

Interfaces The basic functionality of a regression model should be to 1) optimise the model parameters using a set of instances and labels, and 2) make predictions for a set of instances. Perhaps we should also do some renaming.

See also https://github.com/JuliaStats/Roadmap.jl/wiki/Proposal:-interface-for-JuliaStats-packages-(Statistics-and-Machine-Learning).

Instances of RegERMs What should happen with concrete instances like SVM, LogReg, RidgeRegression, Lasso, ...?

loss.jl / regularizer.jl The aim of this package is to combine these concepts into the regularised empirical risk minimisation framework. For now, it makes sense to keep them in this package and consolidate them with existing approaches like https://github.com/johnmyleswhite/Loss.jl/.

sgd.jl Stochastic gradient descent methods are typical solvers for RegERMs. I am not sure where it fits best and who else uses it. Should it be moved to (or just replaced by) https://github.com/johnmyleswhite/SGD.jl/?

Kernels

Cross-validation Cross-validation is important for estimating the performance of a given model and for hyper-parameter tuning. The regularisation parameter is actually just another model parameter, which has to be optimised. Thus, procedures like marginal likelihood tuning and cross-validation should be located at the solver level. See, e.g., https://github.com/JuliaStats/MLBase.jl.

I would like to make the whole discussion a bit more specific and apply / adapt these concepts asap to an example like logistic regression (multi-class, kernelized) to see limits and problems.

lindahua commented 10 years ago

My suggestion: we should have three abstract types (probably more, but these are the basics): Loss, Regularizer, and RegressionProblem.

An instance of Loss should be just a functor, for example:

abstract Loss

type SquaredLoss <: Loss end

value(f::SquaredLoss, x::Real) = 0.5 * (x * x)
deriv(f::SquaredLoss, x::Real) = float64(x)

# sometimes, value and deriv can share part of computation
value_and_deriv(f::SquaredLoss, x::Real) = (z = float64(x); (0.5 * (z * z), z))

The combination of input data and loss function should happen during the regression itself; the loss function doesn't have to keep the data and response as fields.

I will talk about Regularizer and RegressionProblem later.

lindahua commented 10 years ago

The computation of total loss and the gradient w.r.t. the linear coefficients can be readily implemented based on the value and deriv functions above, as

n = size(X, 2)   # one instance per column of X; f is a Loss functor
u = X'theta
s = 0.0
d = Array(Float64, n)
for i = 1:n
    v, d[i] = value_and_deriv(f, u[i])
    s += v
end
grad = X * d

Obviously, one can make this into a function, or adapt it to cases with an intercept.

lindahua commented 10 years ago

Similarly, I think w shouldn't be a field of a regularizer; it should instead be supplied as an argument to the function that evaluates the value and/or gradient.
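
To make that concrete, here is a minimal sketch of what I have in mind (the names L2Reg, value, and grad are only placeholders, mirroring the Loss functor style above):

abstract Regularizer

type L2Reg <: Regularizer
    lambda::Float64   # regularization strength
end

# w is passed in as an argument, not stored in the functor
value(r::L2Reg, w::AbstractVector) = 0.5 * r.lambda * dot(w, w)
grad(r::L2Reg, w::AbstractVector) = r.lambda * w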

BigCrunsh commented 10 years ago

I agree in principle with these abstract types. For the loss, however, I suggest having a function that takes a decision value and a label, since there are differences between classification and "regression", e.g.,

value(l::SquaredLoss, f::Real, y::Real) = 0.5 * (f - y) * (f - y)
deriv(l::SquaredLoss, f::Real, y::Real) = float64(f - y)

# guard against overflow of exp for large negative margins y*f
value(l::LogisticLoss, f::Real, y::Real) = -y*f > 34 ? -y*f : log1p(exp(-y*f))
deriv(l::LogisticLoss, f::Real, y::Real) = -y / (1 + exp(y*f))

Furthermore, I think we need a version that takes a vector of function values, since the loss for, e.g., multi-class logistic regression depends on all decision values.

lindahua commented 10 years ago

Sure. The loss should take a predictor value and response value. That was my oversight.

lindahua commented 10 years ago

For multi-class logistic, we may have another loss abstract type, maybe called MLoss or MultivalueLoss?

Yes, such loss functors should take vector inputs.

BigCrunsh commented 10 years ago

or just a different signature value(l::LogisticLoss, f::Real, y::Real) and value(l::LogisticLoss, f::Vector, y::Real)?

BigCrunsh commented 10 years ago

and perhaps value(l::LogisticLoss, f::Vector), which returns the losses for all possible classes.

lindahua commented 10 years ago

I think we should have different kinds of loss functions depending on the forms of the input and output.

The most common type is one that takes a scalar predictor and compares it with a scalar response. However, as @BigCrunsh suggested, multi-class logistic regression works a little differently.

The actual estimation algorithm would be different for these two types of losses. For example, the solution for the former should be a vector of length d (or d + 1 if a bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m) or (d+1, m).

Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.

To be more specific, I think the API can be summarized as follows (taking into account @BigCrunsh's comments above):

abstract AbstractLoss

# for scalar predictor -- scalar response
abstract Loss <: AbstractLoss

# for multi-class cases
abstract MulticlassLoss <: AbstractLoss

# Specific examples of Loss (e.g. logistic)
type LogisticLoss <: Loss end

# evaluate the value of the loss
value(f::LogisticLoss, x::Real, y::Real) = log1p(exp(-y * x))

# evaluate the derivative w.r.t. x
deriv(f::LogisticLoss, x::Real, y::Real) = (e = exp(-y * x); -y * e / (1 + e))

# fuse the computation of value and derivative;
# this example shows that fused computation is more efficient
# when we want both the value and the derivative
function value_and_deriv(f::LogisticLoss, x::Real, y::Real)
    e = exp(-y * x)
    return (log1p(e), -y * e / (1 + e))
end

# Example of MulticlassLoss
type MulticlassLogisticLoss <: MulticlassLoss end

function value(f::MulticlassLogisticLoss, x::AbstractVector, y::Integer)
    es = exp(x)
    -log(es[y] / sum(es))
end

# one can do value(f, x, 1:m) to obtain the losses for all classes jointly
function value(f::MulticlassLogisticLoss, x::AbstractVector, y::UnitRange)
    es = exp(x)
    -log(es[y] ./ sum(es))
end

# likewise, we may also define ``deriv`` and ``value_and_deriv``

lindahua commented 10 years ago

Various generic functions can be written based on the loss functors:

# total loss and its gradient w.r.t. theta
function tloss_and_gradient(f::Loss, 
                    X::AbstractMatrix, 
                    y::AbstractVector, 
                    theta::AbstractVector)
    n = size(X, 2)  # n is the number of samples in X
    u = X'theta    
    s = 0.0
    dv = zeros(n)
    for i = 1:n
        v, dv[i] = value_and_deriv(f, u[i], y[i])
        s += v
    end
    return (s, X * dv)  # X * dv is the gradient w.r.t. theta
end

function tloss_and_gradient(f::MulticlassLoss, 
                    X::AbstractMatrix, 
                    y::AbstractVector,  # a vector of integer labels 
                    theta::AbstractMatrix)  # note: theta should be a matrix, size (d, m)
    n = size(X, 2) 
    m = size(theta, 2)  # m is the number of classes
    U = theta'X  # size (m, n)
    D = zeros(m, n)
    s = 0.0
    for i = 1:n
        v, dv = value_and_deriv(f, U[:,i], y[i])
        s += v
        D[:,i] = dv
    end
    return (s, A_mul_Bt(X, D))  # gradient w.r.t. theta, size (d, m)
end

This also demonstrates why we have to introduce Loss and MulticlassLoss.

tloss_and_gradient is not a great name; I just use it as an example.
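
For concreteness, a hypothetical usage sketch (made-up data, assuming the LogisticLoss defined above):

# 3 features, 5 samples, binary labels in {-1, +1}
X = randn(3, 5)
y = [1.0, -1.0, 1.0, 1.0, -1.0]
theta = zeros(3)

# total logistic loss and its gradient at theta
s, g = tloss_and_gradient(LogisticLoss(), X, y, theta)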

lindahua commented 10 years ago

Some modifications to the details may make the implementation more efficient.

First, we probably want a value_and_deriv! for MulticlassLoss that directly writes the derivatives to a pre-allocated array:

function value_and_deriv!(dv::AbstractVector, f::MulticlassLoss, x::AbstractVector, y::Integer)
    # ......
end

# generic fallback for all subtypes of MulticlassLoss;
# value_and_deriv! returns the loss value and fills dv with the derivatives
function value_and_deriv(f::MulticlassLoss, x::AbstractVector, y::Integer)
    dv = Array(Float64, length(x))
    v = value_and_deriv!(dv, f, x, y)
    return (v, dv)
end

With this, part of the tloss_and_gradient can be made more efficient:

    dv = zeros(m) 
    for i = 1:n
        v = value_and_deriv!(dv, f, view(U,:,i), y[i])
        s += v
        D[:,i] = dv
    end

This avoids repetitively allocating new vectors within the inner loop.
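
For example, a possible in-place implementation for MulticlassLogisticLoss could look as follows (just a sketch following the contract above: it returns the loss value -log(softmax(x)[y]) and fills dv with the gradient softmax(x) - e_y):

function value_and_deriv!(dv::AbstractVector, f::MulticlassLogisticLoss, x::AbstractVector, y::Integer)
    xmax = maximum(x)          # shift by the maximum for numerical stability
    s = 0.0
    for j = 1:length(x)
        dv[j] = exp(x[j] - xmax)
        s += dv[j]
    end
    for j = 1:length(x)
        dv[j] /= s             # dv now holds the soft-max probabilities
    end
    v = log(s) + xmax - x[y]   # equals -log(p_y)
    dv[y] -= 1.0               # gradient of the loss w.r.t. x
    return v
end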

lindahua commented 10 years ago

We can then introduce an abstract type RegressionSolver to solve a regression analysis problem:

abstract RegressionSolver

type MySolver <: RegressionSolver
    ....
end

# any solver type should implement a `solve` method as below
function solve(solver::MySolver, f::Loss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
    ....
end

# a solver may (optionally) support other types of losses, for example:
function solve(solver::MySolver, f::MulticlassLoss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
    ....
end

lindahua commented 10 years ago

I hope the several posts above clarify my thoughts about the API design.

lendle commented 10 years ago

Wouldn't the gradient require knowledge of some link function (maybe determined by RegressionProblem) as well? In tloss_and_gradient you're calculating the gradient as X * dv, but in GLMs for example, isn't that only the case if you're using the canonical link?

lindahua commented 10 years ago

At this level, we are only considering the loss function and the regularization function.

The GLM-related stuff like link functions and response distributions is at a higher level.

For GLMs, the loss function is -log p(y|x), which is not the link function itself.

lendle commented 10 years ago

After reading again, it looks like the argument f in value(::Loss, f, y) is expected to be linear in the coefficients. Is that correct? In that case, the gradient calculation makes sense.

If so, does this mean that nonlinear regression and multilayer neural nets would not fit in this framework?

lindahua commented 10 years ago

The current design is for an optimization problem of the form:

\sum_i f(\theta^T x_i; y_i) + r(\theta)

Here, the first argument to f is a dot product between the parameter theta and x.

It is, however, possible to extend this framework to handle the following generalized cases

\sum_i f(\eta(\theta)^T x_i; y_i) + r(\theta)

by introducing an additional nonlinear transform from \theta to \eta. @lendle: I guess this is what you mean by nonlinear regression?

lendle commented 10 years ago

@lindahua By nonlinear regression I meant solving the optimization problem

\sum_i f(h(\theta, x_i); y_i) + r(\theta)

for some known, nonlinear h as in nonlinear least squares.

lindahua commented 10 years ago

I see. It is not difficult to generalize to such cases. I will follow up later.

lendle commented 10 years ago

I suppose the easiest way would be to have something like

abstract AbstractLoss
abstract Loss <: AbstractLoss
abstract LinLoss <: Loss #probably need a better name
type LogisticLoss <: LinLoss end
type SquaredLoss <: LinLoss end

and define a default tloss_and_gradient! for LinLoss based on value_and_deriv defined for subtypes of LinLoss. Other subtypes of Loss that are not subtypes of LinLoss will have to define their own tloss_and_gradient! method in addition to value_and_deriv.

Another option for names, since it's probably most common for the h function above to be h(theta, x) = theta'x:

abstract AbstractLoss
abstract ScalarLoss <: AbstractLoss 
abstract Loss <: ScalarLoss
abstract NonLinLoss <: ScalarLoss
type LogisticLoss <: Loss end
type SquaredLoss <: Loss end
...

Then loss functions where h is nonlinear can extend NonLinLoss.
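
As an illustration (purely hypothetical, not a proposal for a concrete loss), a NonLinLoss subtype would bring its own total-loss method, since the generic X * dv gradient no longer applies:

# h(theta, x) = exp(theta'x) with squared error; chain rule: dh/dtheta = h * x
type ExpSquaredLoss <: NonLinLoss end

function tloss_and_gradient(f::ExpSquaredLoss, X::AbstractMatrix,
                            y::AbstractVector, theta::AbstractVector)
    n = size(X, 2)
    s = 0.0
    g = zeros(length(theta))
    for i = 1:n
        h = exp(dot(theta, X[:, i]))   # nonlinear predictor
        r = h - y[i]
        s += 0.5 * r * r
        g += r * h * X[:, i]           # (h - y) * dh/dtheta
    end
    return (s, g)
end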

lendle commented 10 years ago

It might be better for tloss_and_gradient to calculate the mean loss instead of sum, so that the loss is O(1) instead of O(n), and the amount of regularization does not depend on sample size.

lindahua commented 10 years ago

Generally, the relative amount of regularization does depend on the sample size.

This can be easily seen from a probabilistic perspective. Consider an MAP problem, as

p(\theta) \prod_i p(y_i|x_i; \theta)

This can be turned into a sum:

\log p(\theta) + \sum_i \log p(y_i | x_i; \theta)

I don't think it is correct to take the mean in the second term.

This makes a lot of sense: when you have only a small set of observations, you rely more on the prior; as more samples become available, the prior plays a relatively smaller role.

lendle commented 10 years ago

It's true that you'd want the relative amount of regularization to decrease as your total sample size increases, and typically you'd probably want to calculate the loss/gradient on the full data set.

In my mind, I guess I think of the optimization problem as the mean loss + regularization, and I'd think of decreasing the relative regularization by decreasing the penalty parameter as sample size increases. I'm probably in the minority there, and it doesn't really matter anyway most of the time.

A use case I was thinking of where it does matter is SGD/mini-batch GD, where you'd calculate the loss/gradient on a small fraction of the full data set, and you could think of that as an estimate of the loss/gradient calculated on the full data set. It's a lot simpler to think about loss + regularization as essentially the same whether you're calculating the loss on 1 observation or n. Hopefully I've articulated that clearly, let me know if not.

@lindahua What are your thoughts on that?

Edit: Maybe I'm not as much in the minority as I previously thought. R's glmnet, scikit-learn's ElasticNet and StatsModels' OLS.fit_regularized all minimize the mean loss + regularization.

lindahua commented 10 years ago

The behavior as to whether to use the mean or sum can be specified by a keyword argument.
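
A rough sketch of how that could look for the scalar case (avg is just a placeholder name for the keyword):

function tloss_and_gradient(f::Loss, X::AbstractMatrix, y::AbstractVector,
                            theta::AbstractVector; avg::Bool = false)
    n = size(X, 2)
    u = X'theta
    s = 0.0
    dv = zeros(n)
    for i = 1:n
        v, dv[i] = value_and_deriv(f, u[i], y[i])
        s += v
    end
    c = avg ? 1.0 / n : 1.0   # switch between mean and sum of the losses
    return (c * s, c * (X * dv))
end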

BigCrunsh commented 10 years ago

Oh! You guys are very quick. I am trying to order my thoughts:

Loss types

The actual estimation algorithm would be different for these two types of losses. For example, the solution for the former should be a vector of length d (or d + 1 if a bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m) or (d+1, m).

Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.

The former is just a special case of the latter for m=1, isn't it? I see no difference between the estimation procedures. I would like to keep it as simple as possible. Could we start with just one common abstract loss type and make this distinction only if it becomes necessary?

Total loss / Non-linear models / Link functions As @lendle mentioned, @lindahua's tloss_and_gradient computes the loss over all examples for a linear model. I see no need to restrict ourselves to linear models, and I would like to see this in a more general way. The general objective, which we aim to minimise, is

L[\theta] = \sum_i \ell(f_\theta(x_i), y_i) + \Omega[f_\theta],

for some model parameter vector \theta, loss function \ell, and regularizer \Omega. This formulation also allows us to incorporate linear models f_\theta(x) = f_w(x) = w*x. Although non-linear models can be reduced to linear models via mercer_map.jl (as in the current implementation), the formulation above also includes the native dual model f_\theta(x) = f_\alpha(x) = \sum_i \alpha_i * k(x_i, x).

The general gradient is then

L'[\theta] = \sum_i f'_\theta(x_i)\ell'(f_\theta(x_i), y_i) + f'_\theta\Omega'[f_\theta],

where the gradients f', \ell', ... are taken w.r.t. \theta.

So, instead of having a vector theta, I suggest having a type model here, which has to provide the functionality of evaluating a value and a gradient w.r.t. the model parameters. Then, the loss itself is independent of how the decision values are computed. This was the intent of model in the current implementation.
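
A minimal sketch of that idea (type and function names are only illustrative):

abstract Model

type LinearModel <: Model
    w::Vector{Float64}
end

# decision value f_w(x) and its gradient w.r.t. the parameters w
predict(m::LinearModel, x::AbstractVector) = dot(m.w, x)
grad(m::LinearModel, x::AbstractVector) = x   # d f_w(x) / d w = x

A kernelised dual model would then implement the same two functions with alpha as its parameters, and the loss machinery would not have to change.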

I am not absolutely sure, but isn't a link function just reflected by a specific type of loss function? If so, it is already captured, right, @lendle?

SGD/mini-batch I agree with @lindahua that we should keep the analogy of RegERMs to MAP estimation. And yes, @lendle, we also need the possibility to calculate stochastic gradients of the objective for SGD/mini-batches. But this isn't a problem (see, e.g., RegERMs.jl (line 23)). We should also have a function like stochastic_loss_and_gradient that computes a part of the objective as:

\sum_{i \in S} \ell(f, x_i, y_i) + \frac{|S|}{n}\Omega[f],

where S is a subset of the examples (the regulariser is scaled so that summing over all batches recovers the full objective).

Log loss A bit unrelated, but for numerical stability we should "approximate" the log loss by -y*f when -y*f > 34, and for the multi-class version (soft-max) subtract the maximum over all decision values before exponentiating (es = exp(x - maximum(x)); this is equivalent but numerically stable).
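
In code, the two stabilisations could look like this (a sketch; the function names are made up):

# log(1 + exp(-y*f)) overflows for large -y*f, but is ~ -y*f there
stable_logloss(f::Real, y::Real) = -y*f > 34 ? -y*f : log1p(exp(-y*f))

# subtracting the maximum leaves the soft-max unchanged but avoids overflow:
# exp(x - m) / sum(exp(x - m)) == exp(x) / sum(exp(x))
function stable_softmax(x::AbstractVector)
    m = maximum(x)
    es = exp(x - m)
    es / sum(es)
end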

BigCrunsh commented 10 years ago

What do you guys think of a bottom-up approach, where we just start to try this on an example in a WIP branch and see how it feels? I added a first attempt in https://github.com/JuliaStats/RegERMs.jl/pull/4 and tried some concepts. Feel free to comment (it is neither complete nor efficient yet).

lindahua commented 10 years ago

@BigCrunsh Thanks for starting to work on it. Would you please also try to implement multi-class logistic regression? I think that would show how the optimization procedures for multi-class logistic regression and standard logistic regression differ from each other.

See my comments. There are a few efficiency suggestions. But the main issue that I see is that you are currently using symbols to indicate different solvers, which would make the framework difficult to extend. It would be better to introduce RegressionSolver and a set of subtypes to indicate different solvers:

abstract RegressionSolver

type SGDSolver <: RegressionSolver 
    # some fields to specify how the learning rates are defined
end

function optimize(solver::SGDSolver, X::AbstractMatrix, y::AbstractVector)
    ....
end

type BFGSSolver <: RegressionSolver end

function optimize(solver::BFGSSolver, X::AbstractMatrix, y::AbstractVector)
    ....
end

Anyone can then implement their own solver without hacking around if-else statements.

In terms of generalizing to nonlinear regression, I think @lendle's formulation is to use the following as the risk term.

loss(h(x, \theta), y)

I agree that we should support this.

Generally, the risk computation consists of two steps:

  1. produce a prediction using h(x, \theta). In linear cases, h(x, \theta) = \theta'x.
  2. measure the deviation between the prediction and response, using the loss function.

These two steps are kind of orthogonal to each other. Using GLM's terminology, h may be called a Predictor.

It is not difficult to generalize the framework by incorporating a predictor in the optimize function. A predictor should be able to produce a prediction value and a derivative w.r.t. \theta. When the predictor is omitted, we fall back to the standard linear case.

function optimize(solver, predictor, X, y)

function optimize(solver, X, y)   # falls back to the linear case

BigCrunsh commented 10 years ago

@lindahua: yes, I agree. As soon as I have time, I will start with the multi-class logreg. Currently, I just want to see how the conceptual things work out. Feel free to join (and especially feel free to fix the efficiency issues).

BigCrunsh commented 10 years ago

Regarding the RegressionSolver: in principle, I also agree. However, I would like to build on Optim.jl whenever possible in order to avoid duplicating work.

lindahua commented 10 years ago

I agree with first focusing on getting things to work. I can always help optimize the performance once your PR is merged.

We can have a BFGSSolver that simply delegates to the BFGS algorithm in Optim.jl, and we can have an SGDSolver as well.

In this way, we don't repeat the work already done in Optim.jl; we just use it.

The different solver subtypes exist primarily for dispatch. The internal implementation can be delegated to others' functions or implemented by ourselves.
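
As a sketch of the delegation idea (reusing the value/grad regularizer primitives sketched earlier; the exact Optim.jl call signature and result fields differ between versions, so treat those details as assumptions, and note that calling tloss_and_gradient twice is wasteful and only for illustration):

type BFGSSolver <: RegressionSolver end

function solve(solver::BFGSSolver, f::Loss, r::Regularizer,
               X::AbstractMatrix, y::AbstractVector)
    d = size(X, 1)
    # objective and gradient built from the loss/regularizer primitives
    obj(theta) = tloss_and_gradient(f, X, y, theta)[1] + value(r, theta)
    function grad!(theta, storage)
        storage[:] = tloss_and_gradient(f, X, y, theta)[2] + grad(r, theta)
    end
    res = Optim.optimize(obj, grad!, zeros(d), method = :bfgs)
    return res.minimum   # the fitted coefficients (field name varies by version)
end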

lendle commented 10 years ago

Sorry for belaboring the sum/mean loss issue, but I have a couple more arguments in favor of using the mean. I'll (probably) shut up after this!

Risk interpretation Risk is defined as the mean loss w.r.t. some distribution, and for the empirical risk that distribution is the empirical distribution, so by definition ERM minimizes a mean loss. We could then use risk_and_gradient instead of tloss_and_gradient.

Statistically, our true goal is typically to minimize some true risk on new data

R_0(f_\theta) = E_0 \ell(f_\theta, X, Y)

which we attempt to do by minimizing the empirical risk + regularizer

R_n(f_\theta) = E_n \ell(f_\theta, X,Y) + \Omega[f_\theta].

It's appealing to me that the objective we're optimizing numerically, R_n, can be thought of as an estimate of our ideal objective R_0, and that as n increases, the expectation of R_n at a given f_\theta is constant. In the sum-loss case, the regularization term gets dominated by the summed loss, so the objective function keeps changing with n.

As I mentioned before, the risk on a mini-batch

R_{S_n}(f_\theta) = E_{S_n}  \ell(f_\theta, X,Y) + \Omega[f_\theta]

can also then be thought of as an estimate of R_n where E_{S_n} is the expectation w.r.t. the empirical distribution of the mini-batch S. This thinking also extends to (cross-)validation, where the risk on an independent validation set V_n,

R_{V_n}(f_\theta) = E_{V_n} \ell(f_\theta, X, Y)

is an unbiased estimate of R_0.

SGD/mini-batch As @BigCrunsh points out, the discrepancy in the sum loss in mini-batch vs. batch estimation can be overcome by scaling appropriately, but this requires knowing the total sample size n. To me, though, the mean loss is simpler because we don't need to worry about when to scale and when not to. The risk+regularizer on any subset is always an approximation of the risk+regularizer on the full data set. Also, I'm not sure whether online optimization is within the scope of this package, but in an online setting we often don't know n ahead of time, so scaling would be impossible.

Relative strength of regularization/MAP interpretation One advantage that @lindahua points out the sum-loss formulation has over the mean loss is that the relative role of the regularizer is reduced automatically as sample size increases. In practice I don't think this matters too much, though, because we should really select regularization parameters with something like cross-validation, where training set sizes are of order n. I'm not much of a Bayesian so I don't usually think in terms of MAP estimation, but I do like the interpretation and I think it's useful. If we did go for the mean-loss formula, it would be easy to allow for a MAP interpretation by just scaling the regularizer by 1/n.

Summary:

  1. It's easier to be internally consistent with the mean loss. Whenever calculating a risk on some collection of m observations, always take the mean and then add the regularizer; there is no need to worry about scaling anything after that if m < n.
  2. A risk on some set of observations (a subset of all of them, the training set, a validation set, whatever) can be thought of as an estimate of the true risk R_0.
  3. It's easy to formulate the problem as a MAP problem by scaling the regularizer by 1/n. This could be provided as a kwarg to the Regularizer or RegressionProblem object. It would even be fine with me if that were the default; I just strongly prefer that we use the mean loss internally.

lendle commented 10 years ago

I am not absolutely sure, but isn't a link function just reflected by a specific type of a loss function? If so, it is already captured, right @lendle.

Yes, at least for a link function in the GLM sense.

lindahua commented 10 years ago

@lendle I do see that the mean of the losses is useful in many of the cases you mention; however, the sum of the losses is also useful, particularly in MAP estimation (probably because my background is closely related to Bayesian statistics, I have seen a lot of such cases).

My suggestion is to keep both options (preferably via a keyword argument), so that people from different domains all find it useful.

lendle commented 10 years ago

@lindahua That sounds good to me. I forgot to mention that I am also in favor of your suggestion for a kw arg!

BigCrunsh commented 10 years ago

There are even more equivalent formulations of the optimisation problem that we might want to support. E.g., the optimisation problem

\sum_i \ell(f, x_i, y_i) + \rho_1 \Omega[f]

is equivalent to the constrained version

\sum_i \ell(f, x_i, y_i), s.t. \Omega[f] < \rho_2,

which is optimised via SGD (Pegasos) in https://github.com/JuliaStats/SVM.jl.
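
For reference, a sketch of the standard argument behind this equivalence (assuming convex \ell and \Omega; details omitted):

% the penalized problem
\min_f \; \sum_i \ell(f, x_i, y_i) + \rho_1 \Omega[f]
% is the Lagrangian form of the constrained problem
\min_f \; \sum_i \ell(f, x_i, y_i) \quad \text{s.t.} \quad \Omega[f] \le \rho_2 ,
% with \rho_1 \ge 0 acting as the multiplier of the constraint: taking
% \rho_2 = \Omega[f^*] at the penalized optimum f^* makes f^* optimal for
% the constrained problem as well, so both forms trace out the same
% regularization path under different parameterizations.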

robertfeldt commented 10 years ago

Hmm, it seems the interface you are designing assumes that all solvers should be able to solve for all types of loss functions. Isn't this a limitation in case some solvers only work for a subset of loss functions? They might have benefits for a specific loss function, for example.

Sorry if I'm coming into the discussion too late or misunderstood anything.

lindahua commented 10 years ago

I don't think we require each solver to be able to solve everything.

Using multiple dispatch, a solver can be implemented for just a certain family of problems.
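
For example (a sketch with hypothetical names): a solver that only handles scalar losses simply defines solve for Loss and not for MulticlassLoss; calling it with an unsupported loss then raises a MethodError instead of failing silently.

type ScalarOnlySolver <: RegressionSolver end

# only scalar losses are supported; no method is defined for MulticlassLoss
function solve(solver::ScalarOnlySolver, f::Loss, r::Regularizer,
               X::AbstractMatrix, y::AbstractVector)
    # ... solver-specific implementation
end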