Multinomial regression has too many parameters

JuliaAI / MLJLinearModels.jl

Generalized Linear Regressions Models (penalized regressions, robust regressions, ...)

MIT License

81 stars 13 forks source link

Multinomial regression has too many parameters #105

Closed JockLawrie closed 3 years ago

JockLawrie commented 3 years ago

Hi there,

Love this package.

I've been trying a 3-class multinomial regression on the Iris data set, with an intercept and without penalty terms. The model trains and gives a result that yields reasonable predictions. The parameter matrix has dimension (p+1) x c, as mentioned in the docs. Shouldn't the first column of this matrix be full of zeroes, and the remaining columns be reduced element-wise by the first column? As it stands the model is not identifiable, and takes longer to estimate than it should because it has more parameters than required. If the extra/unnecessary dimensions are removed then the loss function has a unique global minimum and convergence should be quick.

I tried looking under the hood at the code that implements the loss function but struggled to find it. Can you point me to it?

Cheers, Jock

tlienart commented 3 years ago

Does this help?

import MLJBase: @load_iris
import DataFrames: DataFrame
import MLJLinearModels: MultinomialRegression, fit, objective

X, y = @load_iris

Xdf  = DataFrame(X)
Xmat = Matrix(Xdf)
yenc = Int.(y.refs)  # ugly shortcut, don't use it

n = size(Xmat, 1)  # number of data points
p = size(Xmat, 2)  # number of features
c = 3              # number of classes

MR = MultinomialRegression(
        lambda=0, gamma=0, fit_intercept=true, nclasses=c
)

θ = fit(MR, Xmat, yenc)

θ_mat = reshape(θ, (p+1, c))

# Loss
J = objective(MR, Xmat, yenc; c=c)
J(θ)

# ----

import PyCall: pyimport

sklm = pyimport("sklearn.linear_model")
skmr = sklm.LogisticRegression(
        C=Inf, multi_class="multinomial"
)
skmr.fit(Xmat, yenc)
θsk = vec(vcat(skmr.coef_', skmr.intercept_'))

J(θsk)

J(θ) - J(θsk)  # Note: -2e-8 on my side

θsk_mat = vcat(skmr.coef_', skmr.intercept_')

#=
On my side:
(note that the fact it's not identical is irrelevant, the
two solutions have similar loss, but you can see that the
two are comparable in terms of amplitude)

julia> θ_mat
5×3 Matrix{Float64}:
   9.61103  -3.5729     -6.03812
  25.686    -9.50255   -16.1834
 -38.8569   14.7137     24.1431
 -18.0646   -0.110761   18.1754
   5.28935  18.6742    -23.9636

julia> θsk_mat
5×3 Matrix{Float64}:
   8.02085  -2.77794   -5.24291
  21.0483   -7.18336  -13.8649
 -30.7568   10.6635    20.0933
 -14.4619   -1.91285   16.3747
   4.16734  19.2373   -23.4047
=#

Shouldn't the first column of this matrix be full of zeroes, and the remaining columns be reduced element-wise by the first column?

I don't understand this question.

There's c columns, one for each class; each column has p+1 rows corresponding to the weights for each feature with an extra weight for the intercept. This is how multinomial regression parameters are, why would there be a column of zeros?

The formula for MR can be done different ways but always amounts to

scores_per_class = f ( [X 1] θ )

where [X 1] is the n x (p+1) matrix with the last column a column of ones for the intercept, θ is the parameter to find and has dimension (p+1) x c and f is a function that get scores for each class so that scores_per_class is a matrix of size n x c with rows that sum to 1 and a score for each class.

In terms of pointers in the code I guess her: https://github.com/JuliaAI/MLJLinearModels.jl/blob/abf145f86768fe24f1804bbe852e6bb6989c60c1/src/loss-penalty/standard.jl#L95-L129

JockLawrie commented 3 years ago

The model is over-parameterised, which slows convergence and not yielding reproducible results because there are multiple solutions. By cutting out the extra parameters we can get the same parameters, and therefore the same loss, on every call to fit, and faster.

The issue here is how the matrix P in your documentation is calculated in the first place.

For a vector of predictors x (including a constant for the intercept) and p x c parameter matrix B, the probabilities are calculated as [exp(b1.x)/D, ..., exp(bc.x)/D], where the bj are the columns of B and D = exp(b1.x) + ... + exp(bc.x).

Divide numerators and denominators by exp(b1.x). Our probabilities are numerically identical, but calculated as [1/G, ..., exp(ac.x)/G], where aj .= bj .- b1 and G = 1 + ... + exp(ac.x). Notice that only c - 1 categories have an associated vector of parameters aj.

That is, we need only estimate the p x (c - 1) matrix A and not B. That's p fewer parameters for the same numerical result. It also means a loss function with lower dimension and a unique global minimum, so convergence will be faster and unique.

Here's an implementation, which assumes the usual checks have been done (c>1, no missing data, no collinearity, etc). It uses the first category as the reference category (a1 set to 0), but it could be any category. Note that the first loop starts at 2.

function update_probs!(probs, A, x)
    # Populate probs with the ax and find their maximum
    probs[1] = 0.0  # ax for first category is 0
    max_ax   = 0.0
    nclasses = length(probs)
    for j = 2:nclasses
        ax       = dot(x, view(A, :, j-1))
        probs[j] = ax
        max_bx   = ax > max_ax ? ax : max_ax
    end
    # Calculate the numerators and the denominator of the probabilities
    denom = 0.0
    for j = 1:nclasses
        val = exp(probs[j] - max_ax)  # Subtract max_ax first for numerical stability (then val <= 1)
        val = isnan(val) || val == -Inf ? 0.0 : val  # Handle remaining numerical instability
        probs[j] = val  # Numerator
        denom   += val
    end
    probs ./= denom  # Normalise. We know that denom >= 1, so this will work.
end

tlienart commented 3 years ago

Ah right I see what you're saying now, thanks & sorry for the confusion; this is also explained in e.g http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/ . I think this only makes sense when you're not using regularisation though right? and you would typically use regularisation for such problems (e.g. in sklearn the regularisation is basically always on unless you pass C=float(inf)). I need to think a bit more about it though. There may be value in having a specialised case when there's no penalty which follows the line of what you're saying; I haven't had the time to think in depth about whether there is some extension in the case of a penalty, my hunch is that there isn't but I might be wrong

JockLawrie commented 3 years ago

I haven't thought much about penalty terms. Guessing that L2 is ok because it's a convex function, therefore LL + L2 is also convex and has a unique minimum. Not sure about L1, maybe just some weirdness near the zero of each dimension but otherwise ok. Dunno.

In any case the probs are calculated as normalised exp terms and the math above applies.

On Wed, 11 Aug. 2021, 21:53 Thibaut Lienart, @.***> wrote:

Ah right I see what you're saying now, sorry for the confusion; this is also explained in e.g http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/ . I think this only makes sense when you're not using regularisation though right? and you would typically use regularisation for such problems (e.g. in sklearn the regularisation is basically always on unless you pass C=float(inf)). I need to think a bit more about it though. There may be value in having a specialised case when there's no penalty which follows the line of what you're saying; I haven't had the time to think in depth about whether there is some extension in the case of a penalty, my hunch is that there isn't but I might be wrong

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaAI/MLJLinearModels.jl/issues/105#issuecomment-896761964, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXLF5TQVCREW7EEMKPPWRDT4JQEHANCNFSM5B5QSFMQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

JockLawrie commented 3 years ago

Also, anchoring a category's parameters at 0 has a regularising effect

On Wed, 11 Aug. 2021, 22:09 Jock Lawrie, @.***> wrote:

I haven't thought much about penalty terms. Guessing that L2 is ok because it's a convex function, therefore LL + L2 is also convex and has a unique minimum. Not sure about L1, maybe just some weirdness near the zero of each dimension but otherwise ok. Dunno.

In any case the probs are calculated as normalised exp terms and the math above applies.

On Wed, 11 Aug. 2021, 21:53 Thibaut Lienart, @.***> wrote:

Ah right I see what you're saying now, sorry for the confusion; this is also explained in e.g http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/ . I think this only makes sense when you're not using regularisation though right? and you would typically use regularisation for such problems (e.g. in sklearn the regularisation is basically always on unless you pass C=float(inf)). I need to think a bit more about it though. There may be value in having a specialised case when there's no penalty which follows the line of what you're saying; I haven't had the time to think in depth about whether there is some extension in the case of a penalty, my hunch is that there isn't but I might be wrong

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaAI/MLJLinearModels.jl/issues/105#issuecomment-896761964, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXLF5TQVCREW7EEMKPPWRDT4JQEHANCNFSM5B5QSFMQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

tlienart commented 3 years ago

no I don't think that's right (and fixing a parameter is definitely not regularisation), also the convexity is irrelevant here, it's convex even in the 'overparametrized' setting. There may be a level set of minimisers but that's also irrelevant as long as we find one that has an optimal (or near optimal) loss which is what you'll get.

What can matter is that without penalty and the overparametrized world, the hessian is not strictly PD which can cause numerical troubles for newton-style solver, it's not the default solver nor the default setting though.

The convergence speed advantage is not guaranteed either (though it may be true in practice, this would have to be empirically checked)

And finally my point with it not holding with penalty is with the L2 penalty the objective function when you fix a parameter is not the same, so you would get different problems. Whether that matters or not, I don't know, I'd need to check this a bit carefully, it's not immediately clear to me.

JockLawrie commented 3 years ago

It's less a matter of fixing some parameters at 0 and more about the dimension of the problem. We could have a second intercept in a linear regression and then regularise, but it complicates the problem and adds no value.

On Wed, 11 Aug. 2021, 22:19 Thibaut Lienart, @.***> wrote:

no I don't think that's right (and fixing a parameter is definitely not regularisation), also the convexity is irrelevant here, it's convex even in the 'overparametrized' setting. There may be a level set of minimisers but that's also irrelevant as long as we find one that has an optimal (or near optimal) loss which is what you'll get.

What can matter is that without penalty and the overparametrized world, the hessian is not PD which can cause numerical troubles for newton-style solver, it's not the default solver though.

The convergence speed advantage is not guaranteed either.

And finally my point with it not holding with penalty is with the L2 penalty the objective function when you fix a parameter is not the same, so you would get different problems. Whether that matters or not, I don't know, I'd need to check this a bit carefully.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaAI/MLJLinearModels.jl/issues/105#issuecomment-896778646, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXLF5TWMCF7VWN25DVWWBTT4JTEZANCNFSM5B5QSFMQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

tlienart commented 3 years ago

I think we might be talking past each other, here's the summary as I see it:

for un-regularised Multinomial regression, I agree with what you said, the model is overparametrised and can be simplified, this may lead to faster convergence (though no guarantees) and that's about it
for regularised MNR, the objective function is not overparametrised anymore as far as I'm aware, and so (AFAIA) cannot be simplified, this is for any standard penalty function including the "just L2" one which is the most common
MNR (and in fact LR) are typically used with L2 regularisation at a minimum, this is because you can easily get numerical troubles when your data is poorly conditioned leading to poorly generalising model (which is what we care about), for standard packages like sklearn the L2 regularisation is applied by default (this is also the case here)

for (2) after doing some back of the envelope checks, I'm reasonably confident that you cannot simplify the objective function in the same way, so you could only provide a "reduced dimensionality" solver for the no-penalty MNR, this could be done, I'm not entirely sure how useful it is though beyond maybe the educational value that could be put in the docs since it's an interesting fact.

There's a subsequent question which could also be interesting and which is to check whether if you do take the un-regularised MNR and train it with or without this over-parametrisation, whether the resulting predictive model generalises any differently. My hunch is that it won't matter much but I may be wrong. I'd also be interested in seeing what changes if you pick a different base class (the hope would be that this doesn't matter too much otherwise it would be worrying).

JockLawrie commented 3 years ago

Yeah we have been talking past each other a bit, but we're converging.

Agreed.
The dimension of the problem is the number of parameters required to relate the input x to the output y. Since the penalty terms don't include input, they shouldn't have larger dimension than the parameters that relate the input to the output. That is, regularised MNR is still over-parameterised. Looks like the LogisticLoss has the right dimensionality, which holds with or without penalty terms. MNR should reduce to the same in the case of 2 classes.
Yep, I never suggested not regularising. Agree that we care about a model that generalises beyond the training data. To do that we need parameters that aren't too large, and not too many parameters. Regularisation is for constraining parameter sizes, not for introducing more parameters (except hyper parameters like lambda). It just directs the solver to a different point in parameter space, rather than increasing the dimension of the space. Therefore the regularised and unregularised versions will generalise differently. But that's orthogonal to the issue here, which is the dimension of the space.

The choice of the reference class makes no difference.

tlienart commented 3 years ago

Let's start with (2) would you mind trying to prove it formally? Ie write the loss function with l2 penalty and show that it's identically minimised by a vector x and x - c for an arbitrary constant c

JockLawrie commented 3 years ago

I don't follow. From earlier we are comparing B to A[:, j] = B[:, j] - B[:, 1], so I don't understand comparing the loss function evaluated at x and x - c.

What are we trying to show? Comparing penalty terms defined on spaces with different dimensions and different meanings doesn't make sense. Perhaps we show that the tests pass using a version that estimates A instead of B?

tlienart commented 3 years ago

The multinomial loss without penalty has the property that J(x) = J(x-c) for an arbitrary constant c and, in particular, if you take c to be one of the weight (this is what you're using though you might be looking at this differently), see http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/ the part "properties of the softmax regression" if you want more on this.

What I'm saying is simple: with the penalty the overparametrisation disappears (you cannot arbitrarily set coefficients to zero because the L2 norm of the parameter vector will change and therefore the objective will be different).

The options to move this conversation forward are as follows:

(1) show a formal proof that the penalised loss function has the same value for x or for x-c (where you take c to be one of the coefficient), or if you prefer a numerical experiment: (2) adapt your solver to work with penalty, pick two different base classes and compare the loss (according to you, the loss should be identical in that case too).

Where I believe the confusion lies is that what you're using initially is that the scores sum up to one, i.e. a transformed form of the coefficients, and for the non-penalised objective indeed you can leverage this to reduce the dimensionality of the problem. This goes away when you penalise because you penalise the coefficients and not the scores.

JockLawrie commented 3 years ago

Ah I misinterpreted c to be a vector of the same number repeated, not just a constant vector.

But we're still talking past each other. I'm not suggesting J(x) = J(x-c), though it happens to be in the non-penalised case. There's no reason for them to be equal in the penalised case - they're 2 different spaces, related but not the same.

The point I'm making is best made by referring to the 2-class case. We have a vector of parameters b1 for exactly 1 of the classes, and the solver does its thing for both the penalised and unpenalised cases. There's no need to introduce a new vector b2 for the other class and rewrite the likelihood function and penalty terms to include both b1 and b2. You say that this amounts to arbitrarily setting b2 to 0. I say that b2 is not part of this problem. Why would it be there? If we're going to introduce b2, why does it have length p? Why not 2p or 1000p? b2 is arbitrary.

Put another way, why does the 2-class version of the multinomial code use B but the logistic regression code uses A?

tlienart commented 3 years ago

Ok with your last point I think I understand where there's confusion. If I may rephrase what you're saying, you're saying that using the MultinomialRegression with penalty for 2 class should be the same as the LogisticRegression with penalty; the latter only has one set of parameter for one of the two classes, and so the same should apply for the MNR; the extension of this is that if you have K classes you only really need to estimate a matrix of (K-1) parameters.

This link is faulty however in that in fact the two are not identical. If you did write a MultinomialRegression with 2 classes and penalty, you would not get an identical result to that of a LogisticRegression with the same penalty. It's just not the same objective function therefore you get a different solution (*). Of course I don't suspect the result will behave drastically different, but it's still not the same thing.

() note: of course it is* the same objective function when there's no penalty.

If you take the reverse of that, and extend your line of thinking where for K classes you would estimate (K-1)x(p+1) parameters and penalise those, I think you would get something reasonable as well, it just wouldn't be the classical MNR with penalty that people consider.

So long story short, I think I fully understand what you're saying, the issue is now purely in terms of definition, what model corresponds to what objective function. In the case of the LogisticRegression, there's only one set of parameters for one of the two classes, this is how it's defined, you could have an alternative model called LogisticRegression2 with one set of parameters for each class, penalise that, get a unique solution, and that would probably be a decent model in its own right, it's just not what people use (note: in intro to deep learning class, that's probably what people would do so that they can more easily introduce the softmax).

Same for MNR, it has one set of parameters for each class, this is how it's defined, with a penalty this is identifiable; you could have an alternative model called MNR-(K-1) with fewer dimensions and penalise only those dimensions, that would probably be a decent model, but it's not what people use.

Would you agree with this?

JockLawrie commented 3 years ago

Yes agreed. It begs the question, why use the version with more parameters? They play no part in relating x to y, that is, they provide no extra predictive power. I don't see any advantages.

On Thu, 12 Aug. 2021, 18:58 Thibaut Lienart, @.***> wrote:

Ok with your last point I think I understand where there's confusion. If I may rephrase what you're saying, you're saying that using the MultinomialRegression with penalty for 2 class should be the same as the LogisticRegression with penalty; the latter only has one set of parameter for one of the two classes, and so the same should apply for the MNR; the extension of this is that if you have K classes you only really need to estimate a matrix of (K-1) parameters.

This link is faulty however in that in fact the two are not identical. If you did write a MultinomialRegression with 2 classes and penalty, you would not get an identical result to that of a LogisticRegression with the same penalty. It's just not the same objective function therefore you get a different solution. Of course I don't suspect the result will behave drastically different, but it's still not the same thing.

If you take the reverse of that, and extend your line of thinking where for K classes you would estimate (K-1)x(p+1) parameters and penalise those, I think you would get something reasonable as well, it just wouldn't be the classical MNR with penalty that people consider.

So long story short, I think I fully understand what you're saying, the issue is now purely in terms of definition, what model corresponds to what objective function. In the case of the LogisticRegression, there's only one set of parameters for one of the two classes, this is how it's defined, you could have an alternative model called LogisticRegression2 with one set of parameters for each class, penalise that, get a unique solution, and that would probably be a decent model in its own right, it's just not what people use. Same for MNR, it has one set of parameters for each class, this is how it's defined, with a penalty this is identifiable; you could have an alternative model called MNR-(K-1) with fewer dimensions and penalise only those dimensions, that would probably be a decent model, but it's not what people use.

Would you agree with this?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaAI/MLJLinearModels.jl/issues/105#issuecomment-897466975, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXLF5TZW5EEL6UHLQHH5N3T4OELBANCNFSM5B5QSFMQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

tlienart commented 3 years ago

Maybe just because it's simpler to interpret, there's a direct connection to the Multinomial distribution etc (same as for the logistic there's a direct connection with the Binomial)

Maybe an interesting experiment would be to check whether the model where you pick fewer dimensions and regularise generalises better, worse, or similar to the one with more dimensions. My hunch in this respect is that it doesn't matter much otherwise people would be using what you're suggesting by default.

There's a larger discussion that overparametrised model in ML don't necessarily do worse and actually often do better (for NNs), I'm not going to make a big connection though, here I just think it's out of ease with the connection to the multinomial distribution and that's about it.

Anyway I'll close this for now as I don't think there's much more to say, the conclusion, as I think you'll agree, is:

MNR as is implemented in the current package, corresponds to the "standard" MNR where you have one parameter vector per class, when you use a penalty, this is identifiable
MNR in the case where there is no penalty is equivalent to a reduced-dimensionality problem which could be implemented
LR corresponds to the reduced dimensionality problem, the penalty is only on the parameter of "one" class
There could be a MNR-2 to compare with LR, it would be a different objective, probably similar in terms of predictive power

One thing that could be done here further to this, in this package, is

implement a MNR-2
implement a reduced-MNR where you pick a base class and do everything relative to it, you penalise the parameters of the other classes (~ extension of LR)

if you're interested in either 1 or 2 I can point you to the relevant bits of code that would have to be written. Thanks for the discussion!