JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)
MIT License

MultinomialClassifier does not seem consistent #102

Closed: olivierlabayle closed this 3 years ago

olivierlabayle commented 3 years ago

Hello,

I'm currently trying to use the LogisticClassifier in a multiclass setting. If I understand correctly, on a synthetic dataset generated with a softmax, the model should be able to retrieve the correct coefficients, is that right? If so, this does not seem to be the case. Moreover, fitted_params returns coefficient vectors with n-1 entries instead of n, even when fit_intercept is set to false. Below is a sample code to reproduce the issue. Let me know if there is anything wrong with either my understanding or the use case.

Thanks

using MLJ
using StatsBase
using StableRNGs

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels

rng = StableRNG(123)

# Dataset sampled according to a softmax
n = 10000
X = rand(rng, n, 3)
θ = [ 1 2 -3 4;
     2 -3 2 -2;
     -1 3 -1 2]
M = exp.(X*θ)
mu = M  ./ sum(M, dims=2)
y = categorical([sample(rng, [1, 2, 3, 4], Weights(mu[i, :])) for i in 1:n])

# Fitting
mach = machine(LogisticClassifier(fit_intercept=false, lambda=0, gamma=0), MLJ.table(X), y)
fit!(mach)
fp = fitted_params(mach)

# Inspection
# Only 8 params
fp.coefs
# Here 12 params, but they don't seem to match the truth even though the model is correctly specified
mach.fitresult[1]
tlienart commented 3 years ago

OK, I think there are a few issues with your example; below is some code that is purely MLJLinearModels (MLJLM hereafter) to make things simpler and less confusing.

One main issue you have in your code is that your dimensions don't match what's expected. The convention in the package is that X is n x p, where n is the number of points and p the number of features. Correspondingly, the parameters (when there's no intercept) are p x c, where c is the number of classes.

There's a mixup in your code with 3s and 4s as far as I can see.

The code below does a fit of something like your example:

Generate points

I don't use StatsBase here, just to remove a possible confounding factor.

using StableRNGs, MLJLinearModels
rng = StableRNG(123)

n = 10_000
p = 4           # number of features
c = 3           # number of classes
X = rand(rng, n, p)
θ = permutedims([ 1 2 -3 4;
      2 -3 2 -2;
     -1 3 -1 2])          # this is now p x c; note the permutedims
M = exp.(X * θ)
Mn = M ./ sum(M, dims=2)

# basic sampling from a multinomial (you can ignore that if you want)
be = reshape(rand(rng, n * c), n, c)
y  = zeros(Int, n)
@inbounds for i in eachindex(y)
    rp = 1.0
    for k in 1:c-1
        if (be[i, k] < Mn[i, k] / rp)
            y[i] = k
            break
        end
        rp -= Mn[i, k]
    end
end
y[y .== 0] .= c

Fit stuff and compute the objective

λ = 1

mnr = MultinomialRegression(λ; fit_intercept=false)
J = objective(mnr, X, y; c=c)
θ̂ = fit(mnr, X, y)

@show J(reshape(θ, p*c))
@show J(θ̂)

θ̂_reshaped = reshape(θ̂, p, c)

The last item, θ̂_reshaped, has dimensions p x c (after the reshape), with the same meaning as the generating θ. J(...) computes the objective function corresponding to the classifier.

Note: you can do the same with an intercept (just remove the fit_intercept=false); it won't change much, except that you'll need to reshape to (p+1, c).
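For instance, a minimal sketch of the intercept variant, reusing λ, X, y, p and c from above (that the intercepts occupy the last row is my assumption about the layout):

mnr_i = MultinomialRegression(λ)          # fit_intercept is true by default, per the note above
θ̂_i = fit(mnr_i, X, y)                    # vector of length (p+1) * c
θ̂_i_reshaped = reshape(θ̂_i, p + 1, c)     # last row: the intercepts (assumption)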

Running this, I get that the objective for the generating parameters is 7331.75 and that for the fitted parameters is 7322.00 (i.e., slightly less, but that doesn't matter much; what matters is that they're close, so in terms of ML we're doing a good job).

Are the parameters close? No. Should they be? Not really, no. The only thing that matters is the objective function and finding a set of parameters that minimises it. You might see convergence if you generate huge quantities of points, but I think there'll be numerical issues before you see what you hope to see.
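In code, the check I'm relying on is just this (using J and θ̂ from above; the 1e-6 is an arbitrary slack for solver tolerance):

# the fitted parameters should achieve an objective no worse than the
# generating ones, up to solver tolerance
@assert J(θ̂) <= J(reshape(θ, p*c)) + 1e-6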

Maybe I'm not credible; how about scikit-learn?

Scikit-learn is a pretty mature package; maybe we can check what it gets and see whether that looks more reasonable than what MLJLM gives, so that it can act as some form of credible arbiter.


# using Conda
# Conda.add("scikit-learn")

using PyCall
SKLEARN_LM = pyimport("sklearn.linear_model")

mnr_sk = SKLEARN_LM.LogisticRegression(
            C=1, multi_class="multinomial", fit_intercept=false,
            random_state=555)
mnr_sk.fit(X, y)

θ_sk = vec(mnr_sk.coef_')
@show J(θ_sk)

θ_sk_reshaped = reshape(θ_sk, p, c)

maximum(abs.(θ̂_reshaped - θ_sk_reshaped) / maximum(θ_sk_reshaped)) * 100

If you run this, the last command will give you something under 5%, indicating that the relative difference between what MLJLM gives and what SKL finds is small. Are any of those close to the generating θ? Still no.

What about the objective? The @show J(θ_sk) above prints it, and it comes out essentially the same as what MLJLM reaches.

So if you're inclined to believe SKL, then basically MLJLM does something very similar. I'm not saying MLJLM is perfect (if anything, better docs might have prevented your mixup of dimensions), but I think that for the basic case it's doing an OK job.

I hope this clarifies stuff a bit!

olivierlabayle commented 3 years ago

Hi Thibault and thank you for your quick answer!

Obviously there is something I misunderstand about multinomial regression. Maybe you can clarify it further; that would be much appreciated.

I don't understand why you say there is a mismatch in dimensions: I intended X to be (n, 3) and y to have 4 categories. With that in mind, I expect the coefficients to be represented by a (3, 4) matrix, or (4, 3) depending on how you represent it (considering that I set fit_intercept=false). But fitted_params only returns 8 coefficients in the example I provided (while mach.fitresult[1] has 12, as I would have expected).

My belief is that we are doing maximum likelihood estimation, so if the model is correctly specified (not over-parameterised, e.g. no superfluous intercept, and no regularisation), the estimator should be consistent and converge to the truth. As you show, it seems not to be the case. I don't really understand why; maybe the loss function used is not the negative log-likelihood?
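To spell out the property I have in mind (standard maximum-likelihood consistency, stated informally):

\hat{\theta}_n = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i) \;\xrightarrow{p}\; \theta^{\star} \quad \text{as } n \to \infty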

Thanks for your help!

tlienart commented 3 years ago

It might have been me getting confused by your code. Using p=3 and c=4 is perfectly fine of course; you can run the same code as I provided and just remove the permutedims, and everything else will hold.

In terms of dimensions, with fit_intercept=false you'll get a vector of 12 elements, which is p * c; you reshape it as reshape(theta, p, c). If you do fit_intercept=true, you'll get a vector of 16 elements, which is (p+1) * c.
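Concretely, with your p = 3 and c = 4, something like this (a sketch; it assumes X as in your snippet and y as a plain integer vector with values in 1:4, which is what the direct MLJLM fit expects):

θ̂ = fit(MultinomialRegression(0.0; fit_intercept=false), X, y)   # length 3 * 4 = 12
W = reshape(θ̂, 3, 4)                                              # p x c
# with fit_intercept=true the vector has length (3 + 1) * 4 = 16
# and reshapes to (4, 4)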

In terms of the fit, I think we're probably talking past each other. If you use a super simple model (like a linear regression with two parameters) and a huge number of points from a centred distribution, then eventually, yes, you will get convergence to the generating parameters; that's what the theory tells you. But in practice you never really get to that regime once the dimensionality of the problem starts increasing. This doesn't mean that anything is "wrong"; it's just that the objective function guides you to anything that has a good error metric, and as I was hoping to show above, with finitely many points the generating parameters often give a worse objective than the fitted ones.

Maybe I could have made my point simpler by just saying: if you want to check whether a package is doing something reasonable, the only thing you can effectively check is whether the objective function is minimised. Checking whether the recovered parameters are close to the generating ones is generally not a good check. This is why I showed the comparison between the objective function you get from here and from sklearn: they come out essentially identical.

PS: yes, the loss function is the negative log-likelihood; you can inspect it here if you'd like: https://github.com/JuliaAI/MLJLinearModels.jl/blob/6d4b4886ba0494793de8d256c1ecdabef0c8beaa/src/loss-penalty/standard.jl#L95-L129
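For completeness, here is a minimal sketch of that negative log-likelihood in the unpenalised case (i.e. what the linked objective reduces to when λ = γ = 0, up to the exact penalty bookkeeping); the function name nll and its signature are mine, not the package's, and it is not numerically stabilised:

# multinomial logistic negative log-likelihood; θvec is the flattened
# p * c parameter vector, y takes integer values in 1:c
function nll(θvec, X, y, c)
    n, p = size(X)
    S = X * reshape(θvec, p, c)            # n x c matrix of scores
    lse = log.(sum(exp.(S), dims=2))       # log-sum-exp of each row
    return sum(lse) - sum(S[i, y[i]] for i in 1:n)
end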

olivierlabayle commented 3 years ago

Sorry, I realize the title of my issue might sound aggressive; it is not intended as such, and I don't mean that the package is broken in any way. That being said:

using MLJ
using StatsBase
using StableRNGs
using MLJLinearModels
using Plots

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels

rng = StableRNG(123)

θ = [ 1 2 -3 4;
     2 -3 2 -2;
     -1 3 -1 2]

all_fitted_params = []
ns = convert(Vector{Int}, [1e3, 1e4, 1e5, 1e6, 1e7])

for n in ns
    # Data generation
    X = rand(rng, n, 3)
    M = exp.(X*θ)
    mu = M  ./ sum(M, dims=2)
    y = categorical([sample(rng, [1, 2, 3, 4], Weights(mu[i, :])) for i in 1:n])
    # Fitting
    mach = machine(LogisticClassifier(fit_intercept=false, lambda=0, gamma=0), MLJ.table(X), y)
    fit!(mach)
    # Parameter extraction
    params = mach.fitresult[1]
    push!(all_fitted_params, params)
end

distance(θ₁, θ₂) = maximum(abs.(θ₁ - θ₂) / maximum(θ₂)) * 100

distances_to_asymptotic = [distance(p, all_fitted_params[end]) for p in all_fitted_params]

distances_to_truth = [distance(reshape(p, :, 4), θ) for p in all_fitted_params]

plot(ns, distances_to_asymptotic, label="Distance to the asymptotic limit")
plot!(ns, distances_to_truth, label="Distance to the truth")
[Plot: distance to the asymptotic limit and distance to the truth, as a function of n]

  • Regarding the fitted_params for the MLJ interface function: looking at this line, shouldn't W be passed instead of W[1:end-1, :] when fit_intercept=false?
tlienart commented 3 years ago

> Sorry, I realize the title of my issue might sound aggressive; it is not intended as such, and I don't mean that the package is broken in any way. That being said:

No worries at all, I didn't take it as such; sorry if I sounded too defensive.

> • Regarding the fitted_params for the MLJ interface function: looking at this line, shouldn't W be passed instead of W[1:end-1, :] when fit_intercept=false?

Yes! Thanks, I got sidetracked in the discussion and should have also checked that the interface was correct there. Thanks for this, a fix is on its way.
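For anyone reading along, the fix amounts to something like this in the interface code (a sketch, not the exact committed patch; m.fit_intercept is an assumed field name):

if m.fit_intercept
    coefs = W[1:end-1, :]       # per-feature coefficient rows
    intercept = W[end, :]       # the last row holds the intercepts
else
    coefs = W                   # nothing to strip off
    intercept = nothing
end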

I'll close this for now, but please reopen if you'd like to discuss anything further.