JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)
MIT License

Could we make regularised loss functions independent of the sample size? #108

Closed jbrea closed 2 years ago

jbrea commented 2 years ago

Currently this package does ridge regression with the loss function |Xθ - y|₂²/2 + λ|θ|₂²/2. However, I would be in favour of |Xθ - y|₂²/2 + λ·n·|θ|₂²/2, where n is the number of samples (see also #81). I find it more intuitive to think of the regularisation constant lambda as a model-specific parameter that is independent of the data size. For example, I would prefer to get the same result for

julia> using MLJ, MLJLinearModels
julia> X, y = make_regression()
julia> X2 = MLJ.table(vcat(fill(MLJ.MLJBase.matrix(X), 5)...)); y2 = vcat(fill(y, 5)...);
julia> mach = fit!(machine(RidgeRegressor(lambda = 1), X, y)); fitted_params(mach)
[ Info: Training Machine{RidgeRegressor,…}.
(coefs = [:x1 => 0.9599766571815262, :x2 => 0.8847156587057124],
 intercept = 0.2977393976887245,)

julia> mach = fit!(machine(RidgeRegressor(lambda = 1), X2, y2)); fitted_params(mach)
[ Info: Training Machine{RidgeRegressor,…}.
(coefs = [:x1 => 0.9661422738548736, :x2 => 0.8912186012159028],
 intercept = 0.29999631435548496,)

But instead, with the current version, one has to scale the regularisation constant to obtain the same result:

julia> mach = fit!(machine(RidgeRegressor(lambda = 5), X2, y2)); fitted_params(mach)
[ Info: Training Machine{RidgeRegressor,…}.
(coefs = [:x1 => 0.9599766571815265, :x2 => 0.8847156587057119],
 intercept = 0.29773939768872437,)

(The same also applies to the other regularised models.)
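For instance with the lasso (sketch only, reusing X, X2, y, y2 from above; I haven't pasted the outputs, but by the same argument the two fits below should give the same coefficients up to solver tolerance):

julia> mach = fit!(machine(LassoRegressor(lambda = 0.1), X, y)); fitted_params(mach)

julia> mach = fit!(machine(LassoRegressor(lambda = 0.5), X2, y2)); fitted_params(mach)  # lambda scaled by 5 for the 5x replicated data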

If you are up for such a breaking change, I could start a PR.

tlienart commented 2 years ago

Hello @jbrea, yeah, it seems like this is a common stumbling point; thanks for the feedback!

An issue here is that the package was designed to keep its components well separated, so making the regularisation depend on the sample size needs a bit of plumbing. To see why:

If you look at the constructor for Ridge: https://github.com/JuliaAI/MLJLinearModels.jl/blob/add29849c4f79a340c6d617090341270a52a87f0/src/glr/constructors.jl#L53-L59 you'll see that it calls the generic "GLR" constructor with

loss    = L2Loss
penalty = L2Penalty * lambda

but at the constructor level, no data has been seen so there's no n (or indeed no p) here.

Use a parameter

What I think would be more elegant (imo) is to add a boolean parameter average_loss (which we could default to true in a breaking release), so that the constructor for Ridge would become something like

RidgeRegression(...; average_loss=true)
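For concreteness, something along these lines (keyword and field names here are purely illustrative, not the current API):

function RidgeRegression(λ::Real=1.0; lambda::Real=λ, average_loss::Bool=true,
                         fit_intercept::Bool=true)
    GLR(penalty=lambda * L2Penalty(),
        fit_intercept=fit_intercept,
        average_loss=average_loss)   # hypothetical new field on GLR
end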

This flag could then be picked up in the various d_* files by scaling the lambda as per your message, i.e. changing this line: https://github.com/JuliaAI/MLJLinearModels.jl/blob/add29849c4f79a340c6d617090341270a52a87f0/src/glr/d_l2loss.jl#L17

to

λ = getscale(glr.penalty) * ifelse(glr.average_loss, size(X, 1), 1)

I think that should work.

The function objective should also be adjusted accordingly:

this one could be kept and adjusted: https://github.com/JuliaAI/MLJLinearModels.jl/blob/add29849c4f79a340c6d617090341270a52a87f0/src/glr/utils.jl#L20-L21

but this one would probably need to be removed:

https://github.com/JuliaAI/MLJLinearModels.jl/blob/dev/src/glr/utils.jl#L11
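To fix ideas, the numerical intent of the flag boils down to something like this self-contained sketch (plain Julia, independent of the package internals):

# current behaviour: the penalty's relative weight shrinks as n grows
ridge_objective(θ, X, y, λ) = sum(abs2, X*θ .- y)/2 + λ * sum(abs2, θ)/2

# with average_loss = true: the penalty is rescaled by n, so the minimiser is
# unchanged if the data set is replicated k times with the same λ
ridge_objective_scaled(θ, X, y, λ) = sum(abs2, X*θ .- y)/2 + size(X, 1) * λ * sum(abs2, θ)/2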

If you want to start this, I'll be happy to review and go through it gradually; I think it's doable and potentially desirable.

jbrea commented 2 years ago

@tlienart thanks a lot for the prompt and detailed response. I'll start this asap (but maybe not before the end of the year...).

jbrea commented 2 years ago

Just for completeness: I think the current implementation is problematic in a TunedModel that tunes the regularisation constant with e.g. cross-validation: each fold is fitted on a subset of the data, and the optimal value found is then used, without rescaling, to fit the whole data set (with train_best = true).

tlienart commented 2 years ago

You mean the suggestion with average_loss, not the current implementation, right? Also, I guess for CV you would want a specialised method anyway (which I unfortunately never got around to implementing).

imo the unscaled loss makes more sense for hyperparameter optimisation, but maybe that's just a matter of taste; of course it would impact the range over which you optimise.

With respect to the issue you opened, would you have another suggestion for addressing it that doesn't require the loss function to depend upon the data?

One quick additional thought is that another approach could be to just rescale the data by sqrt(n) before fitting the model...
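(The idea being that dividing both X and y by sqrt(n) turns the sum-of-squares data term into an average while leaving the penalty untouched; ignoring the intercept handling, a quick sketch:)

Xmat = MLJ.MLJBase.matrix(X)
n = size(Xmat, 1)
Xs, ys = Xmat ./ sqrt(n), y ./ sqrt(n)   # sum(abs2, Xs*θ .- ys) ≈ sum(abs2, Xmat*θ .- y) / n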

jbrea commented 2 years ago

You mean the suggestion with average_loss not the current implementation right?

No, I mean the current implementation. In hyperparameter tuning one fits on, say, 4/5 of the data for each fold and finds some optimal lambda. With train_best = true this lambda is then used to fit all the data at the end. This is wrong: one should use 5/4 * lambda when fitting all the data. This issue would be solved with the average_loss you suggested above, I think.
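(Spelling out the arithmetic, assuming the current sum-of-squares objective: the effective regularisation strength on a fold with 4n/5 points is lambda / (4n/5), so matching it on all n points requires 5/4 * lambda, e.g.)

λ_fold = 1.0
n_full, n_fold = 100, 80              # each fold sees 4/5 of the data
λ_full = λ_fold * n_full / n_fold     # = 1.25, i.e. 5/4 * λ_fold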

With respect to the issue you opened, would you have another suggestion for addressing it that doesn't require the loss function to depend upon the data?

No. But your suggestion above makes sense to me.

Also I guess for CV you would want a specialised method anyway (which I unfortunately never got to implementing)

Is there a specialised method for general CV? I am only aware of specialised methods for LOOCV...

One quick additional thought is that another approach could be to just rescale the data by sqrt(n) before fitting the model...

Not sure this is a good idea, as one would need to mutate the original data (very confusing for the user) or make a copy (not ideal for large datasets).

tlienart commented 2 years ago

Thanks for the clarification and apologies for the confusion. I agree that the scaling is not a good approach.

With respect to specialised methods: yes, LOO only. Funnily enough I wrote a blog post 2y ago on trying to generalise it and on why that's not as easy as LOO, though you might still want a specialised method to recycle singular values... anyway, that's for another issue.

Would you like to start a PR with the average_loss approach then? I think I'll have some time over the end of the year for help & reviews; this could be a good addition.

Thanks!