If we ignore acceleration for now, the core of the proximal gradient step for the Lasso is as follows. The gradient is the usual `X'(Xb - y)`, which in your case, according to what you said, should be computed as `Cb - z`, where `C = X'X` and `z = X'y` have been pre-computed. The prox step is just a shrinkage of the current candidate of size `p`, the dimensionality of your problem, which should be manageable given what you describe.
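A minimal sketch of that step, assuming `C = X'X` and `z = X'y` are already available (the names and the step-size choice here are illustrative, not the package's implementation):

```julia
using LinearAlgebra

# Soft-thresholding: the prox operator of t*‖·‖₁, applied elementwise.
soft_threshold(x, t) = sign(x) * max(abs(x) - t, 0)

# Unaccelerated proximal gradient for ½‖Xb - y‖² + λ‖b‖₁, written purely
# in terms of C = X'X and z = X'y.
function lasso_proxgrad(C::AbstractMatrix, z::AbstractVector, λ::Real;
                        η::Real = 1 / opnorm(C), niter::Int = 1_000)
    b = zeros(length(z))
    for _ in 1:niter
        g = C * b - z                            # gradient X'(Xb - y) = Cb - z
        b = soft_threshold.(b .- η .* g, η * λ)  # shrinkage step over p coords
    end
    return b
end
```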
The steps to get this done in this package, if that's something you'd like to try, would be:

1. a `fit` method here which takes whatever object you use to represent the map `Cx` (e.g. https://github.com/JuliaLinearAlgebra/LinearMaps.jl); so instead of passing `X`, `y`, you'd pass `C`, `z`;
2. use `proxgrad.jl` as a template; you can leave everything the same except the computation of `_f` and `_fg!` (https://github.com/JuliaAI/MLJLinearModels.jl/blob/355edad72978b01586567c78131ce276b970632e/src/fit/proxgrad.jl#L21-L23), which would need to make use of `C` and `z`. To help you with that, look at `smooth_fg!` for the `L2Loss`: you'd have to add one specialising over the type you use for `C` (see the sketch below).
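A hypothetical sketch of what that specialisation could look like; the `Gramian` name and the exact signature are assumptions for illustration, not the package's actual API:

```julia
using LinearAlgebra

# Hypothetical wrapper for the precomputed quantities. `C` can be a Matrix
# or anything supporting `mul!` (e.g. a LinearMap); `z = X'y`.
struct Gramian{TC,Tz}
    C::TC
    z::Tz
end

# Smooth part of the objective and its gradient, via C and z:
#   f(b)  = ½ b'Cb - z'b   (equals ½‖Xb - y‖² up to a constant in b)
#   ∇f(b) = Cb - z
function smooth_fg!(g::AbstractVector, b::AbstractVector, G::Gramian)
    mul!(g, G.C, b)                  # g = Cb
    f = dot(b, g) / 2 - dot(G.z, b)
    g .-= G.z                        # g = Cb - z
    return f
end
```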
Thank you for the outline; this would be a pretty great feature for me, so I will give it a shot.

One thing I don't fully understand: why make a new type for `X'X`? Shouldn't it just be a `Matrix`, or whatever the same type that `X` is? I guess this is just to control dispatch, but then doesn't that mean I'd have to manually forward all methods of `Gramian` (or whatever the type would be) to `Matrix`? It seems a little simpler to have some kind of flag in the function like `gram=true` or `data=:gramian`.
I didn't make the assumption that you could get the actual `Z = X'X` matrix, rather that you could compute the application of it, i.e. `Zx` for a given `x`. This is all you need. Anyway, you could very well just do all I've said with the assumption that you have `Z` explicitly, of course.
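For instance, a minimal sketch with LinearMaps.jl where applying `Z` computes `X'(Xv)` without ever forming `X'X` (the data `X` here is just a stand-in):

```julia
using LinearMaps, LinearAlgebra

X = randn(10_000, 50)   # stand-in for data that may not fit in memory
y = randn(10_000)

# Matrix-free operator for Z = X'X: applying it computes X'(X*v),
# so X'X itself is never materialised.
Z = LinearMap(v -> X' * (X * v), 50; issymmetric=true, isposdef=true)
z = X' * y              # X'y

Z * ones(50)            # one application of Z
```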
When `X` cannot fit in memory but `X'X` and `X'y` can be constructed in a distributed fashion, it would be incredibly useful to have solvers that can train solely on the covariances. Ridge can be done by solving a quadratic program over `X'X + c*I`, but as far as I know there are no capabilities in Julia to do this with Lasso or ElasticNet besides passing the problem to a generic optimizer.

As a bonus, I believe this is also pretty hard to do in Python. The best I could find was `sklearn.linear_model.lars_path_gram`, which frequently gives me convergence issues, so there are bragging rights up for grabs :)

This is something I am willing to help work on if it is feasible given the current framework.
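For reference, the ridge case from the covariances alone reduces to a linear solve; a minimal sketch, assuming `C = X'X` and `z = X'y` are given:

```julia
using LinearAlgebra

# Ridge from the covariances alone: b solves (C + c*I) b = z,
# the normal equations of ½‖Xb - y‖² + ½c‖b‖².
ridge_from_gram(C::AbstractMatrix, z::AbstractVector, c::Real) = (C + c * I) \ z
```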