Help with documentation review

ablaom commented 1 year ago

I'm considering having a stab at some of #135 but could do with some help.

What does "EN" mean here?

Screen Shot 2023-01-20 at 3 50 17 PM

This appears in this doc page.

The same doc page gives nice tables of model algorithms but the corresponding MLJ model types are not listed. Would be good to have this, to save some detective work (and the user certainly wants this anyway). To make it easier, I'm copying the lists below:

Regressors	Formulation¹	Available solvers	Comments	Model type
OLS & Ridge	L2Loss + 0/L2	Analytical² or CG³		?
Lasso & Elastic-Net	L2Loss + 0/L2 + L1	(F)ISTA⁴		?
Robust 0/L2	RobustLoss⁵ + 0/L2	Newton, NewtonCG, LBFGS, IWLS-CG⁶	no scale⁷	?
Robust L1/EN	RobustLoss + 0/L2 + L1	(F)ISTA		?
Quantile⁸ + 0/L2	RobustLoss + 0/L2	LBFGS, IWLS-CG		?
Quantile L1/EN	RobustLoss + 0/L2 + L1	(F)ISTA		?

Classifiers	Formulation	Available solvers	Comments	Model type
Logistic 0/L2	LogisticLoss + 0/L2	Newton, Newton-CG, LBFGS	`yᵢ∈{±1}`	?
Logistic L1/EN	LogisticLoss + 0/L2 + L1	(F)ISTA	`yᵢ∈{±1}`	?
Multinomial 0/L2	MultinomialLoss + 0/L2	Newton-CG, LBFGS	`yᵢ∈{1,...,c}`	?
Multinomial L1/EN	MultinomialLoss + 0/L2 + L1	ISTA, FISTA	`yᵢ∈{1,...,c}`	?

Also, could we have a mapping from human name of solver (appearing in table) to Julia object to set as value in the model struct?

@tlienart @jbrea

tlienart commented 1 year ago

EN is Elastic Net, sum of L1 and L2 penalty
Assuming you want input output:
1. all of the regressors are continuous -> continuous\
2. logistic classifiers are continuous -> binary
3. multinomial classifiers are continuous -> multiclass

all of them are deterministic, this repo is purely about finding what people would call MLE or MAP estimator

I think what you want is to know how to set the solver field; a user could (though usually won't) indicate one of the relevant solver defined here: https://github.com/JuliaAI/MLJLinearModels.jl/blob/dev/src/fit/solvers.jl for the appropriate model. So for instance if the column says Analytical or CG then

solver = CG(...)
solver = Analytical(...)

would work for that model.

Analytical -> Analytical(...) (Analytical formula or iterative krylov-style solve that would be very close to analytical)
CG -> CG(...) (Conjugate gradient)\
ISTA -> ISTA(...) (iterative soft thresholding = proximal descent for L1)
FISTA -> FISTA(...) (Fast iterative soft thresholding, = same but with nesterov style acceleration)
Newton -> Newton(...) (newton method with full hessian solve)
NewtonCG -> NewtonCG(...) (same stuff but solving the hessian with CG)
LBFGS -> LBFGS(...) wrapper around Optim.LBFGS
IWLS-CG -> IWLSCG(...) iterative reweighted least sqaure with CG solve

hope that helps, happy to review your stab at this

ablaom commented 1 year ago

@tlienart The current doc strings say something like " if solver=nothing then the default will be used" but don't say what that default is, for each model. Can I get this without digging into the code? Is it always the first one in this table with ISTA the default where it says "(F)ISTA"?

It's a bit annoying that the default isn't the default, instead of nothing if you know what I mean.

I also got confused for a while until I realised ISTA and FISTA were aliases for slow/fast ProxGrad. I was looking for ages for docstrings for ISTA and FISTA but they don't exist. Probably there are other dummies like me who didn't guess this straight away. I will try to address this in my documentation PR.

Ditto CG (alias for Analaytical(iterative=true)).

tlienart commented 1 year ago

defaults

L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical() (matrix solve, possibly using an iterative solver)
LogisticLoss, L2Penalty (logistic regression) --> LBFGS()
MultinomialLoss, L2Penalty (multinomial regression) --> LBFGS()
SmoothLoss L2+L1 Penalty (lasso, elasticnet, logistic+multinomial with elastic net) --> FISTA()
RobustLoss, L2Penalty (quantile regression, ...) --> LBFGS()
Other --> error

Alternative solvers a user can specify

LogisticLoss/RobustLoss + L2Penalty --> Newton, NewtonCG
MultinomialLoss + L2Penalty --> NewtonCG
RobustLoss + L2Penalty --> IWLSCG
SmoothLoss + mix L2/L1 --> ISTA

in general the user should not specify these alternatives as they will be inferior to the default (there will be edge cases where this is not true but I don't think these are very relevant for a ML practitioner).

solver parameters with their defaults

`Analytical`

iterative::Bool=false whether to use a cholesky solve or a conjugate gradient (CG) solve
max_inner::Int=200 default number of inner iterations for an iterative solve; will be clamped with the dimension of the problem, i.e. the effective max number of iteration is min(max_inner, p) (https://github.com/JuliaAI/MLJLinearModels.jl/blob/30f7a30f62b6187cf5855c966d2489d71e28a19d/src/fit/analytical.jl#L38)
CG sugar for Analytical(iterative=true)

`Newton`

Solves the problem with a full solve of the Hessian

optim_options can pass a Optim.Options(...) object for things like f_tol (tolerance on objective), see general options
newton_options can pass a named tuple with things like linesearch = ... (see here)

`NewtonCG`

Solves the problem with a CG solve of the Hessian.

Same parameters as Newton except the naming: newtoncg_options

`LBFGS`

LBFGS solve; optim_options and lbfgs_options as per these docs

`ProxGrad`

A user should not call that constructor, the relevant flavours are ISTA (no accel) and FISTA (with accel); ProxGrad is not used for anything else than L1 penalized problems for now.

accel whether to use nesterov style acceleration
max_iter max number of descent iterations
tol tol on the relative change of the parameter
max_inner max number of inner iterations
beta shrinkage of the backtracking step

ISTA is ProxGrad for L1 with accel set to false; FISTA same story but with acceleration.

ISTA is not necessarily slower than FISTA but generally FISTA has a better chance of being faster. A non expert user should just use FISTA.

IWLSCG

Iteratively weighted least square with CG solve

max_iter number of max outer iterations (steps)
max_inner number of steps for the inner solves (conjugate gradient)
tol tolerance on the relative change of the parameter
damping how much to damp iterations should be between (0, 1] with 1 no damping
threshold threshold for the residuals (e.g. for quantile regression)

In general users should not use this. A bit like Newton, NewtonCG above, IWLSCG will typically be more expensive, but it's an interesting tool for people who are interested in solvers and provides a sanity check for other methods.

It's a bit annoying that the default isn't the default, instead of nothing if you know what I mean.

If you have a suggestion for a cleanup, maybe open an issue? (I'm actually not sure I know what you mean)

ablaom commented 1 year ago

L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver)

What does "possibly" mean? I'm guessing iteration=false for linear and iteration=true for ridge? Is that right?

ablaom commented 1 year ago

And I suppose we can add:

RobustLoss, with L1 + L2 Penalty (RobustRegressor, HuberRegressor) --> LBFGS

Yes?

ablaom commented 1 year ago

L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver) SmoothLoss L2+L1 Penalty (lasso, elasticnet, logistic+multinomial with elastic net) --> FISTA

Looks like you are saying that the default solver for LogisticClassifier and MultinomialClassifier depends on the value of the regularisation parameters (which would explain the nothing solver default). Is the default only Analytical(...) if L1 penalty is zero, and FISTA otherwise? But now I'm confused because (F)ISTA aren't listed as possible solvers for those models in the current docs.

ablaom commented 1 year ago

I appreciate the help but I'm think I must be asking the wrong questions. Here's what I want to do for each model M:

state clearly in docs what values the field solver may take on, eg, "any instance of LBFGS, ProxGrad".
state clearly what the default value is; if this is "dynamic", ie depends on values of other parameters, then I want a concise statement of the logic needed to determine what solver will be chosen.

Likely all this information is contained in want you are telling me, but I feel I have to "reverse engineer" the answer.

Does this better clarify my needs?

tlienart commented 1 year ago

L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver)

What does "possibly" mean? I'm guessing iteration=false for linear and iteration=true for ridge? Is that right?

no, both iteration=true/false can be used for either Linear or Ridge. In both cases you just have to solve a positive definite linear system of the form $Mx = b$ (just in Ridge it's perturbed by the identity to shift the spectrum away from zero); to solve such a system you can either do a full solve $M\b$ (using cholsolve) or you can use an iterative method such as conjugate gradient or krylov or whatever. The latter (iterative) can be good when the dimensionality of the problem is large. In general though, users should just use iterative=false, the full backsolve will work very well most of the time.

RobustLoss, with L1 + L2 Penalty (RobustRegressor, HuberRegressor) --> LBFGS

RobustLoss + L2 --> LBFGS RobustLoss + L2 + L1 --> FISTA

Looks like you are saying that the default solver for LogisticClassifier and MultinomialClassifier depends on the value of the regularisation parameters (which would explain the nothing solver default)

As soon as you have a non-smooth penalty such as L1, we cannot use smooth solvers and have to resort to proxgrad. So yes as soon as there's a non-zero coefficient in front of the L1 penalty, a FISTA solver is picked.

But now I'm confused because (F)ISTA aren't listed as possible solvers for those models in the current docs.

state clearly in docs what values the field solver may take on, eg, "any instance of LBFGS, ProxGrad". state clearly what the default value is; if this is "dynamic", ie depends on values of other parameters, then I want a concise statement of the logic needed to determine what solver will be chosen.

isn't what I quoted in my previous answer under defaults what you wanted?

Maybe to simplify (I'm aware you have limited bandwidth and that it's not helping to have a long conversation), how about we do this just for Linear+Ridge in a draft PR, we get to a satisfactory point and then we progress from there?

MLJ constructors:

LinearRegressor, RidgeRegressor

for both the solver can be specified to be Analytical(...). The default is Analytical(). Difference with default is if the user passes ;iterative=true in which case they may also specify max_inner

ablaom commented 1 year ago

@tlienart Thanks for the additional help and your patience. #138 is now ready for your review.

JuliaAI / MLJLinearModels.jl