Closed. ablaom closed this issue 1 year ago.
All of them are deterministic; this repo is purely about finding what people would call the MLE or MAP estimator.
`solver` field; a user could (though usually won't) indicate one of the relevant solvers defined here: https://github.com/JuliaAI/MLJLinearModels.jl/blob/dev/src/fit/solvers.jl for the appropriate model. So for instance, if the column says Analytical or CG, then `solver = CG(...)` or `solver = Analytical(...)` would work for that model.
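For instance (a usage sketch, assuming MLJLinearModels is loaded; check the docstrings for the exact keyword names):

```julia
using MLJLinearModels

# default solver is picked automatically when `solver` is left as `nothing`
ridge = RidgeRegressor()

# explicitly requesting the iterative (CG) flavour of the analytical solve
ridge_cg = RidgeRegressor(solver = CG())
```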
- `Analytical(...)`: analytical formula, or an iterative Krylov-style solve that would be very close to analytical
- `CG(...)`: conjugate gradient
- `ISTA(...)`: iterative soft thresholding (= proximal descent for L1)
- `FISTA(...)`: fast iterative soft thresholding (= same but with Nesterov-style acceleration)
- `Newton(...)`: Newton's method with a full Hessian solve
- `NewtonCG(...)`: same, but solving the Hessian system with CG
- `LBFGS(...)`: wrapper around `Optim.LBFGS`
- `IWLSCG(...)`: iteratively reweighted least squares with a CG solve

Hope that helps; happy to review your stab at this.
@tlienart The current docstrings say something like "if `solver=nothing` then the default will be used", but don't say what that default is for each model. Can I get this without digging into the code? Is it always the first one in this table, with ISTA the default where it says "(F)ISTA"?
It's a bit annoying that the default isn't the default, instead of `nothing`, if you know what I mean.
I also got confused for a while until I realised `ISTA` and `FISTA` were aliases for slow/fast `ProxGrad`. I was looking for ages for docstrings for `ISTA` and `FISTA`, but they don't exist. Probably there are other dummies like me who didn't guess this straight away; I will try to address this in my documentation PR. Ditto `CG` (alias for `Analytical(iterative=true)`).
Defaults:
- `Analytical()` (matrix solve, possibly using an iterative solver)
- `LBFGS()`
- `LBFGS()`
- `FISTA()`
- `LBFGS()`
Alternative solvers a user can specify:
In general the user should not specify these alternatives, as they will be inferior to the default (there will be edge cases where this is not true, but I don't think these are very relevant for an ML practitioner).
`Analytical`
- `iterative::Bool=false`: whether to use a Cholesky solve or a conjugate gradient (CG) solve
- `max_inner::Int=200`: default number of inner iterations for an iterative solve; will be clamped by the dimension of the problem, i.e. the effective max number of iterations is `min(max_inner, p)` (https://github.com/JuliaAI/MLJLinearModels.jl/blob/30f7a30f62b6187cf5855c966d2489d71e28a19d/src/fit/analytical.jl#L38)

`CG`
Sugar for `Analytical(iterative=true)`.
`Newton`
Solves the problem with a full solve of the Hessian.
- `optim_options`: can pass an `Optim.Options(...)` object for things like `f_tol` (tolerance on the objective), see the general options
- `newton_options`: can pass a named tuple with things like `linesearch = ...` (see here)

`NewtonCG`
Solves the problem with a CG solve of the Hessian. Same parameters as `Newton` except the naming: `newtoncg_options`.
`LBFGS`
LBFGS solve; `optim_options` and `lbfgs_options` as per these docs.
`ProxGrad`
A user should not call this constructor directly; the relevant flavours are `ISTA` (no acceleration) and `FISTA` (with acceleration). For now, `ProxGrad` is only used for L1-penalized problems.
- `accel`: whether to use Nesterov-style acceleration
- `max_iter`: max number of descent iterations
- `tol`: tolerance on the relative change of the parameter
- `max_inner`: max number of inner iterations
- `beta`: shrinkage of the backtracking step

`ISTA` is `ProxGrad` for L1 with `accel` set to `false`; `FISTA` is the same but with acceleration. ISTA is not necessarily slower than FISTA, but FISTA generally has a better chance of being faster. A non-expert user should just use FISTA.
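The ISTA/FISTA distinction can be sketched in a few lines of plain Julia (illustrative only; the package's `ProxGrad` additionally does backtracking on the step size, which this sketch skips by using a fixed step):

```julia
using LinearAlgebra

# soft-thresholding operator: the proximal map of t * ||.||_1
soft_threshold(x, t) = sign.(x) .* max.(abs.(x) .- t, 0.0)

# proximal gradient for 1/2 ||X*theta - y||^2 + lambda * ||theta||_1;
# accel=false is ISTA, accel=true is FISTA (Nesterov-style momentum)
function prox_grad(X, y, lambda; accel=false, max_iter=500)
    p = size(X, 2)
    step = 1 / opnorm(X)^2              # 1 / Lipschitz constant of the gradient
    theta = zeros(p); z = copy(theta); tprev = 1.0
    for _ in 1:max_iter
        grad = X' * (X * z - y)
        theta_new = soft_threshold(z - step * grad, step * lambda)
        if accel                        # FISTA: extrapolate with the momentum term
            t = (1 + sqrt(1 + 4 * tprev^2)) / 2
            z = theta_new .+ ((tprev - 1) / t) .* (theta_new .- theta)
            tprev = t
        else                            # ISTA: plain proximal descent
            z = theta_new
        end
        theta = theta_new
    end
    return theta
end
```

On a small toy design both variants reach the same lasso solution; FISTA just tends to get there in fewer iterations.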
`IWLSCG`
Iteratively reweighted least squares with a CG solve.
- `max_iter`: max number of outer iterations (steps)
- `max_inner`: number of steps for the inner solves (conjugate gradient)
- `tol`: tolerance on the relative change of the parameter
- `damping`: how much to damp iterations; should be in `(0, 1]`, with `1` meaning no damping
- `threshold`: threshold for the residuals (e.g. for quantile regression)

In general users should not use this. A bit like `Newton` and `NewtonCG` above, `IWLSCG` will typically be more expensive, but it's an interesting tool for people who are interested in solvers, and it provides a sanity check for other methods.
> It's a bit annoying that the default isn't the default, instead of nothing if you know what I mean.
If you have a suggestion for a cleanup, maybe open an issue? (I'm actually not sure I know what you mean)
> L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver)
What does "possibly" mean? I'm guessing `iterative=false` for linear and `iterative=true` for ridge? Is that right?
And I suppose we can add:
RobustLoss, with L1 + L2 Penalty (RobustRegressor, HuberRegressor) --> LBFGS
Yes?
> L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver)
> SmoothLoss, L2+L1 Penalty (lasso, elasticnet, logistic+multinomial with elastic net) --> FISTA
Looks like you are saying that the default solver for LogisticClassifier and MultinomialClassifier depends on the value of the regularisation parameters (which would explain the `nothing` solver default). Is the default only `Analytical(...)` if the L1 penalty is zero, and FISTA otherwise? But now I'm confused because (F)ISTA aren't listed as possible solvers for those models in the current docs.
I appreciate the help, but I think I must be asking the wrong questions. Here's what I want to do for each model M: state clearly in the docs what values the field `solver` may take on, e.g. "any instance of `LBFGS`, `ProxGrad`". Likely all this information is contained in what you are telling me, but I feel I have to "reverse engineer" the answer.
Does this better clarify my needs?
> L2Loss, L2Penalty (linear regression, ridge regression) --> default is Analytical (matrix solve, possibly using an iterative solver)
>
> What does "possibly" mean? I'm guessing `iterative=false` for linear and `iterative=true` for ridge? Is that right?
No, both `iterative=true` and `iterative=false` can be used for either Linear or Ridge. In both cases you just have to solve a positive-definite linear system of the form $Mx = b$ (in Ridge it's just perturbed by the identity to shift the spectrum away from zero); to solve such a system you can either do a full solve `M \ b` (using a Cholesky factorisation) or use an iterative method such as conjugate gradient or another Krylov method. The latter (iterative) can be good when the dimensionality of the problem is large.
In general, though, users should just use `iterative=false`; the full backsolve will work very well most of the time.
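The two strategies can be sketched in plain Julia (a toy sketch, not the package code; the hand-rolled CG loop stands in for whatever Krylov routine is actually used):

```julia
using LinearAlgebra

# full backsolve of (X'X + lambda*I) theta = X'y via Cholesky
ridge_cholesky(X, y, lambda) =
    cholesky(Symmetric(X' * X + lambda * I)) \ (X' * y)

# same system solved with conjugate gradient; as in the package,
# the number of inner iterations is clamped by the dimension p
function ridge_cg(X, y, lambda; max_inner=200, tol=1e-10)
    p = size(X, 2)
    M = X' * X + lambda * I
    b = X' * y
    theta = zeros(p); r = copy(b); d = copy(r); rs = dot(r, r)
    for _ in 1:min(max_inner, p)
        Md = M * d
        alpha = rs / dot(d, Md)
        theta .+= alpha .* d
        r .-= alpha .* Md
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        d = r .+ (rs_new / rs) .* d
        rs = rs_new
    end
    return theta
end
```

Both return the same ridge estimate; the iterative path just avoids factorising the full matrix.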
> RobustLoss, with L1 + L2 Penalty (RobustRegressor, HuberRegressor) --> LBFGS

RobustLoss + L2 --> LBFGS
RobustLoss + L2 + L1 --> FISTA
> Looks like you are saying that the default solver for LogisticClassifier and MultinomialClassifier depends on the value of the regularisation parameters (which would explain the nothing solver default)
As soon as you have a non-smooth penalty such as L1, we cannot use smooth solvers and have to resort to proximal gradient methods. So yes: as soon as there's a non-zero coefficient in front of the L1 penalty, a FISTA solver is picked.
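In pseudocode terms (a hedged sketch of the logic just described, not the package's actual dispatch; the names are purely illustrative):

```julia
# illustrative default-solver selection: a nonzero L1 coefficient makes the
# objective non-smooth, forcing a proximal-gradient (FISTA) solver; otherwise
# least-squares problems get the analytical solve and other smooth losses LBFGS
function default_solver(loss::Symbol, l1::Real)
    l1 > 0 && return :FISTA
    loss === :l2 && return :Analytical
    return :LBFGS
end
```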
> But now I'm confused because (F)ISTA aren't listed as possible solvers for those models in the current docs.
> state clearly in docs what values the field `solver` may take on, eg, "any instance of `LBFGS`, `ProxGrad`". state clearly what the default value is; if this is "dynamic", ie depends on values of other parameters, then I want a concise statement of the logic needed to determine what solver will be chosen.
Isn't what I quoted in my previous answer under *defaults* what you wanted?
Maybe to simplify (I'm aware you have limited bandwidth and that it's not helping to have a long conversation): how about we do this just for Linear+Ridge in a draft PR, get to a satisfactory point, and then progress from there?
MLJ constructors: `LinearRegressor`, `RidgeRegressor`. For both, the `solver` can be specified to be `Analytical(...)`. The default is `Analytical()`. The only difference from the default is if the user passes `iterative=true`, in which case they may also specify `max_inner`.
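Concretely, that would look something like the following (a sketch, assuming MLJLinearModels; parameter names as in the discussion above):

```julia
using MLJLinearModels

# default: full Cholesky backsolve
lin = LinearRegressor(solver = Analytical())

# iterative variant, where `max_inner` becomes meaningful
rdg = RidgeRegressor(solver = Analytical(iterative = true, max_inner = 100))
```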
@tlienart Thanks for the additional help and your patience. #138 is now ready for your review.
I'm considering having a stab at some of #135 but could do with some help.
This appears in this doc page.
MLJ model types are not listed. It would be good to have this, to save some detective work (and the user certainly wants this anyway). To make it easier, I'm copying the lists below:

[copied table residue; only the target-type column survives: yᵢ∈{±1} for the binary classifiers, yᵢ∈{1,...,c} for the multiclass ones]
@tlienart @jbrea