JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Options to Select the "Best" Hyperparameter #487

Closed azev77 closed 4 years ago

azev77 commented 4 years ago

Suppose I split a data set into XT/YT (training) and XH/YH (holdout), and I train a Lasso model with one HP: lambda, taking values in HP_Grid.

  1. For each lambda, I train the model on subsets of XT/YT, using my favorite resampling procedure (K-fold CV, etc.).
  2. For each lambda, I compute my favorite resampling score (RMSE, etc.) on the folds of XT/YT not used for training. (This is an out-of-sample score within the training data.) Now, for each lambda, I have a score. Q: how do I select the "optimal" hyperparameter?
  3. A: The "usual approach" is to select the lambda with the minimum score (minimum error / maximum R2 ...). B: Tibshirani & co warn this may not be prudent! Instead they recommend selecting the lambda one standard error (SE) above the minimum CV error. Often there may be multiple such lambdas; they recommend the most parsimonious such model. In this context most parsimonious = most regularized = biggest lambda (a toy numeric example follows this list).
    In all-subsets regression, parsimonious = the model with the fewest non-zero coefficients.
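
To make the one-SE rule concrete, here is a toy example with made-up numbers (purely illustrative):

  # made-up numbers, for illustration only
  lambdas   = [0.01, 0.1, 1.0, 10.0]    # grid, increasing regularization
  cv_errors = [2.10, 2.00, 2.05, 2.60]  # mean CV RMSE for each lambda
  ses       = [0.08, 0.07, 0.09, 0.10]  # standard error of each CV estimate

  i_min = argmin(cv_errors)                         # "usual approach" picks lambda = 0.1
  threshold = cv_errors[i_min] + ses[i_min]         # 2.00 + 0.07 = 2.07
  i_1se = findlast(e -> e <= threshold, cv_errors)  # one-SE rule picks lambda = 1.0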

Bottom line: a flexible ML interface gives the user the ability to choose her favorite model selection procedure, given her objectives.

@ryantibs illustrates this nicely in his slides. Here is how I do this in R:

  library(glmnet)
  # Train Lasso (cv.glmnet cross-validates over a lambda path)
  m = cv.glmnet(XT, YT, alpha = 1, nfolds = nf, intercept = T)
  # select HP w/ minimum CV error
  pr = predict(m, newx = XH, s = "lambda.min")
  os = sc(YH, pr)   # sc = my scoring function (RMSE)
  # select HP w/ one-SE rule
  pr = predict(m, newx = XH, s = "lambda.1se")
  os = sc(YH, pr)   # RMSE

Next, train the model on the full XT/YT using the preferred lambda, and compute the score on XH/YH. Often "lambda.1se" outperforms "lambda.min" out-of-sample on the holdout data.

Parsimony is a bit subjective, because there are different notions of it. There could be a standard default option for each model, plus enough flexibility for the user to define her own procedure.

ablaom commented 4 years ago

@azev77 Thanks for this. This sounds like a great idea.

It has been suggested to me privately that something like the above be implemented by adding a model_selection_strategy field to TunedModel, which would allow the user to choose, say, the "one-standard-error" model or the "optimal" model.


One obstacle to implementation is that, currently, choosing the optimal model is left up to the tuning strategy (by implementing best appropriately). This was to allow more complicated tuning strategies, like multi-objective (Pareto) ones, to do their own thing. I suppose that instead we could:

  1. Introduce a new abstract type ModelSelectionStrategy, with best becoming a method to be implemented for each instance of these; we would start with the two selection strategies mentioned above. So tuning strategies no longer implement best (mildly breaking). (A sketch follows this list.)

  2. We add a trait for tuning strategies which lists which selection strategies are supported (with the two above as fallbacks).

  3. Support user-defined model selection by allowing model_selection_strategy to be a callable object with an appropriate signature (model_selection_strategy(history) -> model?).

  4. Firm up requirements of the result objects returned by the result method. Currently, the history is a collection of (model, r) pairs (where model = hyperparameter set), and the format of r (the output of result) is left up to the tuning strategy. This r generally contains multiple scores because: (i) the user may be interested in metrics beyond the one defining the objective, and (ii) we want to include multi-objective (Pareto) optimisation strategies. Also, the result may contain strategy-specific data. For example, the Tree Parzen optimisation strategy coming soon needs to record a custom representation of the model there. We do require r to be a named tuple, and in the case of the three built-in tuning strategies there is always a key called measurement, which is a vector of one or more scores, the first corresponding to the metric defining the objective function. So we could simply make that a firm requirement, yes?
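
To make 1-3 concrete, here is a minimal sketch that assumes the history/result format described in 4; all names are illustrative, not a committed API:

  abstract type ModelSelectionStrategy end

  struct NaiveSelection <: ModelSelectionStrategy end   # pick the lowest-scoring model

  struct OneSESelection <: ModelSelectionStrategy       # the "one standard error" rule
      n_se::Float64
  end

  # `best` is implemented per selection strategy; the history is assumed to be a
  # vector of (model, r) pairs with r.measurement[1] the score defining the objective:
  function best(::NaiveSelection, history)
      scores = [r.measurement[1] for (_, r) in history]
      model, _ = history[argmin(scores)]
      return model
  end

  # point 3: any callable history -> model can serve as a custom selection strategy
  best(f::Function, history) = f(history)

  # hypothetical point of use (field name from the suggestion above, not implemented):
  # TunedModel(model=lasso, tuning=Grid(), range=r, measure=rms,
  #            model_selection_strategy=OneSESelection(1.0))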

@yalwan-iqvia Can you confirm the above statement holds for Tree Parzen?

@azev77

Parsimony a bit subjective, because there are different notions of parsimony. There can be a standard default option for each model

Yes, in principle. In the meantime, there needs to be a fallback option, independent of model. Can you think of something better than random selection?

Thoughts?

azev77 commented 4 years ago

Minor point: model_selection_strategy can mean different things.

  1. Select between multiple different models {Flux, Lasso, Ridge}
  2. Select the optimal HP vector for a given model {optimal lambda for Lasso}

Maybe HP_selection_strategy but that sounds weird.

In general, HP_selection_strategy is a function of (at least) two variables: HP_Grid and Score. E.g., for Lasso, we have a grid of lambdas and a vector of scores, one for each lambda.

The default HP_selection_strategy returns the HP that minimizes the score:

f(HP_Grid, Score) = HP_Grid[argmin(Score)]

Note: this is sometimes a min (RMSE), sometimes a max (accuracy). Not sure I like the phrase best or optimal, because it is only best/optimal in some context...

The second most common HP_selection_strategy returns the HP whose score is 1 SE above the minimum score (or 1 SE below the maximum, when maximizing). Note: below I make the number of SEs a keyword argument with default 1.

using Statistics

# assumes HP_Grid is sorted so that a larger index means a bigger lambda (more regularized)
function f(HP_Grid, Score; n_se=1)
    i_min = argmin(Score)
    # stand-in for the standard error; in practice this comes from the per-fold scores
    se = std(Score) / sqrt(length(Score))
    m = Score[i_min] + n_se * se
    # only consider lambda >= HP_Grid[i_min]; pick the score closest to m
    candidates = i_min:length(Score)
    ix = candidates[argmin(abs.(Score[candidates] .- m))]
    return HP_Grid[ix]
end

Tibshirani & co treat "more regularized" as a synonym for "more parsimonious" and select the biggest lambda whose score is closest to m. {That probably makes sense.}

Finally, a user can define her own favorite HP_selection_strategy:

function f(HP_Grid, Score; θ=1)
    ...
    return hp
end

ablaom commented 4 years ago

In general, HP_selection_strategy is a function of (at least) two variables: HP_Grid & Score

The history contains both the models (hyperparameter values, or "grid" in your sense, I guess) and the corresponding scores. See the readme of this repo for a detailed description of the existing API, which furthermore explains all terminology I am using in the post above.

Note: this is sometimes min (RMSE), sometimes max (Accuracy)

At present any MLJ supported measure or measures can be specified when constructing a TunedModel instance. See the readme.

Finally, a user can define her own favorite HP_selection_strategy

Yes, this is more-or-less what I am suggesting, no?

azev77 commented 4 years ago

Yes, it is what you're suggesting; I tend to be pedantic. I just looked at https://github.com/alan-turing-institute/MLJTuning.jl. I see the history returns (m, r) tuples; since m contains the grid and r contains the scores, we're in business...

yalwan-iqvia commented 4 years ago

@yalwan-iqvia Can you confirm the above statement holds for Tree Parzen?

Actually, yes. I worked (quite) hard to fit around the existing design -- part of this was about not having to implement best ourselves, so we ensured that the result function returns something as similar as possible to the fallback version, which, yes, includes the measurement key.

yalwan-iqvia commented 4 years ago

One additional remark: I think we would be happy to accommodate breaking changes.

ablaom commented 4 years ago

@yalwan-iqvia Thanks for the clarification. Unfortunately, I think you can expect breaking changes in most 0.x open-source software projects 😄, and we're only at 0.2.x.

Let's sit on this a wee bit before committing to anything.

@azev77 Although one can distinguish between models that come from different algorithms and models that differ only in the values of their hyperparameters, some model selection criteria might apply in the greater generality and be useful in the specific case here, no? If so, there could be overlap between what we do here and the benchmark design. What do you think?

yalwan-iqvia commented 4 years ago

@azev77 and I had a productive discussion just now.

I got to thinking, nothing stops us from defining a different TuningStrategy object of our own which overrides best to do this, right? It's possible that we can provide multiple implementations, one of which does this based on some user provided setting.

Alternatively, take a callable which allows the user to define an arbitrary function specifying the "preferability" of a model given its hyperparameters.

I have some other questions though, for @ablaom

Right now, the measurement is, well, the measurement the user selected (or implemented with a custom function) -- but if they have selected a resampling strategy, it is an aggregate measure, presumably the mean? Is there any option at the moment to control how this measure is aggregated? Would it even be possible to provide all individual measurements and allow the TuningStrategy to do what it wants with those?

ablaom commented 4 years ago

I got to thinking, nothing stops us from defining a different TuningStrategy object of our own which overrides best to do this, right? It's possible that we can provide multiple implementations, one of which does this based on some user provided setting.

Yes, but why not provide a common interface for all tuning strategies? It seems to me the choice of "decision rule" for deciding the best model is orthogonal to the choice of tuning strategy, no?

Right now, the measurement is, well the measurement the user selected (or implemented with a custom function) -- but if they have selected a resampling strategy, it is an aggregate measure, presumably mean? Is there any option at the moment to control how this measure is aggregated? Would it even be possible to provide all individual measurements and allow TuningStrategy to do what it wants with those?

You already have access to per-observation measurements, if the measure supports this (e.g., l2 but not auc or rms), and not just their aggregation over the sample. If e is an evaluation, then e.per_observation is a vector with one element for each measure; each of these elements is a vector with one element for each fold in the resampling; and each element of that is a per-observation evaluation of the measure. Is this what you want? You also have access to per-fold aggregates, in addition to the overall aggregate. A measure trait shows what aggregation is used, but this is generally going to be mean or sum.

julia> info(l2)
squared deviations; aliases: `l2`.
(name = "l2",
 target_scitype = Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}},
 supports_weights = true,
 prediction_type = :deterministic,
 orientation = :loss,
 reports_each_observation = true,               # <-------- yes reports observations
 aggregation = MLJBase.Mean(),                # <--------- aggregation type
 is_feature_dependent = false,
 docstring = "squared deviations; aliases: `l2`.",
 distribution_type = missing,)

julia> X, y = @load_boston
julia>  e = evaluate(@load(DeterministicConstantRegressor), X, y, measure=l1)
Evaluating over 6 folds: 100%[=========================] Time: 0:00:00
┌───────────┬───────────────┬──────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold                           │
├───────────┼───────────────┼──────────────────────────────────────┤
│ l1        │ 7.16          │ [4.34, 5.63, 8.87, 7.05, 9.67, 7.37] │
└───────────┴───────────────┴──────────────────────────────────────┘
_.per_observation = [[[1.26, 1.14, ..., 1.16], [3.98, 0.124, ..., 0.324], [3.73, 2.03, ..., 21.7], [0.356, 0.644, ..., 3.04], [2.84, 4.44, ..., 9.24], [2.92, 10.3, ..., 11.8]]]

julia> e.per_observation[1]
6-element Array{Array{Float64,1},1}:
 [1.2615201900237487, 1.1384798099762499, 11.961520190023752, 10.661520190023747, 13.461520190023752, 5.961520190023748, 0.1615201900237473, 4.36152019002375, 6.238479809976251, 3.8384798099762527  …  1.3384798099762527, 2.7384798099762513, 1.9384798099762506, 1.538479809976252, 2.4384798099762506, 5.261520190023749, 1.1615201900237473, 2.0615201900237494, 0.1615201900237473, 1.1615201900237473]
 [3.9764845605700714, 0.12351543942993004, 0.42351543942993075, 0.9764845605700714, 6.076484560570069, 0.02351543942992862, 0.62351543942993, 0.27648456057006854, 2.37648456057007, 2.0235154394299286  …  4.37648456057007, 27.37648456057007, 27.37648456057007, 27.37648456057007, 0.07648456057006925, 2.37648456057007, 27.37648456057007, 1.1764845605700707, 1.1764845605700707, 0.32351543942992933] 
 [3.725118483412327, 2.0251184834123244, 1.9748815165876756, 2.4748815165876756, 1.4748815165876756, 8.274881516587673, 2.0748815165876735, 3.4748815165876756, 8.774881516587673, 16.074881516587677  …  3.5251184834123244, 2.625118483412326, 3.174881516587675, 0.6251184834123258, 3.374881516587674, 5.0748815165876735, 3.274881516587673, 3.674881516587675, 8.474881516587676, 21.67488151658767]    
 [0.35592417061611314, 0.6440758293838869, 22.455924170616115, 28.455924170616115, 14.455924170616115, 8.555924170616116, 12.255924170616112, 21.555924170616116, 27.25592417061611, 9.455924170616115  …  2.2440758293838847, 1.055924170616116, 1.7440758293838847, 4.444075829383884, 2.144075829383887, 0.6559241706161139, 0.8440758293838861, 0.444075829383884, 2.0440758293838854, 3.0440758293838854]
 [2.8421800947867304, 4.442180094786732, 4.7421800947867325, 9.257819905213271, 6.942180094786732, 0.45781990521326676, 7.7578199052132675, 5.942180094786732, 6.2421800947867325, 0.3421800947867304  …  5.542180094786733, 7.142180094786731, 16.442180094786732, 16.242180094786733, 15.942180094786732, 13.042180094786731, 14.642180094786731, 15.042180094786731, 6.7421800947867325, 9.242180094786733]
 [2.9241706161137486, 10.324170616113749, 12.02417061611375, 15.424170616113749, 13.52417061611375, 12.824170616113749, 12.72417061611375, 14.22417061611375, 9.22417061611375, 9.62417061611375  …  4.02417061611375, 5.424170616113749, 2.52417061611375, 6.224170616113749, 6.924170616113749, 1.3241706161137508, 3.124170616113748, 0.17582938388624925, 1.7241706161137493, 11.824170616113749]         

julia> 

Unfortunately, not all measures that could report per-observation do so. This is not hard to fix, just hard to find someone willing to do it 😄

yalwan-iqvia commented 4 years ago

OK, I didn't realise per-fold numbers were available; that's good to know.

It seems to me the choice of "decision rule" for deciding the best model is orthogonal to the choice of tuning strategy, no?

It is, but that doesn't mean it couldn't be done in the interim in the absence of the standard interface, given the right information, and for me it would be a fun exercise.

yalwan-iqvia commented 4 years ago

Also, I think per_fold should be named per_resample, since folds are specific to cross-validation, and not all resampling strategies have these "folds".

ablaom commented 4 years ago

not all resampling strategies have these "folds".

Sounds interesting. In MLJ, the resampling API only allows implementation of resampling strategies that generate a vector of (train, test) pairs, where train and test are subsets of a set of provided indices (rows of the data). There's no restriction on how these interact (overlap, and so forth). If you have something that does not fit in here, please open an issue at MLJBase.jl with a use-case. (You mentioned "Monte Carlo resampling" somewhere, but given the breadth of application of these terms, I probably need a bit of detail or a reference, thanks!)

yalwan-iqvia commented 4 years ago

This Stack Exchange post does a lot of the hard work for me here: https://stats.stackexchange.com/questions/51416/k-fold-vs-monte-carlo-cross-validation

An answerer points out that sklearn seems to have this, in a multitude of flavours no less.

ablaom commented 4 years ago

Looks to me like this fits into the existing framework https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Custom-resampling-strategies-1 just fine. You're still generating a bunch of (train, test) sets of indices. It just so happens that there is potential overlap between individual trains and individual tests.
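
For what it's worth, a rough sketch of what a repeated random-split ("Monte Carlo") strategy might look like under that custom-resampling API; the type name, fields, and exact method signature here are illustrative assumptions, not a tested implementation:

  using Random
  import MLJBase

  # Repeated random train/test splits; the train sets (and test sets) of different
  # splits may overlap, unlike the folds of K-fold CV.
  struct MonteCarloCV <: MLJBase.ResamplingStrategy
      n::Int                    # number of random splits
      fraction_train::Float64   # proportion of rows used for training in each split
  end
  MonteCarloCV(; n=10, fraction_train=0.7) = MonteCarloCV(n, fraction_train)

  function MLJBase.train_test_pairs(strategy::MonteCarloCV, rows)
      n_train = round(Int, strategy.fraction_train * length(rows))
      return map(1:strategy.n) do _
          shuffled = shuffle(collect(rows))
          (shuffled[1:n_train], shuffled[n_train+1:end])
      end
  end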

Am I missing something here?

ablaom commented 4 years ago

If you wanted the folds lazily generated, that might require some changes to the API, but I'd say that's probably an unnecessary optimisation.

ablaom commented 4 years ago

Okay, I think Monte-Carlo is already possible; see https://github.com/alan-turing-institute/MLJ.jl/issues/564#issuecomment-642932853

yalwan-iqvia commented 4 years ago

My point about folds there was purely the naming as opposed to something else, i.e. I think per_fold should be called per_resample

ablaom commented 4 years ago

My point about folds there was purely the naming as opposed to something else, i.e. I think per_fold should be called per_resample

Yes, I agree that per_resample is better than per_fold. However, given the name change would be breaking, I'm disinclined to change it.