@azev77 Thanks for this. This sounds like a great idea.
It has been suggested to me privately that something like the above be implemented by adding a model_selection_strategy field to TunedModel, which would allow the user to choose, say, the "one-standard-error" model or the "optimal" model.
One obstacle to implementation is that currently choosing the optimal model is left up to the tuning strategy (by implementing best appropriately). This was to allow more complicated tuning strategies, such as multi-objective (Pareto) ones, to do their own thing. I suppose that instead we could:
1. Introduce a new abstract type ModelSelectionStrategy, with best becoming a method to be implemented for each instance of these; we would start with the two selection strategies mentioned above. Tuning strategies would then no longer implement best (mildly breaking).
2. Add a trait for tuning strategies which lists which selection strategies are supported (with the two above as fallbacks).
3. Support user custom model selection, by allowing model_selection_strategy to be a callable object with an appropriate signature (model_selection_strategy(history) -> model?).
4. Firm up requirements of the result objects returned by the result method. Currently, the history is a collection of (model, r) pairs (where model = hyperparameter set) and the format of r (the output of result) is left up to the tuning strategy. This r generally contains multiple scores because: (i) the user may be interested in metrics beyond the one defining the objective, and (ii) we want to include multi-objective (Pareto) optimisation strategies. Also, the result may contain strategy-specific data. For example, the Tree Parzen optimisation strategy coming soon needs to record a custom representation of the model there. We do require r to be a named tuple, and in the case of the 3 built-in tuning strategies there is always a key called measurement, which is a vector of one or more scores, the first corresponding to the metric defining the objective function. So we could simply make that a firm requirement, yes?
@yalwan-iqvia Can you confirm the above statement holds for Tree Parzen?
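For concreteness, here is a minimal sketch of what such an interface could look like. The names Optimal and OneStandardError, and the assumptions that each history entry is a (model, r) pair with r.measurement[1] the objective score and that lower is better, are illustrative only, not a committed design:

abstract type ModelSelectionStrategy end

struct Optimal <: ModelSelectionStrategy end        # pick the entry with the lowest objective score

struct OneStandardError <: ModelSelectionStrategy   # the "one-standard-error" rule
    n_se::Float64
end

# `best` becomes a method of the selection strategy rather than the tuning strategy;
# `history` is assumed here to be a vector of (model, r) pairs
function best(::Optimal, history)
    scores = [r.measurement[1] for (model, r) in history]
    return history[argmin(scores)][1]
end

# user-defined selection: any callable history -> model
best(f::Function, history) = f(history)

A model_selection_strategy field on TunedModel could then simply be forwarded to best once the search finishes.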
@azev77
Parsimony is a bit subjective, because there are different notions of parsimony. There can be a standard default option for each model
Yes, in principle. In the meantime, there needs to be a fallback option, independent of model. Can you think of something better than random selection?
Thoughts?
Minor point: model_selection_strategy can mean different things. Maybe HP_selection_strategy, but that sounds weird.
In general, HP_selection_strategy is a function of (at least) two variables: HP_Grid & Score.
E.g., for Lasso, we have a grid of lambdas & a vector of "scores", one for each lambda.
The default HP_selection_strategy: returns the HP vector that minimizes the score
f(HP_Grid, Score) = HP_Grid[argmin(Score)]
Note: this is sometimes min (RMSE), sometimes max (Accuracy)
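(As a hypothetical aside, the default rule could dispatch on the measure's orientation trait, shown later in this thread for l2; default_hp and the explicit orientation argument here are illustrative, not an existing MLJ hook:)

# hypothetical default rule; `orientation` (:loss or :score) is an assumed argument
default_hp(HP_Grid, Score, orientation) =
    orientation == :loss ? HP_Grid[argmin(Score)] : HP_Grid[argmax(Score)]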
Not sure I like the phrase best or optimal, because it is only best/optimal in some context...
The second most common HP_selection_strategy: returns the HP vector 1 SE above the one that minimizes the score (equivalently, 1 SE below the one that maximizes the score).
Note: below I make the number of SEs a keyword argument with default = 1.
using Statistics

function f(HP_Grid, Score; n_se=1)
    se = std(Score) / sqrt(length(Score))                  # standard error of the scores
    i_min = argmin(Score)
    m = Score[i_min] + n_se * se
    # only consider lambdas at least as regularized as HP_Grid[i_min], i.e. lambda >= HP_Grid[i_min]
    candidates = [i for i in eachindex(HP_Grid) if HP_Grid[i] >= HP_Grid[i_min]]
    ix = candidates[argmin(abs.(Score[candidates] .- m))]  # index of Score closest to m
    hp = HP_Grid[ix]
    return hp
end
Tibshirani & co view regularized as a synonym for "parsimonious" & select the biggest lambda w/ score closest to m. {That probably makes sense.}
Finally, a user can define her own favorite HP_selection_strategy:
function f(HP_Grid, Score; θ=1)
    ...
    return hp
end
In general, HP_selection_strategy is a function of (at least) two variables: HP_Grid & Score
The history contains both the models (hyperparameter values, or "grid" in your sense, I guess) and the corresponding scores. See the readme of this repo for a detailed description of the existing API, which furthermore explains all terminology I am using in the post above.
Note: this is sometimes min (RMSE), sometimes max (Accuracy)
At present, any MLJ-supported measure or measures can be specified when constructing a TunedModel instance. See the readme.
Finally, a user can define her own favorite HP_selection_strategy
Yes, this is more-or-less what I am suggesting, no?
Yes, it is what you're suggesting. I tend to be pedantic.
I just looked at https://github.com/alan-turing-institute/MLJTuning.jl
I see the history returns (m, r) tuples; since m contains the grid & r contains the scores, we're in business...
@yalwan-iqvia Can you confirm the above statement holds for Tree Parzen?
Actually, yes, I worked (quite) hard to design around the existing design -- part of this was about not having to implement best ourselves, and so we ensured that the result function returns something as similar as possible to the fallback version, which, yes, includes the measurement key.
One additional remark: I think we would be happy to accommodate breaking changes.
@yalwan-iqvia Thanks for the clarification. Unfortunately, I think you can expect breaking changes in most 0.x open-source software projects 😄, and we're only at 0.2.x.
Let's sit on this a wee bit before committing to anything.
@azev77 Although one can distinguish between models that come from different algorithms and models that differ only in the values of their hyperparameters, some model selection criteria might apply in greater generality and be useful in the specific case here, no? If so, there could be overlap between what we do here and the benchmark design. What do you think?
@azev77 and I had a productive discussion just now.
I got to thinking, nothing stops us from defining a different TuningStrategy object of our own which overrides best to do this, right? It's possible that we can provide multiple implementations, one of which does this based on some user provided setting.
Alternatively, take a callable which allows the user to define an arbitrary function which specifies "preferability" of a model given its hyperparameters.
I have some other questions though, for @ablaom
Right now, the measurement is, well the measurement the user selected (or implemented with a custom function) -- but if they have selected a resampling strategy, it is an aggregate measure, presumably mean? Is there any option at the moment to control how this measure is aggregated? Would it even be possible to provide all individual measurements and allow TuningStrategy to do what it wants with those?
I got to thinking, nothing stops us from defining a different TuningStrategy object of our own which overrides best to do this, right? It's possible that we can provide multiple implementations, one of which does this based on some user provided setting.
Yes, but why not provide a common interface for all tuning strategies? It seems to me that the choice of "decision rule" for deciding the best model is orthogonal to the choice of tuning strategy, no?
Right now, the measurement is, well the measurement the user selected (or implemented with a custom function) -- but if they have selected a resampling strategy, it is an aggregate measure, presumably mean? Is there any option at the moment to control how this measure is aggregated? Would it even be possible to provide all individual measurements and allow TuningStrategy to do what it wants with those?
You already have access to per-observation measurements, if the measure supports this (e.g., l2 but not auc or rms), and not just their aggregation over the sample. If e is an evaluation, then e.per_observation is a vector - one element for each measure - and each of these elements is a vector - one for each fold in the resampling - and each element of that is a per-observation evaluation of the measure. Is this what you want? You also have access to per-fold aggregates, in addition to the overall aggregate. A measure trait shows what aggregation is used, but this is going to be mean or sum generally.
julia> info(l2)
squared deviations; aliases: `l2`.
(name = "l2",
target_scitype = Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}},
supports_weights = true,
prediction_type = :deterministic,
orientation = :loss,
reports_each_observation = true, # <-------- yes reports observations
aggregation = MLJBase.Mean(), # <--------- aggregation type
is_feature_dependent = false,
docstring = "squared deviations; aliases: `l2`.",
distribution_type = missing,)
julia> X, y = @load_boston
julia> e = evaluate(@load(DeterministicConstantRegressor), X, y, measure=l1)
Evaluating over 6 folds: 100%[=========================] Time: 0:00:00
┌───────────┬───────────────┬──────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼──────────────────────────────────────┤
│ l1 │ 7.16 │ [4.34, 5.63, 8.87, 7.05, 9.67, 7.37] │
└───────────┴───────────────┴──────────────────────────────────────┘
_.per_observation = [[[1.26, 1.14, ..., 1.16], [3.98, 0.124, ..., 0.324], [3.73, 2.03, ..., 21.7], [0.356, 0.644, ..., 3.04], [2.84, 4.44, ..., 9.24], [2.92, 10.3, ..., 11.8]]]
julia> e.per_observation[1]
6-element Array{Array{Float64,1},1}:
[1.2615201900237487, 1.1384798099762499, 11.961520190023752, 10.661520190023747, 13.461520190023752, 5.961520190023748, 0.1615201900237473, 4.36152019002375, 6.238479809976251, 3.8384798099762527 … 1.3384798099762527, 2.7384798099762513, 1.9384798099762506, 1.538479809976252, 2.4384798099762506, 5.261520190023749, 1.1615201900237473, 2.0615201900237494, 0.1615201900237473, 1.1615201900237473]
[3.9764845605700714, 0.12351543942993004, 0.42351543942993075, 0.9764845605700714, 6.076484560570069, 0.02351543942992862, 0.62351543942993, 0.27648456057006854, 2.37648456057007, 2.0235154394299286 … 4.37648456057007, 27.37648456057007, 27.37648456057007, 27.37648456057007, 0.07648456057006925, 2.37648456057007, 27.37648456057007, 1.1764845605700707, 1.1764845605700707, 0.32351543942992933]
[3.725118483412327, 2.0251184834123244, 1.9748815165876756, 2.4748815165876756, 1.4748815165876756, 8.274881516587673, 2.0748815165876735, 3.4748815165876756, 8.774881516587673, 16.074881516587677 … 3.5251184834123244, 2.625118483412326, 3.174881516587675, 0.6251184834123258, 3.374881516587674, 5.0748815165876735, 3.274881516587673, 3.674881516587675, 8.474881516587676, 21.67488151658767]
[0.35592417061611314, 0.6440758293838869, 22.455924170616115, 28.455924170616115, 14.455924170616115, 8.555924170616116, 12.255924170616112, 21.555924170616116, 27.25592417061611, 9.455924170616115 … 2.2440758293838847, 1.055924170616116, 1.7440758293838847, 4.444075829383884, 2.144075829383887, 0.6559241706161139, 0.8440758293838861, 0.444075829383884, 2.0440758293838854, 3.0440758293838854]
[2.8421800947867304, 4.442180094786732, 4.7421800947867325, 9.257819905213271, 6.942180094786732, 0.45781990521326676, 7.7578199052132675, 5.942180094786732, 6.2421800947867325, 0.3421800947867304 … 5.542180094786733, 7.142180094786731, 16.442180094786732, 16.242180094786733, 15.942180094786732, 13.042180094786731, 14.642180094786731, 15.042180094786731, 6.7421800947867325, 9.242180094786733]
[2.9241706161137486, 10.324170616113749, 12.02417061611375, 15.424170616113749, 13.52417061611375, 12.824170616113749, 12.72417061611375, 14.22417061611375, 9.22417061611375, 9.62417061611375 … 4.02417061611375, 5.424170616113749, 2.52417061611375, 6.224170616113749, 6.924170616113749, 1.3241706161137508, 3.124170616113748, 0.17582938388624925, 1.7241706161137493, 11.824170616113749]
julia>
Unfortunately, not all measures that could report per-observation do so. This is not hard to fix, just hard to find someone willing to do it 😄
Ok I didn't realise per-fold numbers were available, that's good to know.
It seems to me that the choice of "decision rule" for deciding the best model is orthogonal to the choice of tuning strategy, no?
It is, but that doesn't mean it couldn't be done in the interim in the absence of the standard interface, given the right information, and for me it would be a fun exercise.
Also, I think per_fold should be named per_resample, since folds are specific to cross-validation, and not all resampling strategies have these "folds".
not all resampling strategies have these "folds".
Sounds interesting. In MLJ, the resampling API only allows implementation of resampling strategies that generate a vector of (train, test) pairs, where train and test are subsets of a set of provided indices, rows, for the data. There's no restriction on how these interact (overlap, and so forth). If you have something that does not fit in here, please open an issue at MLJBase.jl with a use-case. (You mentioned "Monte Carlo resampling" somewhere, but given the breadth of application of these terms, I probably need a bit of detail or a reference, thanks!)
This stack exchange post does a lot of the hard work for me here: https://stats.stackexchange.com/questions/51416/k-fold-vs-monte-carlo-cross-validation
An answerer points out that sklearn seems to have this, in a multitude of flavours, no less.
Looks to me like this fits into the existing framework https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Custom-resampling-strategies-1 just fine. You're still generating a bunch of (train, test) sets of indices. It just so happens that there is potential overlap between individual trains and individual tests.
Am I missing something here?
If you wanted the folds lazily generated, that might require some changes to the API, but I'd say that's probably an unnecessary optimisation.
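(For what it's worth, a rough sketch of what such a strategy might look like under the custom-resampling API linked above; the type name MonteCarloCV and its fields are made up here, and the train_test_pairs hook is assumed to be the one described in those docs:)

import MLJBase
using Random

struct MonteCarloCV <: MLJBase.ResamplingStrategy
    n::Int                    # number of random train/test splits
    fraction_train::Float64   # fraction of rows used for training in each split
    rng::Random.AbstractRNG
end
MonteCarloCV(; n=10, fraction_train=0.7, rng=Random.GLOBAL_RNG) =
    MonteCarloCV(n, fraction_train, rng)

function MLJBase.train_test_pairs(mc::MonteCarloCV, rows)
    n_train = round(Int, mc.fraction_train * length(rows))
    map(1:mc.n) do _
        shuffled = shuffle(mc.rng, collect(rows))
        (shuffled[1:n_train], shuffled[n_train+1:end])   # trains/tests may overlap across splits
    end
end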
Okay, I think Monte-Carlo is already possible; see https://github.com/alan-turing-institute/MLJ.jl/issues/564#issuecomment-642932853
My point about folds there was purely the naming as opposed to something else, i.e., I think per_fold should be called per_resample.
My point about folds there was purely the naming as opposed to something else, i.e., I think per_fold should be called per_resample
Yes, I agree that per_resample is better than per_fold. However, given the name change would be breaking, I'm disinclined to change it.
Suppose I split a data set into XT/YT (training) & XH/YH (holdout), and I train a Lasso model w/ one HP: lambda in HP_Grid.
In all-subset regression, parsimonious = the model w/ the fewest non-zero coefficients.
Bottom line: a flexible ML interface gives the user the ability to choose her favorite model selection procedure, given her objectives.
@ryantibs illustrates this nicely in his slides. Here is how I do this in R:
Next, train the model on the FULL XT/YT using the preferred lambda and compute the score on XH/YH. Often "lambda.1se" outperforms "lambda.min" out-of-sample in the holdout data.
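(This is not the original R code, but a toy, self-contained Julia illustration of the two selection rules, reusing the f defined earlier in this thread with synthetic scores; the refit-and-score step on XT/YT and XH/YH would use whatever Lasso implementation is at hand:)

HP_Grid = 10 .^ range(-3, 1, length=20)         # grid of lambdas
Score   = (log10.(HP_Grid) .+ 1).^2 .+ 1.0      # synthetic U-shaped CV curve, minimized at lambda = 0.1

lambda_min = HP_Grid[argmin(Score)]             # "lambda.min": minimizes the CV score
lambda_1se = f(HP_Grid, Score; n_se=1)          # "lambda.1se": the one-SE rule from above

# each lambda would then be used to refit the Lasso on the FULL XT/YT and scored on XH/YH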
Parsimony is a bit subjective, because there are different notions of parsimony. There can be a standard default option for each model & enough flexibility for the user to define her own procedure.