Closed. PeterDSteinberg closed this issue 7 years ago.
PR #192 addresses the Stretch Goal at the bottom of the comment with elm.model_selection improvements. This issue needs to remain open to address the other bullets above regarding parameters that are discrete/continuous.
Elm PR #192, in elm.model_selection.EaSearchCV, now addresses more of the bullets above (the checked boxes). Further explanation of the boxes I checked above:
Regarding EaSearchCV: it takes those arguments plus a model_selection argument of EA control parameters (e.g. cxpb, the crossover probability). EaSearchCV runs RandomizedSearchCV (with modifications) for each generation of an evolutionary search, so the documentation on RandomizedSearchCV's param_distributions argument is helpful. param_distributions is a dictionary of string keys mapped to values that may be either a callable with an rvs method for creating random variates, e.g. scipy.stats.lognorm(4), or a list of enumerated acceptable items, including lists of strings for string parameters. See also the documentation on sklearn.model_selection.ParameterSampler, which RandomizedSearchCV uses to interpret param_distributions. EaSearchCV uses ParameterSampler when creating new random individuals, but also uses mutation, crossover, and selection operators from deap to create new parameter sets.
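For concreteness, here is a minimal sketch of a param_distributions dictionary and how sklearn.model_selection.ParameterSampler draws from it; the parameter names (estimator__C, estimator__kernel, pca__n_components) are hypothetical placeholders, not elm-specific names:

```python
# Minimal sketch of a param_distributions dict as ParameterSampler interprets it.
# The parameter names below are hypothetical placeholders.
from scipy import stats
from sklearn.model_selection import ParameterSampler

param_distributions = {
    # a frozen scipy.stats distribution: has an .rvs method for random variates
    'estimator__C': stats.lognorm(4),
    # enumerated choices, including strings, are sampled from the list
    'estimator__kernel': ['linear', 'rbf', 'poly'],
    'pca__n_components': [2, 4, 8],
}

# Draw 5 random parameter sets, roughly as RandomizedSearchCV (and EaSearchCV,
# when it creates new random individuals) does internally.
for params in ParameterSampler(param_distributions, n_iter=5, random_state=0):
    print(params)
```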
Regarding the dask-searchcv source and its limitations on changing a Pipeline's .steps parameter within a DaskBaseSearchCV: note this idea may be partially solved by the user writing a relatively simple class that inherits from a sklearn or other estimator, with a fit method that may optionally skip the actual work, as in this example (so this is in part a documentation issue for EA / Elm on how to do it); a rough sketch of that idea follows below.
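The example linked above is not reproduced here; as a rough sketch of the idea, a small wrapper estimator can skip its work based on a single parameter, so a pipeline step is effectively switched on or off without touching Pipeline.steps during the search. The class and parameter names below are illustrative only, not part of elm or dask-searchcv:

```python
# Illustrative sketch: an estimator whose fit/transform becomes a no-op
# passthrough when n_components is None, so a Pipeline step can be turned
# on or off via a single hyperparameter instead of editing Pipeline.steps.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


class OptionalPCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y=None):
        # n_components=None means "skip PCA entirely"
        self.pca_ = None if self.n_components is None else PCA(self.n_components).fit(X)
        return self

    def transform(self, X):
        return X if self.pca_ is None else self.pca_.transform(X)
```

With a wrapper like this, a choice such as [4, 5, None] for the number of PCA components (mentioned in the list further below) becomes an ordinary entry in param_distributions.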
Regarding "How do [I] handle cases where the user wants to hyperparameterize, either integer, continuous or mixed problems, where there is some combination of parameters that should not be considered?": currently this is not addressed in EaSearchCV when the user gives a model_selection dictionary of NSGA-2 EA control parameters (and we should look at ways of doing so in the EA in a separate issue/PR). But if the user gives model_selection as a callable, that callable returns a list of user-chosen parameter dictionaries, and that function could handle any custom combinations of parameters that need to be avoided; a sketch of the filtering idea follows below. See #205 #204 #203 #198.
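The exact signature elm expects for a model_selection callable is not spelled out here, so the sketch below only illustrates the filtering idea: generate candidate parameter dictionaries, then drop combinations that should never be considered. The a/b parameters and the forbidden pair are hypothetical:

```python
# Sketch of filtering out forbidden parameter combinations before they are
# evaluated; only the filtering logic is illustrated, not elm's actual
# model_selection callable signature.
from sklearn.model_selection import ParameterSampler


def sample_allowed_params(param_distributions, n_iter, random_state=None):
    """Return parameter dicts, dropping forbidden combinations such as a=2 with b=20."""
    candidates = ParameterSampler(param_distributions, n_iter=n_iter,
                                  random_state=random_state)
    return [p for p in candidates if not (p.get('a') == 2 and p.get('b') == 20)]


params = sample_allowed_params({'a': [2, 3, 4], 'b': [10, 20, 30]}, n_iter=6)
```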
This is now being tracked in separate issues (see the comments above about the #192 work towards this issue).
I had a meeting with Grey Nearing (NASA Goddard) to discuss hyperparameterization in hydrology, for statistical models as in Elm or for physical hydrology models. We discussed a few improvements to make to the evolutionary algorithms for statistical models:
- Handle continuous parameters (in elm.pipeline.evolve_train and related code) where some parameters are continuous variables rather than enumerated choices. This would involve writing a mutation method that is custom for each parameter, using mutUniformInt (see the sketch below) for discrete choice problems and a different mutation method from deap for continuous parameters.
- Which distributions should be supported for continuous parameters, e.g. np.random.lognormal?
- In ensemble or fit_ensemble, where custom model selection functions are used, we need to consider cases where the model structure is changing throughout the optimization, e.g. choices for the number of components in a PCA preprocessing step that are [4, 5, None] to indicate 4 or 5 components or no PCA at all.
- Allow one of [2, 3, 4] for a and one of [10, 20, 30] for b, but never the combination (2, 20) for (a, b). I think this is handled by the param grid specification in scikit-learn and we should follow that convention.
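Regarding the per-parameter mutation mentioned in the first bullet above, here is a minimal sketch composed from deap's existing operators, applying tools.mutUniformInt to the discrete genes and tools.mutGaussian to the continuous ones. The gene layout (first two genes discrete, last two continuous) and the bounds are hypothetical:

```python
# Sketch of a mixed discrete/continuous mutation composed from deap operators.
# Gene layout is hypothetical: indices 0-1 are discrete choices, 2-3 are continuous.
from deap import tools


def mutate_mixed(ind, indpb=0.5):
    """Apply mutUniformInt to discrete genes and mutGaussian to continuous genes."""
    discrete, continuous = list(ind[:2]), list(ind[2:])
    # Discrete genes: replaced by a uniform integer in [0, 4] with probability indpb
    tools.mutUniformInt(discrete, low=0, up=4, indpb=indpb)
    # Continuous genes: perturbed by Gaussian noise with probability indpb
    tools.mutGaussian(continuous, mu=0.0, sigma=0.1, indpb=indpb)
    ind[:2], ind[2:] = discrete, continuous
    return (ind,)
```

A mutation like this would be registered on a deap Toolbox (e.g. toolbox.register('mutate', mutate_mixed, indpb=0.5)); the point is simply that each parameter's type decides which deap operator perturbs it.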
Example NSGA-2 control specification for elm.pipeline.evolve_train as it works currently from Phase I: