Closed. PeterDSteinberg closed this issue 7 years ago.
PR #192 addresses the Stretch Goal at the bottom of the comment with elm.model_selection improvements. This issue needs to remain open to address the other bullets above regarding parameters that are discrete/continuous.
Elm PR #192, in elm.model_selection.EaSearchCV, now addresses more of the bullets above (the checked boxes). Further explanation of the boxes I checked above:
Regarding EaSearchCV: it takes those arguments plus a model_selection argument of EA control parameters (e.g. cxpb, the crossover probability). EaSearchCV runs RandomizedSearchCV (with modifications) for each generation of an evolutionary search, so the documentation on RandomizedSearchCV's param_distributions argument is helpful. param_distributions is a dictionary of string keys mapped to values that may be either a callable with an rvs method for creating random variates, e.g. scipy.stats.lognorm(4), or a list of enumerated acceptable items, including lists of strings for string parameters. See also the documentation on sklearn.model_selection.ParameterSampler, which RandomizedSearchCV uses to interpret param_distributions. EaSearchCV uses ParameterSampler when creating new random individuals, but also uses mutation, crossover, and selection operators from deap to create new parameter sets.
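For concreteness, here is a minimal sketch of a param_distributions dictionary and how sklearn.model_selection.ParameterSampler draws from it; the parameter names (estimator__C, estimator__kernel, pca__n_components) are hypothetical placeholders, not elm-specific names:

```python
# Minimal sketch of a param_distributions dict as ParameterSampler interprets it.
# The parameter names below are hypothetical placeholders.
from scipy import stats
from sklearn.model_selection import ParameterSampler

param_distributions = {
    # a frozen scipy.stats distribution: has an .rvs method for random variates
    'estimator__C': stats.lognorm(4),
    # enumerated choices, including strings, are sampled from the list
    'estimator__kernel': ['linear', 'rbf', 'poly'],
    'pca__n_components': [2, 4, 8],
}

# Draw 5 random parameter sets, roughly as RandomizedSearchCV (and EaSearchCV,
# when it creates new random individuals) does internally.
for params in ParameterSampler(param_distributions, n_iter=5, random_state=0):
    print(params)
```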
Regarding the dask-searchcv source and its limitations on changing a Pipeline's .steps parameter within a DaskBaseSearchCV: note this idea may be partially solved by the user writing a relatively simple class that inherits from a sklearn or other estimator, with a fit method that may optionally skip the actual work, as in this example (so this is in part a documentation issue for EA / Elm on how to do it); a rough sketch of that idea follows below.
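The example linked above is not reproduced here; as a rough sketch of the idea, a small wrapper estimator can skip its work based on a single parameter, so a pipeline step is effectively switched on or off without touching Pipeline.steps during the search. The class and parameter names below are illustrative only, not part of elm or dask-searchcv:

```python
# Illustrative sketch: an estimator whose fit/transform becomes a no-op
# passthrough when n_components is None, so a Pipeline step can be turned
# on or off via a single hyperparameter instead of editing Pipeline.steps.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


class OptionalPCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y=None):
        # n_components=None means "skip PCA entirely"
        self.pca_ = None if self.n_components is None else PCA(self.n_components).fit(X)
        return self

    def transform(self, X):
        return X if self.pca_ is None else self.pca_.transform(X)
```

With a wrapper like this, a choice such as [4, 5, None] for the number of PCA components (mentioned in the list further below) becomes an ordinary entry in param_distributions.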
Regarding "How do [I] handle cases where the user wants to hyperparameterize, either integer, continuous or mixed problems, where there is some combination of parameters that should not be considered?": currently this is not addressed in EaSearchCV when the user gives a model_selection dictionary of NSGA-2 EA control parameters (and we should look at ways of doing so in the EA in a separate issue/PR). But if the user gives model_selection as a callable, that callable returns a list of user-chosen parameter dictionaries, and that function could handle any custom combinations of parameters that need to be avoided; a sketch of the filtering idea follows below. See #205 #204 #203 #198.
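The exact signature elm expects for a model_selection callable is not spelled out here, so the sketch below only illustrates the filtering idea: generate candidate parameter dictionaries, then drop combinations that should never be considered. The a/b parameters and the forbidden pair are hypothetical:

```python
# Sketch of filtering out forbidden parameter combinations before they are
# evaluated; only the filtering logic is illustrated, not elm's actual
# model_selection callable signature.
from sklearn.model_selection import ParameterSampler


def sample_allowed_params(param_distributions, n_iter, random_state=None):
    """Return parameter dicts, dropping forbidden combinations such as a=2 with b=20."""
    candidates = ParameterSampler(param_distributions, n_iter=n_iter,
                                  random_state=random_state)
    return [p for p in candidates if not (p.get('a') == 2 and p.get('b') == 20)]


params = sample_allowed_params({'a': [2, 3, 4], 'b': [10, 20, 30]}, n_iter=6)
```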
This is now being tracked in separate issues (see the comments above about the #192 work towards this issue).
I had a meeting with Grey Nearing (NASA Goddard) to discuss hyperparameterization in hydrology, for statistical models as in Elm or for physical hydrology models. We discussed a few improvements to make to the evolutionary algorithms for statistical models:
- Handle continuous parameters (in elm.pipeline.evolve_train and related code) where some parameters are continuous variables rather than enumerated choices. This would involve writing a mutation method that is custom for each parameter, using mutUniformInt (see the sketch below) for discrete choice problems and a different mutation method from deap for continuous parameters.
- Which distributions should be supported for continuous parameters, e.g. np.random.lognormal?
- In ensemble or fit_ensemble, where custom model selection functions are used, we need to consider cases where the model structure is changing throughout the optimization, e.g. choices for the number of components in a PCA preprocessing step that are [4, 5, None] to indicate 4 or 5 components or no PCA at all.
- Allow one of [2, 3, 4] for a and one of [10, 20, 30] for b, but never the combination (2, 20) for (a, b). I think this is handled by the param grid specification in scikit-learn and we should follow that convention.
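Regarding the per-parameter mutation mentioned in the first bullet above, here is a minimal sketch composed from deap's existing operators, applying tools.mutUniformInt to the discrete genes and tools.mutGaussian to the continuous ones. The gene layout (first two genes discrete, last two continuous) and the bounds are hypothetical:

```python
# Sketch of a mixed discrete/continuous mutation composed from deap operators.
# Gene layout is hypothetical: indices 0-1 are discrete choices, 2-3 are continuous.
from deap import tools


def mutate_mixed(ind, indpb=0.5):
    """Apply mutUniformInt to discrete genes and mutGaussian to continuous genes."""
    discrete, continuous = list(ind[:2]), list(ind[2:])
    # Discrete genes: replaced by a uniform integer in [0, 4] with probability indpb
    tools.mutUniformInt(discrete, low=0, up=4, indpb=indpb)
    # Continuous genes: perturbed by Gaussian noise with probability indpb
    tools.mutGaussian(continuous, mu=0.0, sigma=0.1, indpb=indpb)
    ind[:2], ind[2:] = discrete, continuous
    return (ind,)
```

A mutation like this would be registered on a deap Toolbox (e.g. toolbox.register('mutate', mutate_mixed, indpb=0.5)); the point is simply that each parameter's type decides which deap operator perturbs it.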
Example NSGA-2 control specification for elm.pipeline.evolve_train as it works currently from Phase I: