ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

use dask-searchcv for evolutionary search EaSearchCV #192

Closed PeterDSteinberg closed 6 years ago

PeterDSteinberg commented 6 years ago

This refactors elm's evolutionary algorithms as EaSearchCV, a subclass of dask_searchcv.DaskBaseSearchCV:

Example usage:

from collections import OrderedDict

from dask_glm.datasets import make_regression as dsk_make_regression
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from xarray_filters import MLDataset
from xarray_filters.datasets import _make_base
from elm.model_selection.ea_searchcv import EaSearchCV

# Wrap the dask_glm regression data maker so it returns an MLDataset
dsk_make_regression = _make_base(dsk_make_regression)
shape = (100, 10, 5, 2)
dset = dsk_make_regression(shape=shape, n_samples=np.prod(shape))
# Separate the target ('y') from the feature data_vars
dv = OrderedDict([(k, v) for k, v in dset.data_vars.items()
                  if k != 'y'])
X = MLDataset(dv)
y = dset.y.values.ravel()
# Flatten the 4-D data_vars into a 2-D (samples, features) matrix
X = X.to_features().features.values
data_source = dict(X=X, y=y)

pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('pca', PCA()),
                 ('reg', LinearRegression())])

param_grid = dict(poly__degree=list(range(1, 3)),
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

k = 40       # offspring evaluated per generation
mu = 20      # population size carried between generations
ngen = 10    # number of generations
mutpb = 0.4  # mutation probability
cxpb = 0.6   # crossover probability
param_grid_name = 'example_1'

ea = EaSearchCV(estimator=pipe,
                param_grid=param_grid,
                score_weights=[1],
                k=k,
                mu=mu,
                ngen=ngen,
                mutpb=mutpb,
                cxpb=cxpb,
                param_grid_name=param_grid_name,
                early_stop=None,
                toolbox=None,
                scoring=None,
                refit=False,
                cv=None,
                error_score='raise',
                return_train_score=True,
                scheduler=None,
                n_jobs=-1,
                cache_cv=True)
ea.fit(X, y=y)
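For intuition about how the mu / k / ngen / cxpb / mutpb arguments interact, here is a minimal stdlib-only sketch of a (mu + lambda)-style evolutionary search over the same param_grid. This is a hypothetical illustration with a toy fitness function, not EaSearchCV's actual implementation (which evaluates candidates by cross-validated scoring and can be customized via the toolbox argument):

```python
import random

# Same search space as the example above
param_grid = dict(poly__degree=[1, 2],
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

def random_individual(grid, rng):
    """Sample one parameter combination at random."""
    return {key: rng.choice(choices) for key, choices in grid.items()}

def mutate(ind, grid, rng):
    """Point mutation: resample one randomly chosen parameter."""
    child = dict(ind)
    key = rng.choice(list(grid))
    child[key] = rng.choice(grid[key])
    return child

def crossover(a, b, rng):
    """Uniform crossover: take each parameter from either parent."""
    return {key: rng.choice([a[key], b[key]]) for key in a}

def ea_search(score, grid, mu=20, k=40, ngen=10, cxpb=0.6, mutpb=0.4, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(grid, rng) for _ in range(mu)]
    for _ in range(ngen):
        offspring = []
        while len(offspring) < k:
            a, b = rng.sample(pop, 2)
            child = crossover(a, b, rng) if rng.random() < cxpb else dict(a)
            if rng.random() < mutpb:
                child = mutate(child, grid, rng)
            offspring.append(child)
        # (mu + lambda) selection: keep the best mu of parents + offspring
        pop = sorted(pop + offspring, key=score, reverse=True)[:mu]
    return max(pop, key=score)

# Toy fitness standing in for cross-validated model score:
# prefer low PCA dimensionality and degree-1 features
def toy_score(ind):
    return -ind['pca__n_components'] - ind['poly__degree']

best = ea_search(toy_score, param_grid)
```

Compared with an exhaustive GridSearchCV over all combinations, the EA evaluates at most mu + k * ngen candidates, which is what makes it attractive for larger parameter spaces.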
PeterDSteinberg commented 6 years ago

TODO:

PeterDSteinberg commented 6 years ago

Here is the output of py.test -m "not slow" -vvvv (skipping slow tests, with the verbose flag):

The tests show 18 failed, 1866 passed, 1941 skipped, 194 deselected, 15 warnings in 383.18 seconds: pytest_vvv_not_slow_tuesday_october_11_results.txt

Over the next day I'll continue commenting on existing issues and making new ones (about 4 to 6) that relate to the 18 test failures. Those test failures should not delay the merge of this PR, as some are "expected failures" (not marked as such in py.test, but expected to fail because we have not yet completed all of our data structure flexibility goals).

@gbrener Could you checkout this branch and run the py.test command in Py 3.6 / 2.7 locally and pipe your output to a similar file so we can check the number of failures is the same or explain why different. I constructed my env by install elm from the anaconda elm 3.5 dev branch to get the environment, then installed from this branch + xarray_filters PR 19