ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

use dask-searchcv for evolutionary search EaSearchCV #192

Closed PeterDSteinberg closed 6 years ago

PeterDSteinberg commented 6 years ago

This refactors elm's evolutionary algorithms as EaSearchCV, a subclass of dask_searchcv.DaskBaseSearchCV:

Example usage:

from collections import OrderedDict

from dask_glm.datasets import make_regression as dsk_make_regression
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from xarray_filters import MLDataset
from xarray_filters.datasets import _make_base
from elm.model_selection.ea_searchcv import EaSearchCV

# Wrap the dask_glm regression data maker so it returns an MLDataset
dsk_make_regression = _make_base(dsk_make_regression)
shape = (100, 10, 5, 2)
dset = dsk_make_regression(shape=shape, n_samples=np.prod(shape))
# Separate the target ('y') from the feature data_vars
dv = OrderedDict([(k, v) for k, v in dset.data_vars.items()
                  if k != 'y'])
X = MLDataset(dv)
y = dset.y.values.ravel()
# Flatten the 4-D data_vars into a 2-D (samples, features) matrix
X = X.to_features().features.values
data_source = dict(X=X, y=y)

pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('pca', PCA()),
                 ('reg', LinearRegression())])

param_grid = dict(poly__degree=list(range(1, 3)),
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

k = 40       # offspring evaluated per generation
mu = 20      # population size carried between generations
ngen = 10    # number of generations
mutpb = 0.4  # mutation probability
cxpb = 0.6   # crossover probability
param_grid_name = 'example_1'

ea = EaSearchCV(estimator=pipe,
                param_grid=param_grid,
                score_weights=[1],
                k=k,
                mu=mu,
                ngen=ngen,
                mutpb=mutpb,
                cxpb=cxpb,
                param_grid_name=param_grid_name,
                early_stop=None,
                toolbox=None,
                scoring=None,
                refit=False,
                cv=None,
                error_score='raise',
                return_train_score=True,
                scheduler=None,
                n_jobs=-1,
                cache_cv=True)
ea.fit(X, y=y)
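For intuition about how the mu / k / ngen / cxpb / mutpb arguments interact, here is a minimal stdlib-only sketch of a (mu + lambda)-style evolutionary search over the same param_grid. This is a hypothetical illustration with a toy fitness function, not EaSearchCV's actual implementation (which evaluates candidates by cross-validated scoring and can be customized via the toolbox argument):

```python
import random

# Same search space as the example above
param_grid = dict(poly__degree=[1, 2],
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

def random_individual(grid, rng):
    """Sample one parameter combination at random."""
    return {key: rng.choice(choices) for key, choices in grid.items()}

def mutate(ind, grid, rng):
    """Point mutation: resample one randomly chosen parameter."""
    child = dict(ind)
    key = rng.choice(list(grid))
    child[key] = rng.choice(grid[key])
    return child

def crossover(a, b, rng):
    """Uniform crossover: take each parameter from either parent."""
    return {key: rng.choice([a[key], b[key]]) for key in a}

def ea_search(score, grid, mu=20, k=40, ngen=10, cxpb=0.6, mutpb=0.4, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(grid, rng) for _ in range(mu)]
    for _ in range(ngen):
        offspring = []
        while len(offspring) < k:
            a, b = rng.sample(pop, 2)
            child = crossover(a, b, rng) if rng.random() < cxpb else dict(a)
            if rng.random() < mutpb:
                child = mutate(child, grid, rng)
            offspring.append(child)
        # (mu + lambda) selection: keep the best mu of parents + offspring
        pop = sorted(pop + offspring, key=score, reverse=True)[:mu]
    return max(pop, key=score)

# Toy fitness standing in for cross-validated model score:
# prefer low PCA dimensionality and degree-1 features
def toy_score(ind):
    return -ind['pca__n_components'] - ind['poly__degree']

best = ea_search(toy_score, param_grid)
```

Compared with an exhaustive GridSearchCV over all combinations, the EA evaluates at most mu + k * ngen candidates, which is what makes it attractive for larger parameter spaces.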
PeterDSteinberg commented 6 years ago

TODO:

PeterDSteinberg commented 6 years ago

Here is the output of py.test -m "not slow" -vvvv (skipping slow tests, with the verbose flag):

The tests show 18 failed, 1866 passed, 1941 skipped, 194 deselected, 15 warnings in 383.18 seconds: pytest_vvv_not_slow_tuesday_october_11_results.txt

Over the next day I'll continue commenting on existing issues and making new ones (about 4 to 6) that relate to the 18 test failures. Those test failures should not delay the merge of this PR, as some are "expected failures" (not marked as such in py.test, but expected to fail because we have not yet completed all of our data structure flexibility goals).

@gbrener Could you checkout this branch and run the py.test command in Py 3.6 / 2.7 locally and pipe your output to a similar file so we can check the number of failures is the same or explain why different. I constructed my env by install elm from the anaconda elm 3.5 dev branch to get the environment, then installed from this branch + xarray_filters PR 19