Cross validation for xarray.* data structures

PeterDSteinberg commented 7 years ago

Elm PR #192 added tests of Pipeline and EaSearchCV for xarray and numpy data structures (see #202 for the goals there). Some of the tests on xarray-based data structures were failing when they involved cross validation. Cross validation iterators from sklearn typically expect a 2D X matrix that can be sliced into training and test subsets.

Implement cross validation for xarray data structures by creating functions that split an iterable of sampler arguments into training and test subsets, using KFold or other cross validation iterators from sklearn.model_selection.
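A minimal sketch of what such a splitting function could look like. The name split_sampler_args and the file names are hypothetical and are not part of elm. It only assumes that the sampler arguments can be put in a list and that the splitter follows the sklearn.model_selection interface:

```python
import numpy as np
from sklearn.model_selection import KFold


def split_sampler_args(sampler_args, cv=None):
    """Yield (train_args, test_args) pairs from an iterable of sampler arguments.

    Hypothetical helper: the cross validation iterator only sees integer
    positions, so the arguments themselves can be filenames, dates,
    bounding boxes or anything else a sampler accepts.
    """
    sampler_args = list(sampler_args)
    if cv is None:
        cv = KFold(n_splits=5)
    positions = np.arange(len(sampler_args))
    for train_idx, test_idx in cv.split(positions):
        train_args = [sampler_args[i] for i in train_idx]
        test_args = [sampler_args[i] for i in test_idx]
        yield train_args, test_args


# Hypothetical sampler arguments: ten file names
filenames = ["scene_{:02d}.nc".format(i) for i in range(10)]
folds = list(split_sampler_args(filenames, KFold(n_splits=5)))
```

Each element of folds is a pair of lists. The sampler would then be called on each argument in a list to build the training or test data for that fold, so no 2D X matrix is needed up front.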

An example usage is below (taken from wiki):

sampler = Sampler()
pipe = Pipeline([('sampler', sampler),
                  ('set_nans', SetNan()),
                  ('radiance', Radiance()),
                  ('normed_diffs', NormedDiffs()),
                  ('choose', ChooseBands(include_normed_diffs=True)),
                  ('drop_na', DropRows()),
                  ('standard', steps.preprocessing.StandardScaler()),
                  ('pca', steps.decomposition.PCA(n_components=5)),
                  ('est', steps.cluster.MiniBatchKMeans())])

X = pipe.fit(SAMPLE)
ea = EaSearchCV(pipe,
                param_distributions=param_distributions,
                ngen=2,
                model_selection=model_selection,
                cv=5)
Xt, y = ea.fit(SAMPLES)

In the example above, SAMPLES could be a list of filenames or datetime/spatial arguments that a function needs in order to build a sample xarray_filters.MLDataset. Inside EaSearchCV (or the daskml code it references), the SAMPLES iterable would be divided by KFold (the default here, because the integer 5 is given as cv) or by a cross validation iterator from sklearn.model_selection passed as cv, e.g. cv=StratifiedKFold(7).
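As a rough illustration of how an integer cv and a splitter object could both be handled, the sketch below uses sklearn.model_selection.check_cv, which turns an integer into a KFold and passes a splitter object through unchanged. The samples and labels are made up for the example, and whether EaSearchCV resolves cv this way is an assumption:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical sampler arguments and per-sample class labels
samples = ["sample_{}".format(i) for i in range(14)]
labels = np.array([0, 1] * 7)
positions = np.arange(len(samples))

# An integer is resolved to KFold when no classifier/labels are involved
cv_int = check_cv(5)
int_folds = list(cv_int.split(positions))

# A splitter object is returned as-is; StratifiedKFold also needs labels
cv_strat = check_cv(StratifiedKFold(7))
strat_folds = list(cv_strat.split(positions, labels))

# Map the test indices of the first stratified fold back to sampler arguments
first_test_samples = [samples[i] for i in strat_folds[0][1]]
```

Stratified splitting needs one label per sample, so for SAMPLES that are filenames or spatial arguments the labels would have to be known before the data are loaded.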

PeterDSteinberg commented 7 years ago

See this notebook showing current status of EaSearchCV

PeterDSteinberg commented 7 years ago

Duplicate of #204

ContinuumIO / elm

Cross validation for xarray.* data structures #215