EaSearchCV - CV splitting is failing with Xarray inputs currently

In EaSearchCV (PR #192), cross validation fails when given an MLDataset / Dataset in a Pipeline or estimator, because the cross validation logic in dask-searchcv and/or scikit-learn subsets rows of a dask array/dataframe or numpy array. Things to consider:

- @gbrener's work in xarray_filters allows parameterization of chained transformers on an MLDataset / Dataset, which has utility in ML and non-ML contexts, e.g. later using parameters to control transformers for viz. These transformers may be transforming a dataset with 1 or more DataArrays.
- In scikit-learn Pipeline and its usage with dask or numpy, the cross validation tools are generally for 2D feature array inputs, so the cross validation classes currently fail when EaSearchCV is used on an MLDataset/Dataset.
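To illustrate the row-subsetting point above, here is a minimal sketch of how a scikit-learn splitter works on a 2D array. The array `X` is made-up data for illustration. The splitter yields integer row positions, which index a numpy array directly; an xarray Dataset is keyed by variable name rather than by row position, so the same indexing does not carry over.

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up 2D feature matrix: 6 samples (rows), 2 features (columns)
X = np.arange(12).reshape(6, 2)

# KFold yields (train_idx, test_idx) pairs of integer row positions
kfold = KFold(n_splits=3)
splits = list(kfold.split(X))

for train_idx, test_idx in splits:
    # Row subsetting: this is the operation that assumes a 2D array
    X_train, X_test = X[train_idx], X[test_idx]
    print(train_idx, test_idx, X_train.shape, X_test.shape)
```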

Ideas: We want to allow both use styles, but cross validation can only be supported on the portion of a pipeline that has a 2D features matrix, e.g. a numpy or dask array, or an MLDataset/Dataset with a single features DataArray whose .values array is given to the cross validation tools. For example, in a Pipeline of:

Step 1: Spatial filters on 4-D arrays - custom function to set NaN where the 4-D arrays are out of domain, such as NaNs for ocean on terrestrial 4-D DataArray(s)
Step 2: Parameterizable operation on the 4-D arrays that allows Laplacian, gradient or no filter
Step 3: Call to_features on the MLDataset of 4-D arrays to convert to features matrix
Step 4: Drop the NaN rows of the .features 2D DataArray
Step 5: PCA
Step 6: KMeans

Hyperparameterization should be supported for any step, but cross validation only for steps 3 through 6, which use a typical feature matrix.
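A rough sketch of the six steps on plain numpy arrays, to show where the 2D feature matrix first appears. Everything here is an assumption for illustration: the dict of random 4-D arrays stands in for an MLDataset, the mask in step 1 is invented, and the flattening in step 3 only mimics the idea of to_features rather than calling xarray_filters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Hypothetical stand-in for an MLDataset: 3 layers of shape (t, z, y, x)
layers = {name: rng.rand(2, 3, 4, 5) for name in ("a", "b", "c")}

# Step 1: invented spatial mask, NaN where the x index is 0 ("ocean")
for arr in layers.values():
    arr[..., 0] = np.nan

# Step 2 (Laplacian / gradient / no filter) is skipped: "no filter"

# Step 3: flatten each 4-D layer into one column of a 2D feature matrix
features = np.column_stack([arr.ravel() for arr in layers.values()])

# Step 4: drop rows that contain any NaN
features = features[~np.isnan(features).any(axis=1)]

# Steps 5 and 6: from here on, ordinary scikit-learn tools apply
model = Pipeline([
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = model.fit_predict(features)
print(features.shape, labels.shape)
```

Only the part from step 3 onward produces something a row-based splitter can subset, which is why cross validation is limited to those steps.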

Here are some notes I took regarding cross validation, thinking about how to make cross validation work with xarray/xarray_filters data structures. Currently this Pipeline is failing because xarray data structures are used in the steps up to scaler (it runs fine as a Pipeline but fails in cross validation if used in GridSearchCV or EaSearchCV).

pipe = Pipeline([
    ('sampler', Sampler(max_time_steps=max_time_steps)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

In pseudocode this is what cross validating at Sampler level would look like:

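for n in range(num_samples):
    # Draw a new sample, then run the remaining transformers on it
    dset = pipe.steps[0][1].fit_transform(**sample_args)
    for name, step in pipe.steps[1:-1]:
        dset = step.fit_transform(dset)
    # Fit the final estimator on this sample
    fitted = pipe.steps[-1][1].fit(dset)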

Currently sklearn cross validation iterators would support:

for n in range(num_splits):
    # Test / train split input array X
    # Run the `scaler`, `pca`, and `estimator`
    # steps on each test/train batch
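For comparison, a minimal runnable sketch of what the sklearn iterators support today on the array portion of the pipeline. The data is random, and StandardScaler and PCA are stand-ins for the `scaler` and `pca` steps (ChooseWithPreproc is not used here).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(60, 5)                      # made-up 2D feature matrix
coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ coefs + 0.01 * rng.randn(60)     # made-up target

array_pipe = Pipeline([
    ("scaler", StandardScaler()),        # stand-in for the `scaler` step
    ("pca", PCA(n_components=3)),        # stand-in for the `pca` step
    ("estimator", LinearRegression()),
])

# Each split refits scaler, pca and estimator on the training rows
# and scores on the held-out rows
scores = cross_val_score(array_pipe, X, y, cv=KFold(n_splits=3))
print(scores)
```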

Nested cross validation idea - cross validating at the input samples level (e.g. filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:

def make_sample(n):
    return load_array('big_file_{}.nc'.format(n))

xarray_pipe = Pipeline([
    ('sampler', Sampler(func=make_sample)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),])
numpy_or_dask_pipe = Pipeline([
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

def nested_cross_val(outer_num_samples, inner_num_samples):
    for n in range(outer_num_samples):
        # This is the outer cross validation, e.g.
        # over file names or dates that
        # determine a sample reading from file, for example.
        # In the NLDAS ML
        sample = xarray_pipe.steps[0][1].fit_transform(n)
        for name, step in xarray_pipe.steps[1:]:
            sample = step.fit_transform(sample)
        X, y = sample
        for m in range(inner_num_samples):  # e.g. KFold
            # test / train split X, y
            # run the steps in the `numpy_or_dask_pipe`
            # This is the "inner cross validation"
            pass
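A runnable toy version of the two-level loop, as a sketch only. Assumptions: `make_sample` here generates random arrays instead of reading files and returns X, y directly (skipping the xarray steps), and LinearRegression alone stands in for the numpy/dask pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def make_sample(n):
    """Hypothetical stand-in for loading sample n; returns X, y."""
    rng = np.random.RandomState(n)
    X = rng.rand(40, 3)
    y = X.sum(axis=1) + 0.01 * rng.randn(40)
    return X, y


def nested_cross_val(outer_num_samples, inner_num_splits):
    # One row of scores per outer sample, one column per inner split
    scores = np.empty((outer_num_samples, inner_num_splits))
    for i in range(outer_num_samples):
        # Outer level: each iteration uses a different input sample
        X, y = make_sample(i)
        inner = KFold(n_splits=inner_num_splits)
        for j, (train, test) in enumerate(inner.split(X)):
            # Inner level: ordinary row-wise train / test split
            model = LinearRegression().fit(X[train], y[train])
            scores[i, j] = model.score(X[test], y[test])
    return scores


scores = nested_cross_val(3, 4)
print(scores.mean(axis=1))  # mean inner score for each outer sample
```

Returning the full score matrix leaves open how the two levels get aggregated, e.g. a mean over everything or a per-sample mean.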

Nested cross validation inside evolutionary search (pseudocode):

def ea_search_cv(outer_cv, inner_cv):
    pop = initialize()
    for generation in range(ngen):
        # Each generation in the evolutionary algorithm
        scores = []
        for model in pop:
            # Each member of the population:
            # do outer / inner cross validations and
            # accumulate the two-layer cross validation scores
            scores.append(nested_cross_val(outer_cv, inner_cv))
        # The EA search chooses the
        # best parameters based on cross validation scores
        pop = select_new_population(scores)
    return pop
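To make the generation / population loop concrete, here is a small toy search, not EaSearchCV's actual algorithm. Assumptions: the only hyperparameter is a Ridge `alpha`, selection keeps the top two members and mutates them in log space, and only a single (inner) level of cross validation is used to keep the example short.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(80, 4)                                   # made-up features
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.05 * rng.randn(80)


def fitness(alpha):
    # Mean cross validation score for one population member
    return cross_val_score(Ridge(alpha=alpha), X, y, cv=3).mean()


def ea_search(ngen=5, pop_size=6, keep=2):
    # Initial population: alphas drawn log-uniformly from [1e-3, 1e2]
    pop = list(10.0 ** rng.uniform(-3, 2, size=pop_size))
    best_alpha, best_score = None, -np.inf
    for generation in range(ngen):
        scores = [fitness(alpha) for alpha in pop]
        order = np.argsort(scores)[::-1]              # best first
        if scores[order[0]] > best_score:
            best_alpha, best_score = pop[order[0]], scores[order[0]]
        # Selection: keep the top members, refill with mutated copies
        parents = [pop[i] for i in order[:keep]]
        n_children = pop_size // keep - 1
        children = [p * 10.0 ** rng.normal(0, 0.3)
                    for p in parents for _ in range(n_children)]
        pop = parents + children
    return best_alpha, best_score


best_alpha, best_score = ea_search()
print(best_alpha, best_score)
```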
