ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
EaSearchCV - CV splitting is failing with Xarray inputs currently #204

Open PeterDSteinberg opened 7 years ago

PeterDSteinberg commented 7 years ago

In EaSearchCV (PR #192), cross validation fails when given an MLDataset / Dataset in a Pipeline or estimator, because the cross validation logic in dask-searchcv and/or scikit-learn subsets rows of a dask array/dataframe or numpy array. Things to consider:

Ideas: We want to allow both use styles, but cross validation can only be supported on the portion of a pipeline that has a 2D features matrix, e.g. a numpy or dask array, or an MLDataset/Dataset with a single features DataArray whose .values array is given to the cross validation tools. For example, in a Pipeline of:

Hyperparameterization should be supported for any step, but cross validation only for steps 3 through 6, which use a typical feature matrix.

PeterDSteinberg commented 7 years ago

Here are some notes I took on cross validation, thinking about how to make it work with xarray/xarray_filters data structures. Currently this Pipeline fails because xarray data structures are used in the steps up to scaler (it runs fine as a Pipeline but fails in cross validation if used in GridSearchCV or EaSearchCV).

pipe = Pipeline([
    ('sampler', Sampler(max_time_steps=max_time_steps)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

In pseudocode this is what cross validating at Sampler level would look like:

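for n in range(num_samples):
    # Step 0 (the Sampler) draws a new sample
    dset = pipe.steps[0][1].fit_transform(**sample_args)
    # Remaining transformers; skip the sampler and the final estimator
    for name, step in pipe.steps[1:-1]:
        if step is not None:  # the 'None' placeholder step is a no-op
            dset = step.fit_transform(dset)
    # Fit the final estimator on this sample
    pipe.steps[-1][1].fit(dset)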

Currently sklearn cross validation iterators would support:

for n in range(num_splits):
    # Test / train split input array X
    # Run the `scaler`, `pca`, and `estimator`
    # steps on each test/train batch
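As a minimal sketch of this inner case, which scikit-learn already handles, the following runs K-fold cross validation over a plain numpy feature matrix. StandardScaler and PCA are stand-ins for the `scaler` and `pca` steps (ChooseWithPreproc is not reproduced here), and the data is synthetic; both are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 2D feature matrix and target (assumption: stands in for
# the output of the xarray steps up to and including `get_y`)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# StandardScaler / PCA are stand-ins for the `scaler` and `pca` steps
inner_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('estimator', LinearRegression()),
])

# KFold subsets rows of X and y by integer index, which works for
# numpy arrays; per the issue description, this row subsetting is
# the part that does not carry over to a Dataset
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner_pipe, X, y, cv=cv)
print(scores)
```

Each entry of `scores` is the estimator's default score (R^2 for LinearRegression) on one held-out fold.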

Nested cross validation idea - cross validating at the input samples level (e.g. filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:

def make_sample(n):
    return load_array('big_file_{}.nc'.format(n))

xarray_pipe = Pipeline([
    ('sampler', Sampler(func=make_sample)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),])
numpy_or_dask_pipe = Pipeline([
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

def nested_cross_val(outer_num_samples, inner_num_samples):
    for n in range(outer_num_samples):
        # This is the outer cross validation, e.g.
        # over file names or dates that
        # determine a sample reading from file, for example.
        # In the NLDAS ML
        sample = xarray_pipe.steps[0][1].fit_transform(n)
        for name, step in xarray_pipe.steps[1:]:
            if step is not None:  # skip the 'None' placeholder step
                sample = step.fit_transform(sample)
        X, y = sample
        for m in range(inner_num_samples):  # e.g. KFold
            # test / train split X, y
            # run the steps in the `numpy_or_dask_pipe`
            # This is the "inner cross validation"
            ...
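The two-level loop can be sketched in runnable form as follows. This is only an illustration under stated assumptions: `make_sample` here is a hypothetical stand-in that returns a synthetic (X, y) pair instead of reading a file and running the xarray steps, and a bare LinearRegression replaces the scaler/pca/estimator pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def make_sample(n):
    # Hypothetical stand-in for reading 'big_file_{n}.nc' and running
    # the xarray steps; returns a synthetic (X, y) pair instead
    rng = np.random.RandomState(n)
    X = rng.normal(size=(60, 4))
    y = X.sum(axis=1) + rng.normal(scale=0.1, size=60)
    return X, y

def nested_cross_val(outer_num_samples, inner_num_splits):
    scores = []
    for n in range(outer_num_samples):  # outer: one sample per file/date
        X, y = make_sample(n)
        inner_cv = KFold(n_splits=inner_num_splits)
        for train, test in inner_cv.split(X):  # inner: row-wise K-fold
            model = LinearRegression().fit(X[train], y[train])
            scores.append(model.score(X[test], y[test]))
    # One row per outer sample, one column per inner fold
    return np.array(scores).reshape(outer_num_samples, inner_num_splits)

scores = nested_cross_val(outer_num_samples=3, inner_num_splits=4)
print(scores.mean(axis=1))  # mean inner score for each outer sample
```

Keeping the scores as an (outer, inner) array leaves the aggregation choice (mean per sample, overall mean, worst case) to the caller.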

Nested cross validation inside evolutionary search (pseudocode):

def ea_search_cv(outer_cv, inner_cv):
    pop = initialize()
    for generation in range(ngen):
        # Each generation in evo algo
        scores = []
        for model in pop:
            # Each member of population:
            # do outer / inner cross validations and
            # accumulate two-layer cross validation scores
            scores.append(nested_cross_val(outer_cv, inner_cv))
        # The EA search chooses the
        # best parameters based on cross validation scores
        pop = select_new_population(scores)
    return pop
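A toy version of the generation loop, using only the standard library, might look like the sketch below. It is not EaSearchCV's actual algorithm: it uses simple truncation selection with Gaussian mutation, the single `alpha` parameter is invented for illustration, and `nested_cv_score` is a hypothetical stand-in for the two-layer cross validation score.

```python
import random

def nested_cv_score(params):
    # Hypothetical stand-in for a two-layer cross validation score;
    # a toy objective that peaks at alpha == 1.0
    return -(params['alpha'] - 1.0) ** 2

def ea_search(ngen=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    # Initial population of candidate parameter sets
    pop = [{'alpha': rng.uniform(0.0, 5.0)} for _ in range(pop_size)]
    for generation in range(ngen):
        # Rank the population by score, best first
        ranked = sorted(pop, key=nested_cv_score, reverse=True)
        # Truncation selection: the top half survives unchanged
        parents = ranked[:pop_size // 2]
        # Children are Gaussian perturbations of the surviving parents
        children = [{'alpha': p['alpha'] + rng.gauss(0.0, 0.1)}
                    for p in parents]
        pop = parents + children
    # Best member of the final population
    return max(pop, key=nested_cv_score)

best = ea_search()
print(best)
```

Because the parents are carried over unchanged, the best score in the population never decreases from one generation to the next.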