Open PeterDSteinberg opened 7 years ago
Here are some notes I took regarding cross validation, thinking about how to make cross validation work with xarray/xarray_filters data structures. Currently this Pipeline is failing because xarray data structures are used in the steps up to `scaler` (it runs fine as a Pipeline but fails in cross validation if used in GridSearchCV or EaSearchCV):
    pipe = Pipeline([
        ('sampler', Sampler(max_time_steps=max_time_steps)),
        ('time', Differencing(layers=FEATURE_LAYERS)),
        ('flatten', Flatten()),
        ('soil_phys', AddSoilPhysicalChemical()),
        ('drop_null', DropNaRows()),
        ('get_y', GetY(SOIL_MOISTURE)),
        ('None', None),
        ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
        ('pca', ChooseWithPreproc()),
        ('estimator', linear_model.LinearRegression(n_jobs=-1))])
In pseudocode, this is what cross validating at the Sampler level would look like:
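    for n in range(num_samples):
        # Draw one sample with the first step (the Sampler)
        dset = pipe.steps[0][1].fit_transform(**sample_args)
        # Run the remaining transformers, skipping the Sampler
        # (already applied) and the final estimator
        for name, step in pipe.steps[1:-1]:
            if step is not None:
                dset = step.fit_transform(dset)
        # Fit the final estimator on the transformed sample
        pipe.steps[-1][1].fit(dset)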
Currently sklearn cross validation iterators would support:
    for n in range(num_splits):
        # Test / train split input array X
        # Run the `scaler`, `pca`, and `estimator`
        # steps on each test/train batch
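To make the row-splitting concrete, here is a minimal sketch of what a K-fold iterator does to a 2-D input. It is a plain-Python stand-in written for illustration, not scikit-learn's implementation: `k_fold_indices` and `take_rows` are hypothetical helper names, and `X` is a toy list of rows. The point is that every fold is built by positional row subsetting, which is the operation that assumes a 2-D features matrix.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_idx, test_idx) lists of row positions for each fold."""
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Spread the remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop


def take_rows(X, idx):
    """Row subsetting: the step that needs a row-indexable, 2-D X."""
    return [X[i] for i in idx]


X = [[float(i), float(i) ** 2] for i in range(10)]  # 10 rows, 2 features
folds = list(k_fold_indices(len(X), n_splits=3))
for train_idx, test_idx in folds:
    X_train = take_rows(X, train_idx)
    X_test = take_rows(X, test_idx)
    # A real run would fit scaler -> pca -> estimator on X_train
    # and score on X_test at this point.
```

A Dataset holding several N-D arrays has no single row axis to subset this way, which is consistent with the failure described above.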
Nested cross validation idea - cross validating at the input samples level (e.g. filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:
    def make_sample(n):
        return load_array('big_file_{}.nc'.format(n))

    xarray_pipe = Pipeline([
        ('sampler', Sampler(func=make_sample)),
        ('time', Differencing(layers=FEATURE_LAYERS)),
        ('flatten', Flatten()),
        ('soil_phys', AddSoilPhysicalChemical()),
        ('drop_null', DropNaRows()),
        ('get_y', GetY(SOIL_MOISTURE)),
        ('None', None),])

    numpy_or_dask_pipe = Pipeline([
        ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
        ('pca', ChooseWithPreproc()),
        ('estimator', linear_model.LinearRegression(n_jobs=-1))])
    def nested_cross_val(outer_num_samples, inner_num_samples):
        for n in range(outer_num_samples):
            # This is the outer cross validation, e.g.
            # over file names or dates that
            # determine a sample reading from file, for example.
            # In the NLDAS ML
            sample = xarray_pipe.steps[0][1].fit_transform(n)
            for name, step in xarray_pipe.steps[1:]:
                if step is not None:
                    sample = step.fit_transform(sample)
            X, y = sample
            for k in range(inner_num_samples):  # e.g. KFold
                # test / train split X, y
                # run the steps in `numpy_or_dask_pipe`
                # This is the "inner cross validation"
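A runnable toy version of the two-level loop may help show the shape of the scores it produces. Everything here is an assumption made for illustration: `make_sample` returns synthetic `(X, y)` lists instead of reading a file, the "estimator" just predicts the training mean of `y` and ignores `X`, and the inner folds are interleaved rather than produced by a scikit-learn iterator.

```python
def make_sample(n):
    """Hypothetical stand-in for reading sample n (e.g. one file) as (X, y)."""
    X = [[float(n + i)] for i in range(6)]
    y = [2.0 * row[0] + 1.0 for row in X]
    return X, y


def fit_mean_model(y_train):
    """Toy 'estimator': predicts the mean of the training targets."""
    return sum(y_train) / len(y_train)


def mse(y_true, prediction):
    """Mean squared error of a constant prediction."""
    return sum((v - prediction) ** 2 for v in y_true) / len(y_true)


def nested_cross_val(outer_num_samples, inner_num_splits):
    """Return one list of inner-fold scores per outer sample."""
    all_scores = []
    for n in range(outer_num_samples):        # outer loop: one sample each
        X, y = make_sample(n)
        rows = list(range(len(X)))
        fold_scores = []
        for k in range(inner_num_splits):     # inner loop: K-fold style split
            test_idx = rows[k::inner_num_splits]  # interleaved test rows
            test_set = set(test_idx)
            train_idx = [i for i in rows if i not in test_set]
            model = fit_mean_model([y[i] for i in train_idx])
            fold_scores.append(mse([y[i] for i in test_idx], model))
        all_scores.append(fold_scores)
    return all_scores


scores = nested_cross_val(outer_num_samples=3, inner_num_splits=2)
```

The result is a list of lists (outer samples by inner folds), so a search routine still has to decide how to reduce it to one number per candidate model, for example by averaging.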
Nested cross validation inside evolutionary search (pseudocode):
    def ea_search_cv(outer_cv, inner_cv):
        pop = initialize()
        for generation in range(ngen):
            # Each generation in evo algo
            scores = []
            for model in pop:
                # Each member of population
                # Do outer / inner cross validations and
                # accumulate two-layer cross validation scores
                scores.append(nested_cross_val(outer_cv, inner_cv))
            # The EA search chooses the
            # best parameters based on cross validation scores
            pop = select_new_population(scores)
        return pop
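The selection loop can be sketched end to end with a toy score in place of the nested cross validation. This is not how `EaSearchCV` is implemented: `toy_cv_score`, the single `alpha` parameter, the elitist selection, and the Gaussian mutation are all assumptions chosen to keep the example short and deterministic.

```python
import random


def toy_cv_score(params):
    """Hypothetical stand-in for a nested CV score (higher is better)."""
    return -(params["alpha"] - 0.3) ** 2


def ea_search(ngen=5, pop_size=8, n_keep=4, seed=0):
    """Tiny evolutionary search; returns (best_params, best score per generation)."""
    rng = random.Random(seed)
    pop = [{"alpha": rng.uniform(0.0, 1.0)} for _ in range(pop_size)]
    history = []
    for generation in range(ngen):
        # Score every member; in the issue this is the two-layer cross validation
        ranked = sorted(pop, key=toy_cv_score, reverse=True)
        history.append(toy_cv_score(ranked[0]))
        # Selection: carry the best members over unchanged (elitism)
        parents = ranked[:n_keep]
        # Mutation: refill the population with perturbed copies of parents
        children = [
            {"alpha": rng.choice(parents)["alpha"] + rng.gauss(0.0, 0.05)}
            for _ in range(pop_size - n_keep)
        ]
        pop = parents + children
    best = max(pop, key=toy_cv_score)
    return best, history


best, history = ea_search()
```

Because the best members survive each generation unchanged, the best score in `history` can never get worse from one generation to the next.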
In `EaSearchCV` (PR #192), cross validation fails when given an MLDataset / Dataset in a Pipeline or estimator, as the cross validation logic in `dask-searchcv` and/or scikit-learn subsets rows of a dask array/dataframe or numpy array.

Things to consider:

- param to control transformers for viz. These transformers may be transforming a dataset with 1 or more `DataArray`s

Ideas: We want to allow both use styles, but cross validation can only be supported on the portion of a pipeline that has a 2D features matrix, e.g. a numpy or dask array or MLDataset/Dataset with a single `features` DataArray whose `.values` array is given to cross validation tools. For example, in a Pipeline of:

- `to_features` on the MLDataset of 4-D arrays to convert to features matrix.
- `features` 2D DataArray

Hyperparameterization should be supported for any step, but cross validation only for steps 3 through 6, which use a typical feature matrix.
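One way to picture the split between the part of a pipeline that can be cross validated and the part that cannot is to cut the list of steps at the first step that receives a 2D features matrix. The sketch below is only an illustration: `split_steps` is a hypothetical helper, the step names are borrowed from the Pipeline earlier in this issue, and the step objects are replaced by placeholder strings so it runs without the original transformer classes.

```python
def split_steps(steps, first_matrix_step):
    """Split (name, step) pairs into (before, from) the named step."""
    names = [name for name, _ in steps]
    if first_matrix_step not in names:
        raise ValueError("unknown step name: {!r}".format(first_matrix_step))
    cut = names.index(first_matrix_step)
    return steps[:cut], steps[cut:]


# Placeholder strings stand in for the real transformer / estimator objects
steps = [
    ('sampler', 'Sampler'),
    ('time', 'Differencing'),
    ('flatten', 'Flatten'),
    ('soil_phys', 'AddSoilPhysicalChemical'),
    ('drop_null', 'DropNaRows'),
    ('get_y', 'GetY'),
    ('None', None),
    ('scaler', 'ChooseWithPreproc'),
    ('pca', 'ChooseWithPreproc'),
    ('estimator', 'LinearRegression'),
]

# Steps before 'scaler' work on Dataset-like inputs; the rest see a 2D matrix
dataset_steps, matrix_steps = split_steps(steps, 'scaler')
```

Under this split, hyperparameters could still be set on any step in either list, while test/train splitting would only be applied to the inputs of `matrix_steps`.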