ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Cross validation of Pipeline/estimators using MLDataset / xarray.Dataset #221

Closed PeterDSteinberg closed 7 years ago

PeterDSteinberg commented 7 years ago

Work in progress to fix #204

PeterDSteinberg commented 7 years ago

Current status of tests (for a simple Pipeline with only one unsupervised estimator step). Most of the failures are caused by the test harness not assembling all the requisite arguments for the cross validators (e.g., not supplying a grouping variable):

test_xarray_cross_validation.py::test_each_cv[GroupKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[GroupShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[KFold] PASSED
test_xarray_cross_validation.py::test_each_cv[LeaveOneGroupOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePGroupsOut] FAILED
test_xarray_cross_validation.py::test_each_cv[LeaveOneOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePOut] FAILED
test_xarray_cross_validation.py::test_each_cv[PredefinedSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[RepeatedKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[RepeatedStratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[ShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[StratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[StratifiedShuffleSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[TimeSeriesSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[MLDatasetMixin] FAILED
test_xarray_cross_validation.py::test_each_cv[CVCacheSampleId] FAILED
PeterDSteinberg commented 7 years ago

I'm going to add more tests using pytest.mark.parametrize to better quantify which Pipeline options (e.g., supervised vs. unsupervised estimators) work with MLDataset and cross validation.
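
As a rough sketch of that parametrization (the option lists and test body below are hypothetical placeholders, not the actual contents of the test module):

import pytest
from itertools import product

# Hypothetical option lists - the real module defines its own CV_CLASSES and configs
CV_CLASSES = ['KFold', 'GroupKFold', 'ShuffleSplit']
PIPELINE_KINDS = ['unsupervised', 'supervised']

@pytest.mark.parametrize('cv_name, pipeline_kind',
                         list(product(CV_CLASSES, PIPELINE_KINDS)))
def test_each_cv(cv_name, pipeline_kind):
    # Build the Pipeline and cross validator for this combination,
    # run the search's fit, and assert the fitted search succeeds
    pass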

PeterDSteinberg commented 7 years ago

Added the label encoding / selection classes (LabelBinarizer, LabelEncoder, SelectFromModel) to the (SKIP) list in test_config.yaml. I think the test harness is not preparing the right input data for them; I haven't looked into it yet.

The new test module xarray_cross_validation.py tests cross validation in Pipelines that use xarray_filters.MLDataset (or xarray.Dataset instances that are converted to xarray_filters.MLDataset). Typically, when cross validation is used with GridSearchCV or other meta-estimators, a large tabular feature matrix is given as input and cross validation iterators from sklearn.model_selection, e.g. sklearn.model_selection.KFold, split the rows of that matrix into train / test batches.
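
For contrast, here is a minimal sketch of that conventional tabular workflow in plain scikit-learn (the estimator and parameter grid are arbitrary choices for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, KFold

# One large tabular feature matrix; KFold splits its rows into train / test batches
X = np.random.RandomState(0).rand(100, 5)
search = GridSearchCV(KMeans(n_init=10),
                      param_grid={'n_clusters': [2, 3, 4]},
                      cv=KFold(n_splits=3))
search.fit(X)
print(search.best_params_)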

When hyperparameterizing a Pipeline of operations on an MLDataset, cross validation requires that a sampler callable be passed to the EaSearchCV initialization method and an iterable of sampler arguments be passed to EaSearchCV.fit. Repeated calls to the sampler are used to form train / test batches. An outstanding issue to fix (before dask-searchcv PR 61 can be merged) is the usage of refit=True as an argument to EaSearchCV when cross validating Pipelines that use MLDataset in their steps. See the TODO note in test_xarray_cross_validation.py regarding refit=True:

refit_options = (False,) # TODO - refit is not working because
                         # it is passing sampler arguments not
                         # sampler output to the refitting
                         # of best model logic.  We need
                         # to make separate issue to figure
                         # out what "refit" means in a fitting
                         # operation of many samples - not
                         # as obvious what that should be
                         # when not CV-splitting a large matrix
                         # but rather CV-splitting input file
                         # names or other sampler arguments
test_args = product(CV_CLASSES, configs, refit_options)
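
To illustrate the sampler-based workflow described above, here is a minimal sketch; the import paths, the EaSearchCV keyword names, and the sampler signature are assumptions for illustration (taken from the description above, not verified against the elm API):

# Assumed import paths for illustration only
from elm.model_selection import EaSearchCV
from elm.pipeline import Pipeline
from xarray_filters import MLDataset
import numpy as np
import xarray as xr

def sampler(file_name):
    # Hypothetical sampler: each call builds one MLDataset sample,
    # e.g. by reading a single NetCDF / GeoTIFF file (random data here)
    arr = xr.DataArray(np.random.rand(10, 10), dims=('y', 'x'))
    return MLDataset({'band_1': arr})

# Cross validation splits this iterable of sampler arguments into
# train / test batches, rather than splitting the rows of one large matrix
sampler_args = ['sample_0.nc', 'sample_1.nc', 'sample_2.nc']

pipe = Pipeline([...])  # MLDataset-aware steps omitted; see test_xarray_cross_validation.py
search = EaSearchCV(pipe,
                    param_distributions={'est__n_clusters': [2, 3, 4]},  # hypothetical grid
                    sampler=sampler,
                    refit=False)  # refit=True is the outstanding issue noted above
search.fit(sampler_args)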

The problem above with refit=True prevents EaSearchCV.predict from running (the best estimator has not been refit for prediction). When that issue is fixed, hopefully this part of test_ea_search.py:

test_args = product(args, (None,))

can be changed to:

test_args = product(args, ('predict', None)) # Test "refit"=True and predict(...)

I'll open issues and link them here:

I'm running this PR with:

To run the tests:

cd elm/tests && py.test -m "not slow" -vvv

Test summary

============================= 126 tests deselected =============================
==== 1850 passed, 23 skipped, 126 deselected, 12 warnings in 526.52 seconds ====
PeterDSteinberg commented 7 years ago

Notes:

PeterDSteinberg commented 7 years ago

Replaced by #228