dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
BSD 3-Clause "New" or "Revised" License

cross validation for xarray_filters.MLDataset - Elm PR 221 #61

Closed PeterDSteinberg closed 7 years ago

PeterDSteinberg commented 7 years ago

[Work in progress] - This is a PR corresponding to https://github.com/ContinuumIO/elm/pull/221 - cross validation for xarray-based data structures.

TODO:

mrocklin commented 7 years ago

@PeterDSteinberg it might be wise to answer some of the more general questions above, like

More specifically, could you lay out why the changes here are necessary?

This might be useful in order to determine if these changes are even in scope for the dask-searchcv project.

PeterDSteinberg commented 7 years ago

Hi @TomAugspurger @mrocklin . A bit of background.

The docs pipeline.rst link is out of date; see test_xarray_cross_validation.py in elm PR 221 for current usage.

Why are these changes necessary?

This will allow cross validation with dask.distributed on Dataset/MLDataset structures. An example workflow: use a sampler function to load 100 netcdf files with xarray.open_mfdataset into a Dataset of 3D climate arrays; apply transformers that do statistics / preprocessing on the 3D arrays; then build a feature matrix to pass to transformers/estimators that expect a typical 2D tabular feature matrix, e.g. PCA or KMeans. In that example, the reduction from 3D arrays to 2D may involve time-series averaging / binning, and it would be nice to hyperparameterize the averaging/binning parameters.
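The workflow above can be sketched with plain xarray and scikit-learn (no elm or MLDataset required). The synthetic Dataset stands in for what a sampler function would build via xarray.open_mfdataset over many netcdf files; variable names and shapes here are illustrative, not from the PR.

```python
# Minimal sketch of the 3D-arrays -> 2D-feature-matrix workflow described
# above, under the assumption that plain xarray + scikit-learn suffice to
# illustrate the idea. Data is synthetic, standing in for open_mfdataset.
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for a sampler's output: a Dataset with a 3D (time, lat, lon) array.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), rng.normal(size=(12, 4, 5)))}
)

# Reduce 3D -> 2D by averaging over time -- the averaging/binning step whose
# parameters (e.g. which dim, what bins) one might want to hyperparameterize.
mean_2d = ds["temperature"].mean(dim="time")  # shape (lat, lon) = (4, 5)

# Flatten to a tabular feature matrix: one row per (lat, lon) pixel.
X = mean_2d.values.reshape(-1, 1)             # shape (20, 1)

# Hand the 2D matrix to an ordinary estimator such as KMeans.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

In a real pipeline the `mean(dim="time")` step would be a transformer whose parameters are part of the cross-validation search space.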

Regarding:

Scikit-learn has an issue for allowing transforms on y

elm.pipeline aims to support transformers that return X, y tuples or X alone. In the example above with feature engineering of 3D arrays, there could be several steps that pass along an MLDataset of 3D arrays, then a step that calls MLDataset.to_features() or other methods to create an X, y tuple. Passing y through a pipeline is required when using a sampler function. The changes in this PR related to _split_Xy are for passing X, y tuples, or for detecting whether a transformer returned one (as opposed to just a Dataset/Array-like X).
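The tuple-detection idea behind _split_Xy can be sketched as follows. This is only an illustration of the concept; the actual helper in this PR differs, and the function name and signature here are hypothetical.

```python
# Hypothetical sketch of the X / (X, y) detection idea: a transformer step
# may return either X alone or an (X, y) tuple (e.g. after a sampler or a
# to_features()-style step creates y). This is NOT the PR's implementation.
def split_Xy(result, y=None):
    """Normalize a transformer's output to an (X, y) pair."""
    if isinstance(result, tuple) and len(result) == 2:
        # The step produced a new y; it replaces the one passed through.
        return result
    # Plain X; keep whatever y was being threaded through the pipeline.
    return result, y

# A step that returned an (X, y) tuple:
X, y = split_Xy(([1, 2, 3], [0, 1, 0]))
# A step that returned X alone, with y passed through unchanged:
X2, y2 = split_Xy([1, 2, 3], y=[0, 1, 0])
```

A real implementation would also need to avoid misclassifying legitimate tuple-valued X, which is part of why detection lives in a dedicated helper.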

On:

How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)

elm.pipeline in Phase 1 (at the time of that .rst file) supported passing sample_weight, but it was brittle and not consistent with the cross validation ideas of sklearn or this PR. I will read up on sklearn issue 9566 (and scikit-learn/scikit-learn#4143).

I'm not in a major rush to get this merged and can work through any issues you think of.

mrocklin commented 7 years ago

I'm not in a major rush to get this merged and can work through any issues you think of.

It's not so much a timing issue as an "is this in scope" issue.

Are there ways to accomplish your goals without directly injecting elm code into this project? For example, are there things that this project should generalize, or protocols that we might adhere to, that would be a little less specific to your project but would still allow you to make progress?

PeterDSteinberg commented 7 years ago

Agreed @mrocklin - I see your point and don't want to cause scope creep / tech debt here (the time request was to look into that further). Currently the elm import is optional. I'll look at the issues above in more detail, but my impression now is that much of the logic here for splitting X, y tuples is also helpful for any pipeline where transformers can return y (unrelated to Dataset-like arguments).

PeterDSteinberg commented 7 years ago

Closing this one in favor of the less wordy solution in #61

jbednar commented 6 years ago

Probably meant "in favor of #65", as this PR is #61.