Closed. @PeterDSteinberg closed this pull request 7 years ago.
@PeterDSteinberg it might be wise to answer some of the more general questions above, like:

> More specifically, could you lay out why the changes here are necessary?

This might be useful in order to determine whether these changes are even in scope for the dask-searchcv project.
Hi @TomAugspurger @mrocklin . A bit of background.
The pipeline.rst link in the docs is out of date; see test_xarray_cross_validation.py in elm PR 221 for current usage.
Why are the changes necessary?
This will allow cross validation with dask.distributed where cross validation is done on Dataset/MLDataset structures. An example workflow would be: use a sampler function to load 100 netCDF files with xarray.open_mfdataset as a Dataset of 3D climate arrays, apply transformers that do stats/preprocessing on the 3D arrays, then build a feature matrix passed to transformers/estimators that take a typical 2D tabular feature matrix, e.g. PCA or KMeans. In that example, the reduction from 3D arrays to 2D may involve time-series averaging/binning, and it would be nice to hyperparameterize the averaging/binning parameters.
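To make the workflow concrete, here is a minimal sketch using numpy in place of xarray (the `sampler` and `time_average_to_features` names, shapes, and `bin_size` parameter are all hypothetical stand-ins for the netCDF/xarray steps described above):

```python
import numpy as np

def sampler(n_files=4, n_time=10, ny=5, nx=6, seed=0):
    # Hypothetical stand-in for loading netCDF files with
    # xarray.open_mfdataset: returns one 3D (time, y, x) climate array.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_files * n_time, ny, nx))

def time_average_to_features(arr3d, bin_size=5):
    # Reduce 3D (time, y, x) to a 2D (samples, features) matrix by
    # averaging over time bins; bin_size is the kind of parameter one
    # might want to hyperparameterize during cross validation.
    n_bins = arr3d.shape[0] // bin_size
    binned = arr3d[: n_bins * bin_size].reshape(n_bins, bin_size, -1)
    return binned.mean(axis=1)  # shape: (n_bins, ny * nx)

X = time_average_to_features(sampler(), bin_size=5)
print(X.shape)  # 40 time steps / 5 per bin = 8 rows, 5*6 = 30 features
```

The resulting 2D `X` is what would then be handed to a tabular estimator such as PCA or KMeans.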
Regarding:

> Scikit-learn has an issue for allowing transforms on y

elm.pipeline aims to support transformers that return (X, y) tuples or X alone. In the example above with feature engineering on 3D arrays, there could be several steps passing an MLDataset of 3D arrays, then a step calling MLDataset.to_features() or other methods to create an (X, y) tuple. Passing y through a pipeline is required if using a sampler function. The changes in this PR related to _split_Xy are for passing (X, y) tuples or detecting whether they have been returned by a transformer (as opposed to just Dataset/Array-like X).
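The detection logic could look roughly like the following sketch (the name `split_Xy` here is illustrative; the actual `_split_Xy` in this PR may handle more cases):

```python
def split_Xy(result, y=None):
    # Sketch of _split_Xy-style logic: if a transformer returned an
    # (X, y) tuple, unpack it; otherwise treat the result as X alone
    # and pass through any y already flowing through the pipeline.
    if isinstance(result, tuple) and len(result) == 2:
        return result  # transformer returned (X, y)
    return result, y   # Dataset/Array-like X only

X, y = split_Xy(([1, 2], [0, 1]))
print(X, y)  # [1, 2] [0, 1]
X, y = split_Xy([1, 2], y=[9, 9])
print(X, y)  # [1, 2] [9, 9]
```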
On:

> How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR is covering a lot, specifically the {'fit_params': sample_weights} bit.)
elm.pipeline in Phase 1 (the time of that .rst file) supported passing sample_weight, but it was brittle and inconsistent with the cross-validation ideas of sklearn or this PR. I will read up on that sklearn issue 9566 (and scikit-learn/scikit-learn#4143).
I'm not in a major rush to get this merged and can work through any issues you think of.
> I'm not in a major rush to get this merged and can work through any issues you think of.
It's not so much a timing issue as an "is this in scope" issue.
Are there ways to accomplish your goals without directly injecting elm code into this project? For example, are there things this project should generalize, or protocols we might adhere to, that would be a little less specific to your project but would still allow you to make progress?
Agreed, @mrocklin. I see your point; I don't want to cause scope creep or tech debt here (the time request being to look into that further). Currently the elm import is optional. I'll look at the issues above in more detail, but my impression now is that much of the logic here for splitting (X, y) tuples is also helpful for a pipeline where transformers can return y (unrelated to Dataset-like arguments).
Closing this in favor of the less wordy solution in #61.
Probably meant "in favor of #65", as this PR is #61.
[Work in progress] This is a PR corresponding to https://github.com/ContinuumIO/elm/pull/221: cross validation for xarray-based data structures.
TODO: