dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Add cross_validate helper #251

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

sklearn.model_selection.cross_validate fits and scores several models over some CV splits of data.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

Users can currently do this distributed on a cluster with

import joblib
import sklearn.model_selection

import dask_ml.joblib  # registers the 'dask' joblib backend

with joblib.parallel_backend('dask', scatter=[X, y]):
    sklearn.model_selection.cross_validate(estimator, X, y)

Why not do that for them by defining a dask_ml.model_selection.cross_validate that does the parallel_backend and scattering for them?
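A minimal sketch of what such a helper could look like (the function name, signature, and backend handling here are assumptions for illustration, not a proposed final API; the 'dask' backend additionally requires a running dask.distributed Client):

```python
import joblib
import sklearn.model_selection


def cross_validate(estimator, X, y=None, backend="dask", **kwargs):
    # Hypothetical wrapper: run sklearn's cross_validate under a joblib
    # parallel backend. With the dask backend we scatter the data so each
    # worker receives it once rather than once per fit task.
    params = {}
    if backend == "dask":
        params["scatter"] = [d for d in (X, y) if d is not None]
    with joblib.parallel_backend(backend, **params):
        return sklearn.model_selection.cross_validate(estimator, X, y, **kwargs)
```

The `backend` keyword is only there so the sketch can also be exercised with joblib's built-in backends; the point of the helper is that users would no longer write the `parallel_backend` / `scatter` boilerplate themselves.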

TomAugspurger commented 6 years ago

We would do the same for cross_val_score and cross_val_predict.

TomAugspurger commented 6 years ago

Why not do that for them...?

One reason might be if we want to implement a cross_validate that's specifically designed to work with distributed data (i.e. workloads bound by memory rather than CPU). This little wrapper would only be helpful for CPU-bound workloads. But we could presumably dispatch to our implementation when given dask collections.

gglanzani commented 6 years ago

But we could presumably dispatch to our implementation when given dask collections.

@TomAugspurger Is there a dask implementation that works specifically for dask collections? Or is one in the works / planned for some day?

TomAugspurger commented 6 years ago

Is there a dask implementation that works specifically for dask collections?

Dask-ML currently only implements ShuffleSplit for dask arrays (not dataframes).

Opened https://github.com/dask/dask-ml/issues/269 for implementing additional splitters.


For this issue, I would say that dask_ml.model_selection.cross_validate could start off by checking if is_dask_collection(X) or is_dask_collection(y) and raising NotImplementedError with a nice message.
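Concretely, that guard could look something like this (a sketch; falling through to scikit-learn for in-memory data is an assumption of this sketch, not settled behavior):

```python
import sklearn.model_selection
from dask.base import is_dask_collection


def cross_validate(estimator, X, y=None, **kwargs):
    # Proposed guard: fail loudly on dask collections until a native
    # dask implementation exists, with a message suggesting a workaround.
    if is_dask_collection(X) or is_dask_collection(y):
        raise NotImplementedError(
            "cross_validate does not yet support dask collections; "
            "call .compute() on your data first or pass NumPy/pandas objects."
        )
    # In-memory data: defer to scikit-learn (assumption for this sketch).
    return sklearn.model_selection.cross_validate(estimator, X, y, **kwargs)
```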