dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Does Dask-ML support Custom Transformers to Use in Pipeline? #134

Open MaxPowerWasTaken opened 6 years ago

MaxPowerWasTaken commented 6 years ago

Thanks for this awesome project!

I have a scikit-learn pipeline combining some custom transformers for feature engineering with a classifier at the end (xgboost). Does Dask-ML accept user-defined pipeline steps/classes like sklearn does? If so, what are the requirements (e.g. "implement fit and transform, and return a Dask DataFrame")?

And are there any classes a Dask-ML pipeline step should inherit from? (E.g. in sklearn all my custom transformers inherit from BaseEstimator in order to get get_params; see https://stackoverflow.com/a/39093021/1870832.)

The Dask-ML docs are pretty great in general but I couldn't find an answer or example on this. Sorry if I'm missing it somewhere.
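For context, here is a minimal sketch of the scikit-learn transformer contract the question is describing. The class name `LogOffsetTransformer` and its `offset` parameter are made-up illustrations, not anything from dask-ml:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LogOffsetTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: computes log(X + offset) elementwise."""

    def __init__(self, offset=1.0):
        # Store constructor args unchanged so BaseEstimator's
        # get_params / set_params (and cloning) work
        self.offset = offset

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, just return self
        return self

    def transform(self, X):
        # np.log dispatches through __array_ufunc__, so a dask array
        # input would come back as a (lazy) dask array
        return np.log(X + self.offset)
```

Inheriting `BaseEstimator` gives `get_params`/`set_params` for free, and `TransformerMixin` adds `fit_transform`; that is all a pipeline step needs.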

TomAugspurger commented 6 years ago

I think the answer is yes, but I have one clarifying question: which part of dask-ml are you using? Dask-ml doesn't have a custom pipeline object (yet); we just re-use sklearn.pipeline.Pipeline. Are you using dask_ml.model_selection.GridSearchCV, or something else?

MaxPowerWasTaken commented 6 years ago

Hey thanks for the quick response Tom. So I was missing something obvious; I should have noticed in your pipelines doc section that pipeline is still an sklearn pipeline object.

We're not using dask-ml yet but are looking to speed our current pandas/sklearn pipeline process up. We'll go ahead and try passing along dask dataframes out of each pipeline step instead of pandas dataframes for now, and then use dask_ml.model_selection.GridSearchCV or dask_ml.model_selection.RandomizedSearchCV for hyperparam search.

TomAugspurger commented 6 years ago

> We're not using dask-ml yet but are looking to speed our current pandas/sklearn pipeline process up.

Sounds good. dask_ml.model_selection.GridSearchCV and dask_ml.model_selection.RandomizedSearchCV work equally well for dask and non-dask inputs; you should get the same speedup from avoiding redundant computation either way. And they should work with any scikit-learn estimator, so your custom transformers should be fine.
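To make that concrete, here is a sketch of a custom transformer inside a pipeline being grid-searched. The `AddConstant` class and the toy data are hypothetical; the example uses sklearn's own `GridSearchCV`, for which `dask_ml.model_selection.GridSearchCV` is intended as a drop-in replacement:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
# dask_ml.model_selection.GridSearchCV can be swapped in here drop-in
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer with one tunable parameter."""

    def __init__(self, c=0.0):
        self.c = c

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + self.c


# Toy data, purely for illustration
X = np.random.RandomState(0).randn(50, 3)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([("shift", AddConstant()), ("clf", LogisticRegression())])

# Parameters of custom steps are addressable with the usual
# <step>__<param> syntax, exactly as with built-in estimators
grid = GridSearchCV(pipe, {"shift__c": [0.0, 1.0]}, cv=3)
grid.fit(X, y)
```

Because `GridSearchCV` clones each step via `get_params`, the custom step only needs to store its constructor arguments unchanged.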

Feel free to continue asking questions in this issue if you run into anything. To anticipate one issue: if you're looking to speed up your custom transformers themselves, simply passing in dask objects may or may not improve things. It may require adapting your estimator to work on dask objects, so that the dask scheduler can ensure that things run in parallel. The RobustScaler transformer may be interesting to look at here: https://github.com/dask/dask-ml/blob/92fb532ac7e3b7b4bf074aff1b78f9e6519c27dd/dask_ml/preprocessing/data.py#L126
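The pattern behind that suggestion (fit reduces the large input to a small statistic, transform is a blockwise operation that preserves the input type) can be sketched roughly as below. `MeanCenterer` is a made-up name for illustration, not part of dask-ml:

```python
import numpy as np

try:
    import dask.array as da  # optional: only needed for dask inputs
except ImportError:
    da = None


class MeanCenterer:
    """Hypothetical dask-aware transformer: subtract per-column means.

    fit() reduces the (possibly huge) input to a small statistic;
    transform() is blockwise, so dask in -> lazy dask out.
    """

    def fit(self, X, y=None):
        mean = X.mean(axis=0)  # a lazy graph if X is a dask array
        if da is not None and isinstance(mean, da.Array):
            mean = mean.compute()  # the statistic itself is small
        self.mean_ = mean
        return self

    def transform(self, X):
        # Broadcasting works identically for numpy and dask arrays,
        # and the return type matches the input type
        return X - self.mean_
```

Because `transform` uses only operations both libraries support, numpy callers pay no dask overhead and dask callers keep laziness and parallelism.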

mrocklin commented 6 years ago

I think that the general question of "what is the contract for a dask-ml style estimator?" is an interesting one. For example, with sklearn I expect to pass in and get back numpy arrays everywhere. If someone wanted to make an estimator that was compatible with existing dask-ml pipelines what types should it support and what types should it return?

TomAugspurger commented 6 years ago

> If someone wanted to make an estimator that was compatible with existing dask-ml pipelines what types should it support and what types should it return?

Point 4 in http://dask-ml.readthedocs.io/en/latest/contributing.html#conventions (I think) addresses this.

Methods returning arrays (like .transform, .predict) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.

That should be expanded with information on how dask DataFrames are handled, but my general rule has been: "if the scikit-learn return value would blow up memory on a large dataset, return a dask version instead" :)
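A rough sketch of that "same type in, same type out" convention on the predict side; `ThresholdClassifier` is a hypothetical name, not a dask-ml class:

```python
import numpy as np

try:
    import dask.array as da  # optional: only needed for dask inputs
except ImportError:
    da = None


class ThresholdClassifier:
    """Hypothetical estimator illustrating the convention that
    .predict returns the same array type (and, for dask, the same
    chunking along axis 0) as its input."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        # Elementwise comparison is blockwise: numpy in -> numpy out,
        # dask in -> lazy dask out, nothing materialized here
        return (X[:, 0] > self.threshold).astype(int)
```

A dask-backed `predict` stays lazy, so a downstream caller decides when (or whether) to materialize the potentially large prediction array.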