MaxPowerWasTaken opened this issue 6 years ago
I think the answer is yes, but I have one clarifying question: which part of dask-ml are you using? Dask-ml doesn't have a custom pipeline object (yet); we just re-use `sklearn.pipeline.Pipeline`. Are you using `dask_ml.model_selection.GridSearchCV`, or something else?
Hey, thanks for the quick response, Tom. I was missing something obvious; I should have noticed in your pipelines doc section that `pipeline` is still an sklearn Pipeline object.

We're not using dask-ml yet, but we're looking to speed up our current pandas/sklearn pipeline process. We'll go ahead and try passing dask DataFrames out of each pipeline step instead of pandas DataFrames for now, and then use `dask_ml.model_selection.GridSearchCV` or `dask_ml.model_selection.RandomizedSearchCV` for hyperparameter search.
> We're not using dask-ml yet, but we're looking to speed up our current pandas/sklearn pipeline process.
Sounds good. `dask_ml.model_selection.GridSearchCV` and `dask_ml.model_selection.RandomizedSearchCV` work equally well for dask and non-dask inputs. You should get the same speedup from avoiding redundant computation either way. And they should work on any scikit-learn estimator, so your custom transformers should be fine.
Feel free to continue asking questions in this issue if you run into anything. If I could anticipate one issue: if you're looking to speed up your custom transformers, simply passing in dask objects may or may not improve things. It may require adapting your estimator to work on dask objects, so that the dask scheduler can ensure that things run in parallel. The `RobustScaler` transformer may be interesting to look at here: https://github.com/dask/dask-ml/blob/92fb532ac7e3b7b4bf074aff1b78f9e6519c27dd/dask_ml/preprocessing/data.py#L126
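As a rough sketch of what "adapting your estimator to work on dask objects" can look like: write `fit` and `transform` against operations that both numpy and dask arrays support, so the scheduler can parallelize the reduction. `MeanCenterer` below is a hypothetical example, not a dask-ml class.

```python
# Hypothetical transformer adapted to dask: the statistic is computed with
# `.mean()`, which both numpy and dask.array provide, and materialized once
# in fit(); transform() then works lazily on dask inputs.
import numpy as np
import dask.array as da
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        mean = X.mean(axis=0)
        # Materialize the fitted statistic eagerly, dask or not
        self.mean_ = mean.compute() if hasattr(mean, "compute") else mean
        return self

    def transform(self, X):
        # Lazy for dask arrays, eager for numpy arrays
        return X - self.mean_

X = da.random.random((1000, 3), chunks=(250, 3))
centered = MeanCenterer().fit(X).transform(X)
print(type(centered))  # still a dask array; transform stayed lazy
```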
I think that the general question of "what is the contract for a dask-ml style estimator?" is an interesting one. For example, with sklearn I expect to pass in and get back numpy arrays everywhere. If someone wanted to make an estimator that was compatible with existing dask-ml pipelines, what types should it support and what types should it return?
> If someone wanted to make an estimator that was compatible with existing dask-ml pipelines, what types should it support and what types should it return?
Point 4 in http://dask-ml.readthedocs.io/en/latest/contributing.html#conventions (I think) addresses this:

> Methods returning arrays (like `.transform` or `.predict`) should return the same type as the input. So if a `dask.array` is passed in, a `dask.array` with the same chunks should be returned.

That should be clarified with information on how dask DataFrames are handled, but my general rule has been: "if the scikit-learn return value would blow up memory on a large dataset, then return a dask version instead" :)
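A quick sketch of that convention in action, under the assumption that the transform step is written with container-agnostic operations: the same expression returns an ndarray for ndarray input and a dask array (with the input's chunks) for dask input.

```python
# Same-type-in, same-type-out: numpy and dask.array both support the
# broadcasted subtraction, so one function serves both container types.
import numpy as np
import dask.array as da

def center(X, mean):
    # Works eagerly on numpy, lazily on dask
    return X - mean

X_np = np.arange(12.0).reshape(4, 3)
X_da = da.from_array(X_np, chunks=(2, 3))
mean = X_np.mean(axis=0)

out_np = center(X_np, mean)   # ndarray in, ndarray out
out_da = center(X_da, mean)   # dask array in, dask array out, same chunks
print(type(out_np).__name__, type(out_da).__name__, out_da.chunks)
```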
Thanks for this awesome project!

I have a scikit-learn pipeline combining some custom transformers for feature engineering with a classifier at the end (xgboost). Does Dask-ML accept user-defined pipeline steps/classes like sklearn does? If so, what are the requirements (e.g. "implement `fit` and `transform`, and return a Dask DataFrame")? And are there any classes a Dask-ML pipeline step should inherit from? (E.g. in sklearn, all my custom transformers inherit from `BaseEstimator` in order to get `get_params`; see https://stackoverflow.com/a/39093021/1870832.)

The Dask-ML docs are pretty great in general, but I couldn't find an answer or example on this. Sorry if I'm missing it somewhere.